AI Agents and LLM Models: Routers and Gateways for Efficient Management

Seenivasa Ramadurai - Mar 6 - Dev Community

In my current role as an AI architect, I've faced a recurring challenge: how do we effectively manage multiple Large Language Models (LLMs) when several teams are building AI agents? With so many models available, each with different strengths, it becomes crucial to use the right model for the right task. Our solution has been to implement Model Routers and Gateways – and the results have been transformative.

The Challenge of Managing Multiple LLMs

When you're working with several different LLMs across an organization, things can quickly become confusing. Teams might default to using a single familiar model for everything, even when it's not the optimal choice. This leads to inefficiency, higher costs, and sometimes poor performance. We needed a system to intelligently direct queries to the most appropriate model based on the specific requirements of each task.

Enter Model Routers: Using the Right Tool for the Right Job

A model router essentially acts as a traffic director for your AI queries. Instead of sending everything to a single model, it analyzes the query and routes it to the most suitable LLM based on several factors:

Technical Queries

When a user needs code generation or debugging help, we route these queries to technically proficient models like OpenAI's GPT. These models excel at tasks requiring a precise understanding of programming languages and technical concepts. (Anthropic's models are also a good choice for this type of query; however, we are using GPT models.)

General Queries

For everyday questions that don't require complex reasoning, we use cost-effective options like DeepSeek. Being locally hosted, it offers good performance for standard queries without the higher costs of premium API-based models. Local hosting also keeps our company data on premises, which matters because these queries may retrieve employee personal information and other internal policies.

Creative Requests

Tasks like story writing or brainstorming need models with larger context windows and strong creative capabilities. For these, we route to Google Gemini, which can maintain coherence across longer, more nuanced creative outputs.

Unknown Queries

Not every question has an answer, and that's okay. When a query falls outside our models' capabilities, our system is designed to politely acknowledge its limitations rather than providing incorrect information.
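
To make this concrete, here is a minimal sketch of the mapping described above. The model names and the route helper are illustrative only; the full router shown later in this post classifies intent with an LLM rather than a dictionary lookup.

from typing import Optional

# Illustrative sketch of the routing idea: intent -> preferred model.
# Names are placeholders, not the production configuration.
ROUTING_TABLE = {
    "technical": "gpt-4o-mini",    # code generation, debugging
    "general": "deepseek-r1",      # everyday, locally hosted queries
    "creative": "gemini-1.5-pro",  # long-context, creative requests
}

def route(intent: str) -> Optional[str]:
    """Return the model for a classified intent."""
    # None means the gateway should politely decline rather than guess.
    return ROUTING_TABLE.get(intent)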

Model Gateways: The Integration Layer

While routers direct traffic, model gateways serve as the unified interface between users and various LLMs. Think of the gateway as the central hub that connects everything together, providing several critical benefits:

Centralized Access Control

The gateway creates a single point of access for all models, allowing for better management of API keys and access tokens. This reduces security risks while simplifying administration.
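
To a team consuming the gateway, that looks roughly like the sketch below: one login, one bearer token, and no provider API keys on the client side. The host, port, and demo credentials are the ones used in the sample implementation later in this post.

import requests

# Authenticate once against the gateway; provider keys never leave the gateway.
token = requests.post(
    "http://localhost:8050/auth/login",
    json={"username": "user", "password": "user123"},
).json()["access_token"]

# Ask the gateway to pick and call the right model.
resp = requests.post(
    "http://localhost:8050/smart-model",
    params={"input_data": "How do I reset my VPN client?"},
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.json())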

Intelligent Cost Management

Different models have different pricing structures. The gateway helps manage costs by directing queries to more expensive models only when their capabilities justify the expense, while routing general queries to more affordable options.
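
As a sketch, cost-aware selection can be as simple as preferring the cheapest model among those capable of handling the query. The cost table below reuses the per-query figures from the gateway code later in this post; the numbers are placeholders, not published prices.

# Approximate per-query cost figures (placeholders, matching the sample code below).
MODEL_COSTS = {
    "gpt-4o-mini": 0.0001,
    "gemini-1.5-pro": 0.00005,
    "deepseek-r1": 0.00002,
}

def pick_cost_effective(capable_models: list) -> str:
    """Among models capable of handling the query, prefer the cheapest."""
    return min(capable_models, key=lambda m: MODEL_COSTS.get(m, float("inf")))

# pick_cost_effective(["gpt-4o-mini", "deepseek-r1"]) -> "deepseek-r1"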

Built-in Redundancy

API services occasionally experience downtime or rate limiting. A well-designed gateway includes fallback mechanisms, automatically rerouting requests to alternative models when necessary to maintain continuous service.
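
One way to sketch such a fallback chain is shown below. It assumes each handler is an async callable that raises on failure; the handlers in the implementation later in this post instead return an error status, so a real integration would check result.status rather than catching exceptions.

# Try models in order until one succeeds; the chain and handlers are supplied by the caller.
async def call_with_fallback(prompt: str, chain: list, handlers: dict):
    last_error = None
    for model_type, model_name in chain:
        try:
            return await handlers[model_type](prompt, model_name, 1000)
        except Exception as e:  # rate limit, timeout, provider outage, ...
            last_error = e
    raise RuntimeError(f"All models in the fallback chain failed: {last_error}")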

Performance Optimization

The centralized nature of the gateway makes it ideal for implementing load balancing, caching frequent responses, and monitoring overall system performance.
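
For example, a small in-process cache keyed on model and prompt looks like the sketch below; a production gateway would more likely use a shared cache such as Redis with an expiry policy.

import hashlib

# Minimal response cache: identical (model, prompt) pairs skip a second model call.
_response_cache = {}

def _cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

def get_cached(model: str, prompt: str):
    return _response_cache.get(_cache_key(model, prompt))

def set_cached(model: str, prompt: str, response: str) -> None:
    _response_cache[_cache_key(model, prompt)] = response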

Security and Usage Insights

With all traffic flowing through a single point, the gateway provides comprehensive logging for security audits and usage analytics, helping inform future resource allocation decisions.
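
A minimal sketch of that audit trail as FastAPI middleware is shown below; it assumes the app object created in the implementation later in this post and logs one line per request.

import logging
import time
import uuid

audit_logger = logging.getLogger("gateway.audit")

@app.middleware("http")
async def audit_requests(request, call_next):
    # Tag every request with an id and record path, status, and latency.
    request_id = str(uuid.uuid4())
    start = time.time()
    response = await call_next(request)
    audit_logger.info(
        "request_id=%s path=%s status=%s duration_ms=%.1f",
        request_id, request.url.path, response.status_code,
        (time.time() - start) * 1000,
    )
    return response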

The Impact on Our Organization

Implementing this router-gateway architecture has dramatically improved how we manage our AI systems. Teams no longer need to worry about which model to use for which task – the system handles that automatically. We've seen reduced costs, improved response quality, and better overall user experiences.

The system has also provided unexpected benefits in terms of scalability. As new models become available, we can easily integrate them into our architecture without disrupting existing workflows. This future-proofs our AI infrastructure as the LLM landscape continues to evolve.

If you're managing AI systems across teams or applications, consider implementing a similar router-gateway architecture to make your systems more efficient, cost-effective, and scalable.

The right model for the right job, at the right time, with the right controls – that's the foundation of a truly effective multi-model AI system.

Router and Gateway Implementation

"""
AI Model Gateway Service

This FastAPI application serves as a gateway for multiple AI models, providing a unified interface
to interact with different AI services (OpenAI, Gemini, and DeepSeek). It includes intelligent
routing capabilities to direct queries to the most appropriate model based on intent and requirements.

Key Features:
- Unified API interface for multiple AI models
- Intelligent query routing based on intent classification
- Support for OpenAI (Azure), Google Gemini, and DeepSeek models
- Health monitoring and model availability checks
- Test endpoints for each supported model
- Centralized access control and API key management
- Cost management and usage tracking
- Failover and redundancy mechanisms
- Security and audit logging

Environment Variables Required:
- AZURE_OPENAI_ENDPOINT: Azure OpenAI service endpoint
- AZURE_OPENAI_API_KEY: Azure OpenAI API key
- AZURE_OPENAI_API_VERSION: Azure OpenAI API version
- AZURE_OPENAI_DEPLOYMENT_NAME: Azure OpenAI deployment name
- GEMINI_API_KEY: Google Gemini API key
- DEEPSEEK_URL: DeepSeek API endpoint URL
- JWT_SECRET: Secret key for JWT authentication

Author: Sreeni Ramadurai 
Date: 2025-03-06
Version: 1.0.0
"""

# 1. Imports
from fastapi import FastAPI, HTTPException, Request, Depends, Security, Query
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from typing import Dict, Any, Optional, Literal, List
from enum import Enum
from datetime import datetime, timedelta
from functools import lru_cache
import os
import json
import jwt
import time
import logging
import asyncio
import uuid
import requests
import google.generativeai as genai
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI

# 2. Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# 3. Load environment variables
load_dotenv()

# 4. Initialize FastAPI app
app = FastAPI(
    title="AI Model Gateway",
    description="A gateway service for different AI models with advanced features",
    version="1.0.0"
)

# 5. Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, replace with specific origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# 6. Security settings
security = HTTPBearer()
JWT_SECRET = os.getenv("JWT_SECRET", "your-secret-key")
ACCESS_TOKEN_EXPIRE_MINUTES = 30
API_USERS = {
    "admin": "admin123",  # In production, use hashed passwords and a database
    "user": "user123"
}

# 7. Model costs
MODEL_COSTS = {
    "gpt-4o-mini": 0.0001,
    "gemini-1.5-pro": 0.00005,
    "deepseek-r1": 0.00002
}

# 8. Enums
class QueryIntent(str, Enum):
    """
    Enumeration of possible query intents for classification.

    Attributes:
        TECHNICAL: Technical support and system-related queries
        GENERAL: General information and non-specific queries
        HR: Human Resources related queries
        OUT_OF_SCOPE: Queries that are not supported by the system
    """
    TECHNICAL = "technical"
    GENERAL = "general"
    HR = "hr"
    OUT_OF_SCOPE = "out_of_scope"

class ModelCapability(str, Enum):
    """
    Enumeration of model capabilities for routing decisions.

    Attributes:
        TECHNICAL: Ability to handle technical queries
        GENERAL: Ability to handle general queries
        HR: Ability to handle HR-related queries
        LONG_CONTEXT: Ability to handle long context inputs
        COST_EFFICIENT: Cost-effective model option
    """
    TECHNICAL = "technical"
    GENERAL = "general"
    HR = "hr"
    LONG_CONTEXT = "long_context"
    COST_EFFICIENT = "cost_efficient"

# 9. Pydantic Models
class RouterConfig(BaseModel):
    """
    Configuration for model routing based on intent.

    Attributes:
        intent: The classified intent of the query
        model_type: Type of model to use (openai, gemini, deepseek)
        model_name: Specific model name/version
        max_tokens: Maximum tokens for the model
        temperature: Model temperature setting
        capabilities: List of model capabilities
    """
    intent: QueryIntent
    model_type: str
    model_name: str
    max_tokens: int
    temperature: float = 0.7
    capabilities: List[ModelCapability]

class RouterRequest(BaseModel):
    """
    Request model for the router endpoint.

    Attributes:
        input_data: The text input to process
        context_length: Length of context in tokens
        cost_sensitive: Whether to prioritize cost efficiency
    """
    input_data: str = Field(..., description="Input text to process")
    context_length: Optional[int] = Field(default=1000, description="Length of the context in tokens")
    cost_sensitive: Optional[bool] = Field(default=False, description="Whether to prioritize cost efficiency")

class RouterResponse(BaseModel):
    """
    Response model for the router endpoint.

    Attributes:
        intent: The classified intent
        model_type: Selected model type
        model_name: Selected model name
        confidence: Classification confidence score
        explanation: Explanation of the classification
    """
    intent: QueryIntent
    model_type: str
    model_name: str
    confidence: float
    explanation: str

class ModelResponse(BaseModel):
    """
    Response model for model endpoints.

    Attributes:
        status: Success or error status
        model: Model used for processing
        response: Model's response text
        error: Error message if any
    """
    status: str
    model: str
    response: Optional[str] = None
    error: Optional[str] = None

class TestRequest(BaseModel):
    """
    Request model for test endpoints.

    Attributes:
        input_data: Test input text
        max_tokens: Maximum tokens for generation
    """
    input_data: str = Field(..., description="Input text to process")
    max_tokens: Optional[int] = Field(default=1000, description="Maximum number of tokens to generate")

class ModelRequest(BaseModel):
    """
    Request model for direct model access.

    Attributes:
        input_data: Input text to process
        model_type: Type of model to use
        model_name: Specific model name
        max_tokens: Maximum tokens for generation
    """
    input_data: str = Field(..., description="Input text to process")
    model_type: str = Field(..., description="Type of model to use (openai, gemini, deepseek)")
    model_name: str = Field(..., description="Name of the specific model to use")
    max_tokens: Optional[int] = Field(default=1000, description="Maximum number of tokens to generate")

class HealthResponse(BaseModel):
    """
    Response model for health check endpoint.

    Attributes:
        status: Overall system status
        models: Dictionary of model availability status
    """
    status: str
    models: Dict[str, bool]

class UserAuth(BaseModel):
    """
    Authentication request model.

    Attributes:
        username: User's username
        password: User's password
    """
    username: str
    password: str

class Token(BaseModel):
    """
    Authentication token response model.

    Attributes:
        access_token: JWT access token
        token_type: Type of token (bearer)
    """
    access_token: str
    token_type: str

# 10. Router Configurations
ROUTER_CONFIGS = {
    QueryIntent.TECHNICAL: RouterConfig(
        intent=QueryIntent.TECHNICAL,
        model_type="openai",
        model_name="gpt-4o-mini",
        max_tokens=1001,
        capabilities=[ModelCapability.TECHNICAL]
    ),
    QueryIntent.GENERAL: RouterConfig(
        intent=QueryIntent.GENERAL,
        model_type="gemini",
        model_name="gemini-1.5-pro",
        max_tokens=4001,
        capabilities=[ModelCapability.GENERAL, ModelCapability.LONG_CONTEXT]
    ),
    QueryIntent.HR: RouterConfig(
        intent=QueryIntent.HR,
        model_type="deepseek",
        model_name="deepseek-r1",
        max_tokens=4001,
        capabilities=[ModelCapability.HR]
    ),
}

# 11. Helper Functions
async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    """
    Verify JWT token and return payload.

    Args:
        credentials: HTTP authorization credentials containing the JWT token

    Returns:
        dict: Decoded JWT payload

    Raises:
        HTTPException: If token is invalid
    """
    try:
        payload = jwt.decode(credentials.credentials, JWT_SECRET, algorithms=["HS256"])
        return payload
    except jwt.InvalidTokenError:
        raise HTTPException(
            status_code=401,
            detail="Invalid authentication token"
        )

def create_access_token(data: dict, expires_delta: Optional[timedelta] = None) -> str:
    """
    Create a new JWT access token.

    Args:
        data: Data to encode in the token
        expires_delta: Token expiration time

    Returns:
        str: Encoded JWT token
    """
    to_encode = data.copy()
    if expires_delta:
        expire = datetime.utcnow() + expires_delta
    else:
        expire = datetime.utcnow() + timedelta(minutes=15)
    to_encode.update({"exp": expire})
    encoded_jwt = jwt.encode(to_encode, JWT_SECRET, algorithm="HS256")
    return encoded_jwt

async def test_model_connectivity(model: str) -> bool:
    """
    Test connectivity to a specific model.

    Args:
        model: Name of the model to test

    Returns:
        bool: True if model is accessible, False otherwise
    """
    return True

# 12. Model Handlers
async def openai_model(input_data: str, model_name: str, max_tokens: int) -> ModelResponse:
    """
    Handle requests to OpenAI model with retry logic.

    Args:
        input_data: Input text to process
        model_name: Name of the OpenAI model to use
        max_tokens: Maximum tokens for generation

    Returns:
        ModelResponse: Response from the model
    """
    max_retries = 3
    retry_delay = 1

    for attempt in range(max_retries):
        try:
            start_time = time.time()
            client = AzureChatOpenAI(
                azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
                api_key=os.getenv("AZURE_OPENAI_API_KEY"),
                api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview"),
                azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")
            )
            response = await client.ainvoke(input_data)
            return ModelResponse(status="success", model=model_name, response=response.content)
        except Exception as e:
            if attempt == max_retries - 1:
                logger.error(f"OpenAI model error after {max_retries} attempts: {str(e)}")
                return ModelResponse(status="error", model=model_name, error=str(e))
            await asyncio.sleep(retry_delay * (attempt + 1))

async def gemini_model(input_data: str, model_name: str, max_tokens: int) -> ModelResponse:
    """
    Handle requests to Google Gemini model.

    Args:
        input_data: Input text to process
        model_name: Name of the Gemini model to use
        max_tokens: Maximum tokens for generation

    Returns:
        ModelResponse: Response from the model
    """
    try:
        genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
        model = genai.GenerativeModel(model_name)
        response = await model.generate_content_async(input_data)
        return ModelResponse(status="success", model=model_name, response=response.text)
    except Exception as e:
        return ModelResponse(status="error", model=model_name, error=str(e))

async def deepseek_model(input_data: str, model_name: str, max_tokens: int) -> ModelResponse:
    """
    Handle requests to DeepSeek model.

    Args:
        input_data: Input text to process
        model_name: Name of the DeepSeek model to use
        max_tokens: Maximum tokens for generation

    Returns:
        ModelResponse: Response from the model
    """
    try:
        deepseek_url = os.getenv("DEEPSEEK_URL", "http://localhost:8000/v1/chat/completions")
        payload = {
            "model": model_name,
            "messages": [{"role": "user", "content": input_data}],
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "stream": False
        }
        response = requests.post(deepseek_url, json=payload)
        response.raise_for_status()
        result = response.json()
        # /v1/chat/completions returns an OpenAI-style envelope: choices[0].message.content
        return ModelResponse(status="success", model=model_name, response=result["choices"][0]["message"]["content"])
    except Exception as e:
        return ModelResponse(status="error", model=model_name, error=str(e))

# 13. Intent Classification
async def classify_intent(input_data: str, context_length: int, cost_sensitive: bool) -> RouterResponse:
    """
    Classify the intent of a query and determine the appropriate model.

    Args:
        input_data: Input text to classify
        context_length: Length of context in tokens
        cost_sensitive: Whether to prioritize cost efficiency

    Returns:
        RouterResponse: Classification result with model selection
    """
    try:
        # Try OpenAI first if available
        if os.getenv("AZURE_OPENAI_API_KEY"):
            try:
                client = AzureChatOpenAI(
                    azure_deployment="gpt-4o-mini",
                    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15"),
                    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
                    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
                    temperature=0,
                )

                prompt = f"""Analyze the following query and determine its intent. Consider:

1. Is it a technical issue? (e.g., software problems, system errors, technical support)
2. Is it an HR-related query? Look for topics like:
   - Benefits (401k, health insurance, dental, vision)
   - Payroll and compensation
   - Leave policies (vacation, sick leave, FMLA)
   - Employee relations
   - Hiring and recruitment
   - Performance reviews
   - Training and development
   - Employee policies
   - Workplace accommodations
   - Employee handbook
   - HR forms and documents
3. Is it a general query? (e.g., general information, office locations, company policies)
4. Is it out of scope?

Query: {input_data}
Context length: {context_length} tokens
Cost sensitive: {cost_sensitive}

Respond in JSON format with:
- intent: one of ["technical", "general", "hr"]
- confidence: float between 0 and 1
- explanation: brief explanation of the classification
"""

                response = await client.ainvoke(prompt)
                result = response.content.strip()

                # Strip Markdown code fences if the model wrapped its JSON output
                if result.startswith('```json'):
                    result = result[7:]
                if result.endswith('```'):
                    result = result[:-3]
                result = result.strip()

                classification = json.loads(result)
                config = ROUTER_CONFIGS[QueryIntent(classification["intent"])]

                if context_length > 2001:
                    config = ROUTER_CONFIGS[QueryIntent.GENERAL]  # Use Gemini for long context
                elif cost_sensitive and config.model_type != "gemini":
                    config = ROUTER_CONFIGS[QueryIntent.GENERAL]  # Use Gemini for cost efficiency

                return RouterResponse(
                    intent=QueryIntent(classification["intent"]),
                    model_type=config.model_type,
                    model_name=config.model_name,
                    confidence=classification["confidence"],
                    explanation=classification["explanation"]
                )
            except Exception as e:
                print(f"OpenAI classification error: {str(e)}")

        # Fallback to DeepSeek if OpenAI fails
        if os.getenv("DEEPSEEK_URL"):
            try:
                deepseek_url = os.getenv("DEEPSEEK_URL")
                payload = {
                    "model": "deepseek-chat",
                    "messages": [
                        {
                            "role": "user",
                            "content": f"""Analyze the following query and determine its intent. Consider:

1. Is it a technical issue? (e.g., software problems, system errors, technical support)
2. Is it an HR-related query? Look for topics like:
   - Benefits (401k, health insurance, dental, vision)
   - Payroll and compensation
   - Leave policies (vacation, sick leave, FMLA)
   - Employee relations
   - Hiring and recruitment
   - Performance reviews
   - Training and development
   - Employee policies
   - Workplace accommodations
   - Employee handbook
   - HR forms and documents
3. Is it a general query? (e.g., general information, office locations, company policies)
4. Is it out of scope?

Query: {input_data}
Context length: {context_length} tokens
Cost sensitive: {cost_sensitive}

Respond in JSON format with:
- intent: one of ["technical", "hr", "general", "out_of_scope", "ambiguous"]
- confidence: float between 0 and 1
- explanation: brief explanation of the classification"""
                        }
                    ],
                    "max_tokens": 500,
                    "temperature": 0.3,
                    "stream": False
                }

                response = requests.post(deepseek_url, json=payload)
                response.raise_for_status()
                result = response.json()

                # /v1/chat/completions returns an OpenAI-style envelope; the prompt
                # asks for JSON, so parse the reply rather than using the raw text
                content = result["choices"][0]["message"]["content"].strip()
                if content.startswith('```json'):
                    content = content[7:]
                if content.endswith('```'):
                    content = content[:-3]
                classification = json.loads(content.strip())

                config = ROUTER_CONFIGS[QueryIntent(classification["intent"])]

                if context_length > 2001:
                    config = ROUTER_CONFIGS[QueryIntent.GENERAL]  # Use Gemini for long context
                elif cost_sensitive and config.model_type != "gemini":
                    config = ROUTER_CONFIGS[QueryIntent.GENERAL]  # Use Gemini for cost efficiency

                return RouterResponse(
                    intent=QueryIntent(classification["intent"]),
                    model_type=config.model_type,
                    model_name=config.model_name,
                    confidence=classification["confidence"],
                    explanation=classification["explanation"]
                )
            except Exception as e:
                print(f"DeepSeek classification error: {str(e)}")

        return RouterResponse(
            intent=QueryIntent.GENERAL,
            model_type="gemini",
            model_name="gemini-1.5-pro",
            confidence=0.5,
            explanation="Fallback to general model due to classification errors"
        )
    except Exception as e:
        print(f"Error in intent classification: {str(e)}")
        return RouterResponse(
            intent=QueryIntent.GENERAL,
            model_type="gemini",
            model_name="gemini-1.5-pro",
            confidence=0.5,
            explanation="Fallback to general model due to classification error"
        )

# 14. Main Endpoints
@app.get("/")
async def root():
    """
    Root endpoint returning API information and available endpoints.

    Returns:
        dict: API information and endpoint list
    """
    return {
        "name": "AI Model Gateway",
        "version": "1.0.0",
        "documentation": "/docs",
        "endpoints": {
            "model": "/model",
            "router": "/router",
            "smart_model": "/smart-model",
            "health": "/health",
            "test_openai": "/test/openai",
            "test_gemini": "/test/gemini",
            "test_deepseek": "/test/deepseek"
        }
    }

@app.post("/auth/login", response_model=Token)
async def login(user_auth: UserAuth):
    """
    Authenticate user and return JWT token.

    Args:
        user_auth: User authentication credentials

    Returns:
        Token: JWT access token

    Raises:
        HTTPException: If credentials are invalid
    """
    if user_auth.username not in API_USERS or API_USERS[user_auth.username] != user_auth.password:
        raise HTTPException(
            status_code=401,
            detail="Incorrect username or password",
            headers={"WWW-Authenticate": "Bearer"},
        )

    access_token_expires = timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    access_token = create_access_token(
        data={"sub": user_auth.username}, expires_delta=access_token_expires
    )
    return {"access_token": access_token, "token_type": "bearer"}

@app.post("/router", response_model=RouterResponse)
async def route_query(request: RouterRequest):
    """
    Route query to appropriate model based on intent classification.

    Args:
        request: Router request containing query details

    Returns:
        RouterResponse: Routing decision with model selection

    Raises:
        HTTPException: If routing fails
    """
    try:
        return await classify_intent(
            request.input_data,
            request.context_length,
            request.cost_sensitive
        )
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Routing error: {str(e)}"
        )

@app.post("/smart-model", response_model=ModelResponse)
async def smart_model(
    input_data: str = Query(..., description="Input text to process"),
    context_length: Optional[int] = Query(default=1000, description="Length of context in tokens"),
    cost_sensitive: Optional[bool] = Query(default=False, description="Whether to prioritize cost efficiency"),
    user: dict = Depends(verify_token)
):
    """
    Smart model endpoint that automatically routes queries to appropriate model.

    Args:
        input_data: Input text to process
        context_length: Length of context in tokens
        cost_sensitive: Whether to prioritize cost efficiency
        user: Authenticated user information

    Returns:
        ModelResponse: Response from the selected model

    Raises:
        HTTPException: If processing fails
    """
    try:
        if context_length <= 0:
            raise HTTPException(
                status_code=400,
                detail="context_length must be greater than 0"
            )

        routing = await classify_intent(
            input_data,
            context_length,
            cost_sensitive
        )

        max_tokens = ROUTER_CONFIGS[routing.intent].max_tokens

        if context_length > 2001:
            routing.model_type = "gemini"
            routing.model_name = "gemini-1.5-pro"
            routing.intent = QueryIntent.GENERAL
            max_tokens = 4001

        if routing.model_type == "openai" and max_tokens > 1001:
            max_tokens = 1001
        elif routing.model_type == "deepseek" and max_tokens > 4001:
            max_tokens = 4001
        elif routing.model_type == "gemini" and max_tokens > 4001:
            max_tokens = 4001

        if routing.model_type == "openai":
            result = await openai_model(
                input_data,
                routing.model_name,
                max_tokens
            )
        elif routing.model_type == "gemini":
            result = await gemini_model(
                input_data,
                routing.model_name,
                max_tokens
            )
        elif routing.model_type == "deepseek":
            result = await deepseek_model(
                input_data,
                routing.model_name,
                max_tokens
            )
        else:
            raise HTTPException(
                status_code=400,
                detail=f"Unsupported model type: {routing.model_type}"
            )

        return result
    except Exception as e:
        logger.error(f"Error in smart_model: {str(e)}")
        return ModelResponse(
            status="error",
            model="router",
            error=str(e)
        )

@app.post("/model", response_model=ModelResponse)
async def model_gateway(request: ModelRequest):
    """
    Direct model access endpoint.

    Args:
        request: Model request containing input and model details

    Returns:
        ModelResponse: Response from the specified model

    Raises:
        HTTPException: If processing fails
    """
    try:
        if request.model_type == "openai":
            result = await openai_model(
                request.input_data,
                request.model_name,
                request.max_tokens
            )
        elif request.model_type == "gemini":
            result = await gemini_model(
                request.input_data,
                request.model_name,
                request.max_tokens
            )
        elif request.model_type == "deepseek":
            result = await deepseek_model(
                request.input_data,
                request.model_name,
                request.max_tokens
            )
        else:
            raise HTTPException(
                status_code=400,
                detail=f"Unsupported model type: {request.model_type}"
            )

        if result.status == "error":
            raise HTTPException(
                status_code=500,
                detail=result.error
            )

        return result

    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Gateway error: {str(e)}"
        )

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """
    Health check endpoint to verify model availability.

    Returns:
        HealthResponse: Status of all models
    """
    model_status = {}
    for model, env_var in {
        "openai": "AZURE_OPENAI_API_KEY",
        "gemini": "GEMINI_API_KEY",
        "deepseek": "DEEPSEEK_URL"
    }.items():
        is_available = bool(os.getenv(env_var))
        if is_available:
            try:
                await test_model_connectivity(model)
            except Exception as e:
                logger.error(f"Model {model} connectivity test failed: {str(e)}")
                is_available = False
        model_status[model] = is_available

    return HealthResponse(
        status="healthy" if all(model_status.values()) else "degraded",
        models=model_status
    )

# 15. Test Endpoints
@app.post("/test/openai", response_model=ModelResponse)
async def test_openai(request: TestRequest):
    """
    Test endpoint for OpenAI model.

    Args:
        request: Test request with input data

    Returns:
        ModelResponse: Response from OpenAI model

    Raises:
        HTTPException: If test fails
    """
    try:
        result = await openai_model(
            request.input_data,
            "gpt-4",
            request.max_tokens
        )

        if result.status == "error":
            raise HTTPException(
                status_code=500,
                detail=result.error
            )

        return result
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"OpenAI test error: {str(e)}"
        )

@app.post("/test/gemini", response_model=ModelResponse)
async def test_gemini(request: TestRequest):
    """
    Test endpoint for Gemini model.

    Args:
        request: Test request with input data

    Returns:
        ModelResponse: Response from Gemini model

    Raises:
        HTTPException: If test fails
    """
    try:
        result = await gemini_model(
            request.input_data,
            "gemini-2.0-flash",
            request.max_tokens
        )

        if result.status == "error":
            raise HTTPException(
                status_code=500,
                detail=result.error
            )

        return result
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Gemini test error: {str(e)}"
        )

@app.post("/test/deepseek", response_model=ModelResponse)
async def test_deepseek(request: TestRequest):
    """
    Test endpoint for DeepSeek model.

    Args:
        request: Test request with input data

    Returns:
        ModelResponse: Response from DeepSeek model

    Raises:
        HTTPException: If test fails
    """
    try:
        result = await deepseek_model(
            request.input_data,
            "deepseek-chat",
            request.max_tokens
        )

        if result.status == "error":
            raise HTTPException(
                status_code=500,
                detail=result.error
            )

        return result
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"DeepSeek test error: {str(e)}"
        )

# 16. Main entry point
if __name__ == '__main__':
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8050) 

Gateway Endpoints

Invoking model with context

Output

General Query or Prompt

Router routes to the locally hosted LLM (DeepSeek)

Note

Implementing an Effective Intent Classification System

The key to a successful router is accurate intent classification. Here's how we've optimized our classification function with few-shot prompting:

# Intent Classification Prompt Template
def classify_intent(user_query):
    """
    Classify user query intent using few-shot examples to achieve high confidence scores.

    Input: User query text
    Output: Intent classification with confidence score

    Example few-shot prompts:

    Query: "Can you help me debug this Python function?"
    Intent: TECHNICAL
    Confidence: 0.95
    Reasoning: Contains programming language reference and technical task request.

    Query: "How do I set up a CI/CD pipeline for my Node.js application?"
    Intent: TECHNICAL
    Confidence: 0.97
    Reasoning: Involves DevOps implementation and programming framework specifics.

    Query: "Explain the error in this SQL query: SELECT * FROM users WHERE username = 'john' AND AND email = 'john@example.com'"
    Intent: TECHNICAL
    Confidence: 0.98
    Reasoning: Contains specific database query syntax and error identification request.

    Query: "What's the capital of France?"
    Intent: GENERAL
    Confidence: 0.98
    Reasoning: Simple factual question requiring basic knowledge.

    Query: "Can you summarize the key points of climate change?"
    Intent: GENERAL
    Confidence: 0.94
    Reasoning: Requests information synthesis on a general knowledge topic without specialized expertise.

    Query: "What are the main differences between capitalism and socialism?"
    Intent: GENERAL
    Confidence: 0.96
    Reasoning: Comparative analysis question on broad economic/political systems.

    Query: "What's the process for requesting time off in our company?"
    Intent: HR
    Confidence: 0.93
    Reasoning: Involves company policy related to employee leave and HR procedures.

    Query: "How should I prepare for my annual performance review?"
    Intent: HR
    Confidence: 0.91
    Reasoning: Related to employee evaluation process and professional development.

    Query: "What are best practices for addressing conflicts between team members?"
    Intent: HR
    Confidence: 0.89
    Reasoning: Involves workplace relationship management and conflict resolution.
""" 

By providing diverse examples with explicit reasoning, our router achieves higher accuracy in intent classification. We've included examples in multiple languages and with varying complexity to handle edge cases.

Thanks
Sreeni Ramadorai
