System Architecture

The Semantic Router implements a Mixture-of-Models (MoM) architecture built on Envoy Proxy, with an External Processor (ExtProc) service that makes the routing decisions. Envoy handles traffic management, while the ExtProc service handles classification, security checks, and caching. Keeping these concerns in separate layers lets each one be scaled and maintained independently in production LLM deployments.

High-Level Architecture Overview

graph TB
    subgraph "Client Layer"
        Client1[Web Application]
        Client2[Mobile App]
        Client3[API Client]
        Client4[Third-party Integration]
    end

    subgraph "Proxy Layer"
        Envoy[Envoy Proxy<br/>:8801]
    end

    subgraph "Processing Layer"
        ExtProc[Semantic Router<br/>ExtProc Server<br/>:50051]

        subgraph "Router Components"
            Classifier[BERT Classifier<br/>ModernBERT]
            PIIDetector[PII Detector<br/>Privacy Protection]
            JailbreakGuard[Jailbreak Guard<br/>Security]
            Cache[Semantic Cache<br/>Performance]
            ToolsSelector[Tools Selector<br/>Optimization]
        end
    end

    subgraph "Model Layer"
        Model1[Math Specialist<br/>Endpoint 1]
        Model2[Creative Model<br/>Endpoint 2] 
        Model3[Code Generator<br/>Endpoint 3]
        ModelN[General Purpose<br/>Endpoint N]
    end

    subgraph "Monitoring Layer"
        Prometheus[Prometheus<br/>Metrics]
        Grafana[Grafana<br/>Dashboard]
        Logs[Structured Logging]
    end

    Client1 --> Envoy
    Client2 --> Envoy
    Client3 --> Envoy
    Client4 --> Envoy

    Envoy <--> ExtProc

    ExtProc --> Classifier
    ExtProc --> PIIDetector
    ExtProc --> JailbreakGuard
    ExtProc --> Cache
    ExtProc --> ToolsSelector

    Envoy --> Model1
    Envoy --> Model2
    Envoy --> Model3
    Envoy --> ModelN

    ExtProc --> Prometheus
    Prometheus --> Grafana
    ExtProc --> Logs

Core Components

1. Envoy Proxy - Traffic Management Layer

Role: Acts as the entry point and traffic director for all LLM requests.

Key Responsibilities:

- Load Balancing: Distributes requests across backend model endpoints
- Health Checking: Monitors backend model availability and health
- Request/Response Processing: Handles HTTP protocol management
- Header Management: Manages routing headers set by the ExtProc service
- Timeout Management: Configures appropriate timeouts for different model types

Configuration Highlights:

# Envoy listener configuration
listeners:
- name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8801  # Main entry point

http_filters:
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: extproc_service
    processing_mode:
      request_header_mode: "SEND"      # Send headers for routing decisions
      response_header_mode: "SEND"     # Process response headers  
      request_body_mode: "BUFFERED"    # Analyze request content
      response_body_mode: "BUFFERED"   # Process response content

2. Semantic Router ExtProc Service - Intelligence Layer

Role: The brain of the system that makes intelligent routing decisions.

Architecture:

type OpenAIRouter struct {
    Config               *config.RouterConfig
    CategoryDescriptions []string
    Classifier           *classification.Classifier    // ModernBERT-based
    PIIChecker           *pii.PolicyChecker           // Privacy protection
    Cache                *cache.SemanticCache         // Performance optimization
    ToolsDatabase        *tools.ToolsDatabase         // Tool selection

    pendingRequests     map[string][]byte             // Request tracking
    pendingRequestsLock sync.Mutex                    // Thread safety
}

Processing Pipeline:

sequenceDiagram
    participant E as Envoy
    participant R as Router
    participant C as Classifier
    participant P as PII Detector
    participant G as Guard
    participant Ca as Cache

    E->>R: Request Headers + Body
    R->>Ca: Check semantic cache

    alt Cache Hit
        Ca->>R: Cached response
        R->>E: Return cached result
    else Cache Miss
        R->>P: Scan for PII
        P->>R: PII status
        R->>G: Check for jailbreak
        G->>R: Safety status
        R->>C: Classify intent
        C->>R: Routing decision
        R->>E: Set routing headers
        Note over E: Route to selected model
        E->>R: Response from model
        R->>Ca: Cache semantic representation
        R->>E: Final response
    end
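
The ordering in the diagram above can be sketched in Go as follows. This is a minimal illustration rather than the router's actual code: the `Pipeline` and `Decision` types and the `HandleRequest` method are hypothetical, each stage is reduced to a plain function, and the sketch assumes that any PII or jailbreak finding blocks the request.

```go
package main

import "errors"

// ErrBlocked is returned when a security check rejects the request.
var ErrBlocked = errors.New("request blocked by security checks")

// Pipeline wires the stages as plain functions (hypothetical stand-ins for
// the cache, PII detector, jailbreak guard and classifier).
type Pipeline struct {
	CacheLookup func(query string) (response string, hit bool)
	HasPII      func(query string) bool
	IsJailbreak func(query string) bool
	Classify    func(query string) (model string)
}

// Decision is what the router hands back to Envoy: either a cached
// response or the name of the model to route to.
type Decision struct {
	CachedResponse string
	CacheHit       bool
	SelectedModel  string
}

// HandleRequest runs the stages in the order shown in the sequence diagram:
// cache check first, then PII, then jailbreak, and finally classification.
func (p *Pipeline) HandleRequest(query string) (Decision, error) {
	if resp, hit := p.CacheLookup(query); hit {
		return Decision{CachedResponse: resp, CacheHit: true}, nil
	}
	// Assumption: both findings block the request; a real policy may mask instead.
	if p.HasPII(query) || p.IsJailbreak(query) {
		return Decision{}, ErrBlocked
	}
	return Decision{SelectedModel: p.Classify(query)}, nil
}
```

Checking the cache first means that repeated queries skip all three model inferences, which is where most of the routing latency is spent.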

3. Classification System - Decision Engine

The classification system uses ModernBERT models for multiple classification tasks:

Category Classification

graph LR
    Query[User Query] --> Tokenizer[ModernBERT Tokenizer]
    Tokenizer --> Encoder[ModernBERT Encoder<br/>768-dim embeddings]
    Encoder --> ClassifierHead[Classification Head<br/>Category Prediction]
    ClassifierHead --> Decision[Routing Decision]

    subgraph "Categories"
        Math[Mathematics]
        Creative[Creative Writing]
        Code[Code Generation]  
        General[General Purpose]
        Science[Science]
        Business[Business]
    end

    Decision --> Math
    Decision --> Creative  
    Decision --> Code
    Decision --> General
    Decision --> Science
    Decision --> Business

Multi-Task Architecture

# Conceptual model architecture
class SemanticRouter:
    def __init__(self):
        self.category_classifier = ModernBERTForSequenceClassification(
            num_labels=10  # Math, Creative, Code, etc.
        )
        self.pii_detector = ModernBERTForTokenClassification(
            num_labels=6   # PERSON, EMAIL, PHONE, SSN, LOCATION, NO_PII
        )
        self.jailbreak_guard = ModernBERTForSequenceClassification(
            num_labels=2   # Benign, Jailbreak
        )

    def route_request(self, query):
        # Multi-task inference
        category = self.category_classifier(query)
        pii_entities = self.pii_detector(query)  
        safety_score = self.jailbreak_guard(query)

        return self.make_routing_decision(category, pii_entities, safety_score)
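
The conceptual `make_routing_decision` step is not spelled out above. One plausible shape for it, written in Go, is sketched below. The `Prediction` and `RoutingPolicy` types, the `SelectModel` method, and the rule of falling back to a general model on low confidence are all assumptions made for illustration, not the router's confirmed logic.

```go
package main

// Prediction is a hypothetical classifier output: a category and its confidence.
type Prediction struct {
	Category   string
	Confidence float64
}

// RoutingPolicy holds illustrative settings; the field names are assumptions.
type RoutingPolicy struct {
	ConfidenceThreshold float64
	FallbackModel       string
	ModelByCategory     map[string]string
}

// SelectModel combines the three task outputs into one decision.
// It returns allowed=false when a security check fails, the fallback model
// when the classifier is unsure or the category is unmapped, and the
// category's model otherwise.
func (rp RoutingPolicy) SelectModel(p Prediction, piiFound, jailbreak bool) (model string, allowed bool) {
	if piiFound || jailbreak {
		return "", false
	}
	if p.Confidence < rp.ConfidenceThreshold {
		return rp.FallbackModel, true
	}
	if m, ok := rp.ModelByCategory[p.Category]; ok {
		return m, true
	}
	return rp.FallbackModel, true
}
```

With a threshold of 0.7, a "math" prediction at 0.9 confidence goes to the math model, while the same prediction at 0.5 goes to the fallback.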

Data Flow Architecture

Request Processing Flow

graph TB
    Start([Client Request]) --> EnvoyReceive[Envoy Receives Request]

    EnvoyReceive --> ExtProcSend[Send to ExtProc<br/>Headers + Body]

    ExtProcSend --> CacheCheck{Semantic Cache<br/>Check}

    CacheCheck -->|Hit| CacheReturn[Return Cached Response]
    CacheCheck -->|Miss| SecurityCheck[Security Checks]

    SecurityCheck --> PIICheck[PII Detection<br/>ModernBERT Token Classification]
    PIICheck --> JailbreakCheck[Jailbreak Detection<br/>ModernBERT Binary Classification]

    JailbreakCheck --> SecurityDecision{Security<br/>Assessment}
    SecurityDecision -->|Block| BlockRequest[Block Request<br/>Return Error]
    SecurityDecision -->|Allow| CategoryClassification[Category Classification<br/>ModernBERT Sequence Classification]

    CategoryClassification --> ToolsSelection[Tools Auto-Selection<br/>Reduce Token Usage]

    ToolsSelection --> RoutingDecision[Make Routing Decision<br/>Select Optimal Model]

    RoutingDecision --> SetHeaders[Set Routing Headers<br/>x-gateway-destination-endpoint<br/>x-selected-model]

    SetHeaders --> EnvoyRoute[Envoy Routes to<br/>Selected Backend]

    EnvoyRoute --> ModelResponse[Model Processes<br/>Request]

    ModelResponse --> ResponseProcess[Process Response<br/>via ExtProc]

    ResponseProcess --> CacheStore[Store in Semantic Cache<br/>for Future Requests]

    CacheStore --> FinalResponse[Return Response<br/>to Client]

    CacheReturn --> FinalResponse
    BlockRequest --> End([End])
    FinalResponse --> End

    style SecurityCheck fill:#ffeb3b
    style PIICheck fill:#ff9800  
    style JailbreakCheck fill:#f44336
    style CategoryClassification fill:#4caf50
    style CacheCheck fill:#2196f3
    style RoutingDecision fill:#9c27b0
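
The "Set Routing Headers" step in the flow above amounts to attaching two headers to the request before Envoy forwards it. A minimal sketch follows. The `RoutingResult` type and `routingHeaders` function are hypothetical, and the assumption that the endpoint header carries a `host:port` value is not confirmed by this document.

```go
package main

// RoutingResult is a hypothetical summary of a routing decision.
type RoutingResult struct {
	Endpoint string // backend address chosen for the request (assumed host:port)
	Model    string // model name chosen for the request
}

// routingHeaders builds the two headers named in the request flow.
// Envoy can then match on them to pick the upstream cluster.
func routingHeaders(r RoutingResult) map[string]string {
	return map[string]string{
		"x-gateway-destination-endpoint": r.Endpoint,
		"x-selected-model":               r.Model,
	}
}
```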

Response Processing Flow

sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant R as Router
    participant M as Selected Model
    participant Ca as Cache
    participant Me as Metrics

    C->>E: HTTP Request
    E->>R: ExtProc Request (Headers + Body)

    Note over R: Process request (PII, Security, Classification)

    R->>E: ExtProc Response (Routing Headers)
    E->>M: Route to Selected Model
    M->>E: Model Response
    E->>R: ExtProc Response Processing

    R->>Ca: Store semantic representation
    R->>Me: Record routing metrics

    R->>E: Processed Response
    E->>C: Final Response to Client

Threading and Concurrency Model

Go ExtProc Server Concurrency

// Server handles multiple concurrent connections
func (s *Server) Start() error {
    lis, err := net.Listen("tcp", fmt.Sprintf(":%d", s.port))
    if err != nil {
        return fmt.Errorf("failed to listen on port %d: %w", s.port, err)
    }

    s.server = grpc.NewServer()
    ext_proc.RegisterExternalProcessorServer(s.server, s.router)

    // gRPC handles concurrency automatically:
    // each ExtProc stream (one per HTTP request) is served on its own goroutine
    return s.server.Serve(lis)
}

// Process handles individual request streams (requires the "io" import)
func (r *OpenAIRouter) Process(stream ext_proc.ExternalProcessor_ProcessServer) error {
    // Each stream runs in its own goroutine
    ctx := &RequestContext{
        Headers: make(map[string]string),
    }
    _ = ctx // per-stream state for the handlers elided below

    for {
        req, err := stream.Recv()
        if err == io.EOF {
            return nil // Envoy closed the stream
        }
        if err != nil {
            return err
        }

        // Process request with thread-safe operations
        switch req.Request.(type) {
        case *ext_proc.ProcessingRequest_RequestHeaders:
            // Handle request headers
        case *ext_proc.ProcessingRequest_RequestBody:
            // Handle request body - where classification happens
        case *ext_proc.ProcessingRequest_ResponseHeaders:
            // Handle response headers
        case *ext_proc.ProcessingRequest_ResponseBody:
            // Handle response body, e.g. store it in the semantic cache
        }
    }
}

Thread Safety Considerations

type OpenAIRouter struct {
    // Thread-safe components
    Classifier           *classification.Classifier    // Read-only after init
    PIIChecker           *pii.PolicyChecker           // Read-only after init
    Cache                *cache.SemanticCache         // Internally synchronized

    // Mutable state with protection
    pendingRequests     map[string][]byte
    pendingRequestsLock sync.Mutex                    // Protects pendingRequests
}

// Thread-safe request tracking
func (r *OpenAIRouter) trackRequest(id string, body []byte) {
    r.pendingRequestsLock.Lock()
    defer r.pendingRequestsLock.Unlock()
    r.pendingRequests[id] = body
}

Performance Characteristics

Latency Analysis

Component	Typical Latency	Optimization
Envoy Routing	0.5-2ms	Optimized configuration
ExtProc gRPC	1-3ms	Local network communication
PII Detection	5-15ms	ModernBERT token classification
Jailbreak Guard	3-8ms	ModernBERT binary classification
Category Classification	8-20ms	ModernBERT sequence classification
Cache Lookup	0.1-0.5ms	Redis/in-memory cache
Total Overhead	15-50ms	Acceptable for most use cases

Throughput Optimization

// Batch processing for efficiency
type BatchProcessor struct {
    batchSize    int
    batchTimeout time.Duration
    classifier   *classification.Classifier
}

func (bp *BatchProcessor) processBatch(queries []string) []Classification {
    // Process multiple queries together for better GPU utilization
    return bp.classifier.ClassifyBatch(queries)
}

Memory Usage

Component	Memory Usage	Notes
ModernBERT Models	~400MB each	Loaded once, shared across requests
Envoy Process	~100-200MB	Depends on configuration
Go ExtProc Server	~50-100MB	Scales with concurrent requests
Semantic Cache	~500MB-2GB	Configurable, depends on cache size
Total System	~1.5-3GB	Reasonable for production deployment

Configuration Management

Router Configuration Structure

# config/config.yaml
router:
  # Model endpoints configuration
  endpoints:
    endpoint1:
      url: "http://192.168.12.90:11434"
      model_type: "math"
      cost_per_token: 0.002
      max_tokens: 4096

    endpoint2:
      url: "http://192.168.12.91:11434" 
      model_type: "creative"
      cost_per_token: 0.003
      max_tokens: 8192

    endpoint3:
      url: "http://192.168.12.92:11434"
      model_type: "general"
      cost_per_token: 0.01
      max_tokens: 4096

  # Classification thresholds
  classification:
    confidence_threshold: 0.7
    fallback_model: "general"

  # Security settings
  security:
    enable_pii_detection: true
    enable_jailbreak_guard: true
    pii_action: "block"  # block, mask, or allow

  # Caching configuration
  cache:
    enabled: true
    similarity_threshold: 0.85
    ttl_seconds: 3600
    max_entries: 10000

  # Tools configuration
  tools:
    auto_selection: true
    max_tools: 5
    relevance_threshold: 0.6

Dynamic Configuration Updates

// Configuration hot-reloading
type ConfigManager struct {
    config     *RouterConfig
    configLock sync.RWMutex
    watchers   []ConfigWatcher
}

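func (cm *ConfigManager) UpdateConfig(newConfig *RouterConfig) error {
    // Validate before taking the lock so readers are not blocked by validation
    if err := newConfig.Validate(); err != nil {
        return err
    }

    // Apply configuration and snapshot the watcher list under the write lock
    cm.configLock.Lock()
    cm.config = newConfig
    watchers := append([]ConfigWatcher(nil), cm.watchers...)
    cm.configLock.Unlock()

    // Notify watchers outside the lock so a callback that reads the
    // configuration cannot deadlock against this update
    for _, watcher := range watchers {
        watcher.OnConfigUpdate(newConfig)
    }

    return nil
}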

Error Handling and Resilience

Circuit Breaker Pattern

type CircuitBreaker struct {
    maxFailures   int
    resetTimeout  time.Duration
    state         CircuitState
    failures      int
    lastFailTime  time.Time
    mutex         sync.Mutex
}

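func (cb *CircuitBreaker) Call(operation func() error) error {
    // Check and update the state under the lock
    cb.mutex.Lock()
    if cb.state == StateOpen {
        if time.Since(cb.lastFailTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
        } else {
            cb.mutex.Unlock()
            return errors.New("circuit breaker is open")
        }
    }
    cb.mutex.Unlock()

    // Run the operation without holding the lock so that concurrent
    // calls through the breaker are not serialized
    err := operation()

    // onFailure and onSuccess expect the lock to be held
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    if err != nil {
        cb.onFailure()
    } else {
        cb.onSuccess()
    }

    return err
}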

Fallback Strategies

graph TB
    Request[Incoming Request] --> PrimaryRoute[Primary Routing Decision]

    PrimaryRoute --> ModelA{Model A<br/>Available?}
    ModelA -->|Yes| ProcessA[Process with Model A]
    ModelA -->|No| FallbackB{Try Model B<br/>Fallback}

    FallbackB -->|Available| ProcessB[Process with Model B]
    FallbackB -->|Unavailable| FallbackGeneral{Try General<br/>Model}

    FallbackGeneral -->|Available| ProcessGeneral[Process with General Model]
    FallbackGeneral -->|Unavailable| CachedResponse{Check Cache<br/>for Similar}

    CachedResponse -->|Found| ReturnCached[Return Cached Response]
    CachedResponse -->|Not Found| ErrorResponse[Return Error<br/>Service Unavailable]

    ProcessA --> Success[Successful Response]
    ProcessB --> Success
    ProcessGeneral --> Success
    ReturnCached --> Success
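
The fallback diagram above reduces to walking an ordered list of backends and consulting the cache as a last resort. The sketch below expresses that in Go. The `Backend` type, the `resolve` function, and the string results are hypothetical and only illustrate the control flow.

```go
package main

import "errors"

// ErrServiceUnavailable mirrors the final "Service Unavailable" branch.
var ErrServiceUnavailable = errors.New("service unavailable")

// Backend is a hypothetical model endpoint with an availability probe.
type Backend struct {
	Name      string
	Available func() bool
}

// resolve tries the backends in priority order (primary, fallback, general),
// then a cached response for a similar query, and finally gives up.
// It returns a label describing where the response will come from.
func resolve(chain []Backend, cachedResponse func() (string, bool)) (string, error) {
	for _, b := range chain {
		if b.Available() {
			return "model:" + b.Name, nil
		}
	}
	if _, ok := cachedResponse(); ok {
		return "cache", nil
	}
	return "", ErrServiceUnavailable
}
```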

Monitoring and Observability

Metrics Collection

// Prometheus metrics
var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "semantic_router_requests_total",
            Help: "Total number of requests processed",
        },
        []string{"endpoint", "category", "status"},
    )

    routingLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "semantic_router_routing_duration_seconds", 
            Help: "Time spent on routing decisions",
            Buckets: prometheus.DefBuckets,
        },
        []string{"component"},
    )

    cacheHitRatio = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "semantic_router_cache_hit_ratio",
            Help: "Cache hit ratio for semantic cache",
        },
        []string{"cache_type"},
    )
)

Structured Logging

type RequestLogger struct {
    logger *logrus.Logger
}

func (rl *RequestLogger) LogRouting(ctx context.Context, decision *RoutingDecision) {
    rl.logger.WithFields(logrus.Fields{
        "request_id":      ctx.Value("request_id"),
        "category":        decision.Category,
        "confidence":      decision.Confidence,
        "selected_model":  decision.SelectedModel,
        "routing_time_ms": decision.ProcessingTime.Milliseconds(),
        "pii_detected":    decision.PIIDetected,
        "jailbreak_risk":  decision.JailbreakRisk,
        "cache_hit":       decision.CacheHit,
        "tools_selected":  len(decision.SelectedTools),
    }).Info("Request routed")
}

This architecture provides a robust, scalable, and maintainable foundation for intelligent LLM routing. The next section covers the Envoy ExtProc Integration in detail, explaining how the ExtProc protocol works and how our router implements it.