Inference & Model Routing

Srasta routes each persona to the right model under governance.

Srasta separates the operator decision from the runtime plumbing. Teams choose where inference runs and what each persona needs; Srasta turns that into governed LiteLLM routes, vLLM or Ollama runtime settings, fallback behavior, model-access controls, and smoke tests.

Routing Path

One gateway, many possible inference backends.

A caller can ask for coding, business, general, or fast. Those are stable model aliases. The backend behind each alias can be local GPU inference, CPU fallback, external self-hosted inference, or a hosted API.
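
As a rough illustration of the alias idea, the sketch below maps the four aliases to interchangeable backends. The `Route` type, the backend class names, and the model identifiers are hypothetical placeholders, not Srasta's actual schema or defaults.

```python
from dataclasses import dataclass

# Hypothetical labels for the four backend options named in the text.
BACKEND_CLASSES = {"local_gpu", "local_cpu", "external_self_hosted", "hosted_api"}


@dataclass(frozen=True)
class Route:
    alias: str          # stable name that callers request
    backend_class: str  # one of BACKEND_CLASSES
    model: str          # placeholder model identifier, not a Srasta default


def build_alias_table(routes):
    """Index routes by alias, rejecting unknown backend classes and duplicates."""
    table = {}
    for route in routes:
        if route.backend_class not in BACKEND_CLASSES:
            raise ValueError(f"unknown backend class: {route.backend_class}")
        if route.alias in table:
            raise ValueError(f"duplicate alias: {route.alias}")
        table[route.alias] = route
    return table


def resolve(table, alias):
    """Return the route for an alias; callers never name the model directly."""
    try:
        return table[alias]
    except KeyError:
        raise LookupError(f"no route configured for alias {alias!r}") from None


ROUTES = build_alias_table([
    Route("coding", "local_gpu", "example-coder-model"),
    Route("business", "hosted_api", "example-hosted-model"),
    Route("general", "external_self_hosted", "example-general-model"),
    Route("fast", "local_cpu", "example-small-model"),
])
```

Swapping the model behind `coding` changes only the table entry; code that calls `resolve(ROUTES, "coding")` stays the same.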

[Figure: Srasta inference routing path (governed inference route). Srasta API policy gate → LiteLLM alias router → vLLM (GPU), Ollama (CPU), external self-hosted, or hosted API → response + audit. Aliases: coding · business · general · fast. Fallbacks and retries stay inside the routing layer.]

Provider Classes

Operators choose the inference class before choosing the model.

The installer treats inference as a deployment decision. Local inference keeps prompts and responses inside the Srasta cluster. External inference can be self-hosted or hosted, but it creates an explicit data-egress decision that operators must acknowledge.

Bundled vLLM

Srasta deploys a GPU-backed OpenAI-compatible vLLM runtime on the selected GPU host.

Bundled Ollama

CPU-only fallback for trials, dev, smoke tests, and low-resource installations.

External self-hosted

An operator-provided vLLM, NVIDIA NIM, or generic OpenAI-compatible endpoint.

Hosted API

Provider APIs such as Anthropic, OpenAI, Hugging Face, Together, or Fireworks through LiteLLM.
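
The egress decision described above could be enforced with a check along these lines. This is a minimal sketch: the class labels, the `egress_acknowledged` flag, and the exception name are assumptions made for illustration, not the installer's real interface.

```python
# Hypothetical labels for the four provider classes listed above.
LOCAL_CLASSES = {"bundled_vllm", "bundled_ollama"}
EXTERNAL_CLASSES = {"external_self_hosted", "hosted_api"}


class EgressNotAcknowledged(Exception):
    """Raised when an external class is chosen without acknowledgment."""


def validate_inference_choice(inference_class, egress_acknowledged=False):
    """Return a summary of the choice, or refuse unacknowledged egress."""
    if inference_class in LOCAL_CLASSES:
        # Prompts and responses stay inside the cluster.
        return {"inference_class": inference_class, "data_leaves_cluster": False}
    if inference_class in EXTERNAL_CLASSES:
        if not egress_acknowledged:
            raise EgressNotAcknowledged(
                f"{inference_class} sends prompts and responses outside the cluster"
            )
        return {"inference_class": inference_class, "data_leaves_cluster": True}
    raise ValueError(f"unknown inference class: {inference_class}")
```

Making acknowledgment a required argument for external classes, rather than a warning, keeps the data-boundary decision explicit in whatever configuration the operator signs off on.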

Personas

Persona aliases make routing stable while models evolve.

coding

Code and tool-heavy work

Routes to coding-capable models and requires tool-call parser correctness for agentic workflows.

business

Structured enterprise reasoning

Optimized for multi-document reasoning, summaries, structured outputs, and decision support.

general

Balanced chat and reasoning

Absorbs everyday chat and general-purpose workloads with a quality/latency balance.

fast

Low-latency fallback

Lightweight route used for quick responses, fallback chains, and constrained installations.

Recommendation Engine

Model fit is a function of persona, hardware, volume, and constraints.

The installer does more than ask whether a model can load. It evaluates fit against the operator's environment: GPU and RAM, inference class, deployment intent, expected concurrency, latency and quality targets, and cost constraints for hosted providers.

Ranking dimensions

  • Hardware fit and VRAM envelope.
  • Latency targets such as TTFT and token throughput.
  • Persona-specific quality requirements.
  • Throughput and peak concurrency fit.
  • Hosted-provider cost constraints when relevant.
  • Operational consistency across the selected persona set.
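
One way to picture how a few of these dimensions combine is a hard filter followed by a weighted score. The sketch below is illustrative only: the fields, weights, and numbers are invented, and it covers just hardware fit, latency, and quality rather than the full list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    vram_gb: float   # memory the model needs to load and serve
    ttft_ms: float   # expected time to first token, must be > 0
    quality: float   # persona-specific quality estimate in [0, 1]


@dataclass(frozen=True)
class Environment:
    vram_gb: float         # VRAM available on the selected host
    ttft_target_ms: float  # latency target for the persona
    min_quality: float     # quality floor for the persona


def rank(candidates, env):
    """Drop candidates that cannot fit, then sort the rest by a weighted score."""
    fits = [
        c for c in candidates
        if c.vram_gb <= env.vram_gb and c.quality >= env.min_quality
    ]

    def score(c):
        latency_fit = min(1.0, env.ttft_target_ms / c.ttft_ms)
        headroom = 1.0 - c.vram_gb / env.vram_gb
        # Weights are arbitrary illustration values, not Srasta's.
        return 0.5 * c.quality + 0.3 * latency_fit + 0.2 * headroom

    return sorted(fits, key=score, reverse=True)
```

Treating VRAM and the quality floor as filters rather than score terms reflects the point that loading is necessary but not sufficient: a model that does not fit is never ranked, and one that fits still competes on latency and quality.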

Runtime Contracts

Correct routing includes parser and embedding decisions.

Tool-call parser

Agentic tool calls served by vLLM require the tool-call parser that matches the model family. Srasta tracks this per model so tool calls do not silently degrade into plain text.

Reasoning parser

Models with separate thinking output can declare the matching reasoning parser where supported.
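
A per-model parser record might be turned into vLLM server arguments roughly as follows. The registry layout and model names are hypothetical. The flags `--enable-auto-tool-choice`, `--tool-call-parser`, and `--reasoning-parser` are vLLM CLI options as commonly documented, but flag names and parser values vary by vLLM version, so check them against the release in use.

```python
# Hypothetical registry; model names are placeholders and the parser values
# are examples that must be verified against the vLLM version deployed.
MODEL_PARSERS = {
    "example-tool-model": {"tool_call_parser": "hermes"},
    "example-thinking-model": {
        "tool_call_parser": "hermes",
        "reasoning_parser": "deepseek_r1",
    },
    "example-plain-model": {},
}


def vllm_parser_args(model_entry, needs_tools):
    """Build parser-related vLLM arguments, failing loudly on a missing parser."""
    args = []
    if needs_tools:
        tool_parser = model_entry.get("tool_call_parser")
        if not tool_parser:
            # Refuse instead of letting tool calls come back as plain text.
            raise ValueError("persona needs tool calls but no parser is recorded")
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_parser]
    reasoning_parser = model_entry.get("reasoning_parser")
    if reasoning_parser:
        args += ["--reasoning-parser", reasoning_parser]
    return args
```

Raising at configuration time is the design choice worth noting: a tool-heavy persona such as coding never starts against a model whose parser is unknown.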

Embedding route

TEI serves the default embedding path where supported. ARM or constrained deployments can fall back to Ollama embeddings.

Fallback chain

Primary persona routes can fall back to a lightweight route such as fast when configured by the installer.
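
The fallback behavior can be sketched as trying the primary alias first and then each configured fallback in order. The function names, the `invoke` callable, and the shape of the `fallbacks` mapping are assumptions for illustration; in a real deployment the gateway would own this logic.

```python
class AllRoutesFailed(Exception):
    """Raised when the primary alias and every fallback have failed."""


def resolve_chain(alias, fallbacks):
    """Return the primary alias followed by its fallbacks, without repeats."""
    chain = [alias]
    for candidate in fallbacks.get(alias, []):
        if candidate not in chain:
            chain.append(candidate)
    return chain


def call_with_fallback(alias, fallbacks, invoke):
    """Try each alias in the chain; return (alias_used, result) on first success."""
    errors = {}
    for candidate in resolve_chain(alias, fallbacks):
        try:
            return candidate, invoke(candidate)
        except Exception as exc:  # record the failure and move to the next route
            errors[candidate] = exc
    raise AllRoutesFailed(errors)


# Example configuration: every primary persona falls back to "fast".
FALLBACKS = {"coding": ["fast"], "business": ["fast"], "general": ["fast"]}
```

Returning the alias that actually served the request lets the caller, or an audit record, show when a response came from the fallback rather than the primary route.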

Governance

Routing does not bypass Srasta's security model.

Whether a model is local, self-hosted, or hosted, access still enters through the same governed Srasta API path. Model routing is not a backdoor around identity, authorization, audit, rate limits, or license posture.

Per-role model whitelist

Roles can be granted explicit model access; unauthorized requests fail before execution.
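
A whitelist check of this kind might look like the sketch below, where the role names, the mapping, and the `invoke` callable are all hypothetical. The point it illustrates is ordering: authorization runs before the backend is called.

```python
# Hypothetical role-to-alias grants; real roles come from the operator's RBAC setup.
ROLE_MODEL_WHITELIST = {
    "developer": {"coding", "fast"},
    "analyst": {"business", "general", "fast"},
}


class ModelAccessDenied(PermissionError):
    """Raised when a role requests a model alias it has not been granted."""


def authorize(role, alias, whitelist):
    """Raise unless the role has an explicit grant for the alias."""
    allowed = whitelist.get(role, frozenset())  # unknown roles get nothing
    if alias not in allowed:
        raise ModelAccessDenied(f"role {role!r} may not use alias {alias!r}")


def governed_call(role, alias, whitelist, invoke):
    """Authorize first, then run inference; denied requests never reach invoke."""
    authorize(role, alias, whitelist)
    return invoke(alias)
```

Defaulting unknown roles to an empty set makes access deny-by-default, which matches the "explicit model access" wording above.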

Audit on every route

Inference requests, failures, policy denials, and external-provider calls produce audit evidence.
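
To make "audit evidence" concrete, the sketch below builds one structured record per event type named above. The field names, event labels, and list-based sink are invented for illustration and do not describe Srasta's audit format.

```python
import json
from datetime import datetime, timezone

# Hypothetical labels for the four event kinds named in the text.
AUDIT_EVENT_TYPES = {
    "inference_request",
    "inference_failure",
    "policy_denial",
    "external_provider_call",
}


def audit_event(event_type, role, alias, backend_class, detail=""):
    """Build a structured audit record; unknown event types are rejected."""
    if event_type not in AUDIT_EVENT_TYPES:
        raise ValueError(f"unknown audit event type: {event_type}")
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event_type,
        "role": role,
        "alias": alias,
        "backend_class": backend_class,
        "detail": detail,
    }


def emit(event, sink):
    """Append the event as one JSON line; sink is any list-like object."""
    sink.append(json.dumps(event, sort_keys=True))
```

For example, `emit(audit_event("policy_denial", "guest", "coding", "local_gpu"), log)` appends a single JSON line to `log`, so denials leave the same kind of trace as successful calls.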

External egress acknowledgment

External inference requires explicit operator acknowledgment because prompts and responses leave the Srasta cluster.

Operator override

Recommendations are defaults; operators can override persona assignments with visible tradeoffs.

FAQ

Inference questions operators usually ask first.

Where can Srasta run inference?

Srasta supports local inference with bundled vLLM on GPU or Ollama on CPU, external self-hosted inference such as vLLM, NVIDIA NIM, or OpenAI-compatible endpoints, and hosted APIs such as Anthropic, OpenAI, Hugging Face, Together, or Fireworks.

What does LiteLLM do in Srasta?

LiteLLM is the unified inference gateway. Srasta generates model routes and persona aliases so callers can request models like coding, business, general, or fast while LiteLLM routes to the configured backend.

How does Srasta choose models?

The installer uses hardware profile, intent, inference provider, personas, expected volume, and operator constraints to recommend models per persona. Operators can accept recommendations or override each persona.

How is model access governed?

Model requests still pass through Srasta API controls: identity, RBAC, per-role model whitelist, license posture, rate limits, and audit. Only after those checks are requests routed downstream through LiteLLM or the configured provider.

Plan the Runtime

Start with the personas and constraints, then pick models.

A useful Srasta deployment starts by mapping who will use the system, what work they do, where inference is allowed to run, and what latency, quality, cost, and data-boundary constraints matter.
