Inference & Model Routing

Srasta routes each persona to the right model under governance.

Srasta separates the operator decision from the runtime plumbing. Teams choose where inference runs and what each persona needs; Srasta turns that into governed LiteLLM routes, certified vLLM or MLX runtime settings, model-access controls, and smoke tests.

Review deployment models Security model

Routing Path

One gateway, many possible inference backends.

A caller can ask for coding, business, or general. Those are stable model aliases. The backend behind each alias is a catalog-certified local runtime or a controlled OpenAI-compatible endpoint allowed by the deployment profile.

Provider Classes

Operators choose the inference class before choosing the model.

The installer treats inference as a deployment decision. Local inference keeps prompts and responses inside the Srasta cluster. External inference can be self-hosted or hosted, but it creates an explicit data-egress decision that operators must acknowledge.

Bundled vLLM

Srasta deploys a GPU-backed OpenAI-compatible vLLM runtime on the selected GPU host.

Bundled MLX

Apple Silicon Community runtime served host-native through Srasta's certified MLX profile.

External self-hosted

Operator-provided vLLM, NVIDIA NIM, or generic OpenAI-compatible endpoint.

Hosted API

Provider APIs such as Anthropic, OpenAI, Hugging Face, Together, or Fireworks through LiteLLM.

Personas

Persona aliases make routing stable while models evolve.

coding

Code and tool-heavy work

Routes to coding-capable models and requires tool-call parser correctness for agentic workflows.

business

Structured enterprise reasoning

Optimized for multi-document reasoning, summaries, structured outputs, and decision support.

general

Balanced chat and reasoning

Absorbs everyday chat and general-purpose workloads with a quality/latency balance.

review

Governed review work

Routes review and approval workflows through the same policy, audit, and model-access controls.

Recommendation Engine

Model fit is a function of persona, hardware, volume, and constraints.

The installer does more than ask whether a model can load. It evaluates fit against the operator's environment: GPU and RAM, inference class, deployment intent, expected concurrency, latency and quality targets, and cost constraints for hosted providers.

Ranking dimensions

Hardware fit and VRAM envelope.
Latency targets such as TTFT and token throughput.
Persona-specific quality requirements.
Throughput and peak concurrency fit.
Hosted-provider cost constraints when relevant.
Operational consistency across the selected persona set.

Runtime Contracts

Correct routing includes parser and embedding decisions.

Tool-call parser

vLLM agentic tool calls require the right parser for the model family. Srasta tracks this per model so tool calls do not silently degrade into plain text.

Reasoning parser

Models with separate thinking output can declare the matching reasoning parser where supported.

Embedding route

Embeddings use the certified engine-native OpenAI-compatible path for the selected hardware cell.

Certified route

Persona routes resolve to catalog-certified model and runtime bindings rather than ad hoc fallback chains.

Governance

Routing does not bypass Srasta's security model.

Whether a model is local, self-hosted, or hosted, access still enters through the same governed Srasta API path. Model routing is not a backdoor around identity, authorization, audit, rate limits, or license posture.

Per-role model whitelist

Roles can be granted explicit model access; unauthorized requests fail before execution.

Audit on every route

Inference requests, failures, policy denials, and external-provider calls produce audit evidence.

External egress acknowledgment

External inference is called out because prompts and responses leave the Srasta cluster.

Operator override

Recommendations are defaults; operators can override persona assignments with visible tradeoffs.

FAQ

Inference questions operators usually ask first.

Where can Srasta run inference?

Srasta supports local inference through catalog-certified vLLM on NVIDIA GPU cells and MLX on Apple Silicon cells, plus controlled OpenAI-compatible external endpoints when the deployment profile allows it.

What does LiteLLM do in Srasta?

LiteLLM is the unified inference gateway. Srasta generates model routes and persona aliases so callers can request models like coding, business, or general while LiteLLM routes to the configured backend.

How does Srasta choose models?

The installer uses hardware profile, intent, inference provider, personas, expected volume, and operator constraints to recommend models per persona. Operators can accept recommendations or override each persona.

How is model access governed?

Model requests still pass through Srasta API controls: identity, RBAC, per-role model whitelist, license posture, rate limits, audit, and downstream routing through LiteLLM or the configured provider.

Plan the Runtime

Start with the personas and constraints, then pick models.

A useful Srasta deployment starts by mapping who will use the system, what work they do, where inference is allowed to run, and what latency, quality, cost, and data-boundary constraints matter.

Plan deployment Installer process Back to docs