Deployment Topologies
Srasta topology is not a one-size-fits-all compose file. The installer asks what kind of environment the operator wants, probes available hosts, recommends a feasible profile, explains trade-offs, places services, and verifies the runtime before handoff.
Recommendation Flow
The operator declares intent, the installer probes hardware, and Srasta recommends the highest feasible profile for that intent. When hardware does not satisfy the desired shape, the operator sees limitations and upgrade paths instead of a vague failure.
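The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Srasta's installer code: the names `Host`, `Profile`, `PROFILES`, and `recommend` are hypothetical, the host-count thresholds are read off the profile descriptions on this page, and the GPU requirement for the two HA profiles is an assumption (external inference could relax it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    has_gpu: bool


@dataclass(frozen=True)
class Profile:
    name: str
    min_hosts: int
    min_gpu_hosts: int


# Ordered from smallest to largest. Thresholds are assumptions drawn from
# the profile descriptions, not values taken from the installer.
PROFILES = [
    Profile("Trial", 1, 0),
    Profile("Compact single-host", 1, 1),
    Profile("Compact 1+1", 2, 1),
    Profile("Production HA", 3, 1),
    Profile("HA + DR", 5, 1),
]


def recommend(intent: str, hosts: list) -> tuple:
    """Return (profile name, limitations) for the highest feasible profile
    at or below the operator's declared intent."""
    names = [p.name for p in PROFILES]
    target = names.index(intent)  # raises ValueError for an unknown intent
    gpu_hosts = sum(h.has_gpu for h in hosts)
    limitations = []
    # Walk downward from the requested profile until one fits the hardware.
    for profile in reversed(PROFILES[: target + 1]):
        if len(hosts) >= profile.min_hosts and gpu_hosts >= profile.min_gpu_hosts:
            return profile.name, limitations
        limitations.append(
            f"{profile.name} needs {profile.min_hosts} host(s) and "
            f"{profile.min_gpu_hosts} GPU host(s); found {len(hosts)} and {gpu_hosts}"
        )
    raise ValueError("no feasible profile: at least one host is required")
```

For example, an operator who asks for Production HA with one CPU host and one GPU host would be offered Compact 1+1, along with a limitation entry explaining that Production HA needs a third host.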
Profiles
Trial
Single host, CPU inference through Ollama, full governance and RAG surface, no HA, no backup tier.
Best for evaluation and first proof.
Compact single-host
One GPU host runs control plane, stateful services, and bundled vLLM inference on the same machine.
Best for small teams with one strong box.
Compact 1+1
CPU control plane plus dedicated GPU worker. Inference is separated from stateful and app workloads.
Common regulated production starting point.
Production HA
Three or more hosts with role separation, app-tier HA, isolated inference, observability, auth, and backup agent.
Best for production availability.
HA + DR
Five or more hosts, full role separation, dedicated backup target, stronger RTO/RPO, and single-rack failure posture.
Best for stricter recovery requirements.
External inference
Any profile can reduce local GPU dependency when the operator chooses external vLLM, NIM, hosted API, or another provider.
Best when inference is already standardized elsewhere.
Deployment Backends
Srasta supports simple Compose-based starts and platform-team Kubernetes deployments. The decision is usually less about product capability and more about the customer’s operating model.
Single-node Compose
Fastest path for trial, prototype, demo, and one-team evaluation. Everything runs on one Linux host.
Guided multi-host Compose
The installer reaches worker nodes by SSH, probes capability, places services, syncs config, and verifies each host.
Kubernetes
Uses existing cluster primitives: namespace, storage class, ingress, GPU nodes, RBAC, probes, services, and Helm values.
Managed Kubernetes
Provider-oriented path for customers using managed clusters, GPU node pools, cloud load balancers, and cloud storage patterns.
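Because the backend choice follows the operating model rather than product capability, it can be expressed as a small decision function. The sketch below is illustrative only: `choose_backend`, its parameters, and the returned identifiers are hypothetical names, not installer options.

```python
def choose_backend(host_count: int, operates_kubernetes: bool,
                   managed_cloud: bool = False) -> str:
    """Pick a deployment backend from the customer's operating model.

    The returned strings are illustrative labels, not real installer flags.
    """
    if host_count < 1:
        raise ValueError("at least one host is required")
    if operates_kubernetes:
        # Teams that already run clusters keep their platform tooling.
        return "managed-kubernetes" if managed_cloud else "kubernetes"
    # Without an existing cluster, Compose is the simpler path.
    return "compose-single-node" if host_count == 1 else "compose-multi-host"
```

A single Linux host with no cluster maps to the single-node Compose path, while the same team with three hosts would land on the guided multi-host path.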
Placement
Srasta’s placement logic separates control-plane, app, stateful, observability, backup, and inference roles. GPU hosts should serve inference; CPU hosts should absorb stateful and app workloads when available.
Control plane
Installer, plan/run state, topology, placement, access URLs, and operator lifecycle coordination.
App
Srasta API, Admin, RAG API, Tool Gateway, service discovery, gateway, and related app workloads.
Stateful
Postgres, Milvus, MinIO, Valkey, audit volumes, backup metadata, and recovery state.
Inference
vLLM, Ollama fallback, LiteLLM routing, GPU model placement, and parser-aware runtime configuration.
Observability
Langfuse, metrics, traces, audit review, and operator visibility surfaces.
Backup
Backup agents, isolated backup targets, restore plans, recovery readiness, and DR validation.
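The placement rule described in this section (GPU hosts serve inference, CPU hosts absorb the other roles when available) can be sketched as follows. This is a simplified illustration under stated assumptions: `ROLES` and `place` are hypothetical names, hosts are reduced to a name and a GPU flag, and the real placement logic presumably weighs more than GPU presence.

```python
from __future__ import annotations

# Role names taken from this section; the ordering is arbitrary.
ROLES = ["control-plane", "app", "stateful", "observability", "backup", "inference"]


def place(hosts: dict[str, bool]) -> dict[str, list[str]]:
    """Map each role to candidate hosts. `hosts` maps host name -> has_gpu."""
    if not hosts:
        raise ValueError("at least one host is required")
    gpu = sorted(name for name, has_gpu in hosts.items() if has_gpu)
    cpu = sorted(name for name, has_gpu in hosts.items() if not has_gpu)
    placement = {}
    for role in ROLES:
        if role == "inference":
            # Prefer GPU hosts; with none, fall back to CPU hosts
            # (for example, CPU inference through Ollama).
            placement[role] = gpu or cpu
        else:
            # Keep non-inference roles off GPU hosts when CPU hosts exist.
            placement[role] = cpu or gpu
    return placement
```

With one CPU host and one GPU host, inference lands on the GPU host and every other role on the CPU host. With a single GPU box, all roles share that machine, which matches the compact single-host profile.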
Access Modes
Trade-Offs
A small team with regulated data may need stronger separation than a larger team running a low-risk internal assistant. Srasta keeps the trade-off visible: speed of install, local inference quality, availability, recovery posture, and operational complexity.
Trial and compact single-host minimize setup friction and help validate the first governed workflow quickly.
Compact 1+1 moves GPU inference off the control plane and reduces IO contention.
Production HA adds role separation and app-tier resilience, but stateful tier recovery still depends on backup posture.
HA + DR adds a dedicated backup target and stronger recovery goals at the cost of more hosts and operating discipline.
FAQ
Which topology should I start with?
Start with the smallest topology that can prove the workflow safely. Trial and compact single-host are good for evaluation. Compact 1+1 is the common production starting point when local GPU inference is required. Production HA and HA + DR are for stricter availability and recovery requirements.
Do I need a GPU?
No. CPU-only trial inference is supported through Ollama, and external inference can remove the local GPU requirement. Local production inference usually needs at least one NVIDIA GPU host.
Can the control plane and GPU workers run on different CPU architectures?
Yes. The control plane can run on one architecture while GPU workers run another, such as amd64 control plane plus arm64 NVIDIA GB10 workers, because Srasta images are published as multi-architecture manifests.
When should I use Kubernetes instead of Compose?
Use Kubernetes when the customer already operates clusters and wants Helm-managed workloads, storage classes, ingress, RBAC, and platform-team lifecycle controls. Single-node and guided multi-host Compose remain valid for simpler installs.
Next
Once topology is clear, Srasta can guide model routing, GPU choice, access mode, verification, and day-2 operations without turning deployment into guesswork.