Overview
A self-hosted Bud deployment has two resource layers, and sizing each one separately keeps your infrastructure matched to its workload:
Platform services
The control plane — dashboard, APIs, gateway, databases, queue, object
storage, and monitoring. Memory-led and largely fixed. CPU stays low at
rest and rises only with request throughput.
Inference nodes
Where your models actually run. The dominant cost at scale. Provisioned
on dedicated inference nodes whose hardware (GPU, CPU, or HPU) is chosen
per model from its use case and latency/throughput (SLO) targets.
The platform recommends the node type and configuration when you deploy each
model.
Plan the platform services first (a relatively fixed footprint), then add
inference nodes of the appropriate type — CPU, GPU, or HPU — for the models you
intend to serve. The two layers scale independently.
Choosing inference hardware
You add inference capacity by attaching nodes with the accelerator that best fits each model and its service-level objectives. The platform’s optimizer analyses the model and target SLO and recommends the node type and count.

| Hardware | Typical fit |
|---|---|
| GPU (CUDA) | Largest models, lowest latency, highest concurrency |
| CPU | Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites |
| HPU (Intel Gaudi) | High-throughput serving where Gaudi accelerators are available |
Choosing a deployment size
- Single Node (AI-in-a-Box)
- Clustered (High Availability)
- Large-Scale (Multi-Tenant)
Single Node (AI-in-a-Box): one machine running the full platform plus one or more models locally.
Suitable for evaluations, edge sites, and small teams (up to ~100 concurrent
users).
Smaller or quantized models can serve on CPU. A single 48–80 GB GPU
comfortably serves one model in the ~20–30B-parameter range at low latency.
Larger models or higher concurrency call for the clustered tier.
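The GPU guidance above can be sanity-checked with a rough weight-memory estimate. The sketch below is a back-of-the-envelope calculation, not the platform's optimizer: it counts model weights only, and the 20% headroom reserved for KV cache and activations is an assumed figure chosen for illustration. The function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in decimal GB.

    1e9 parameters times N bytes each is N GB, so the factors of 1e9 cancel.
    """
    return params_billions * bytes_per_param


def fits_on_gpu(params_billions: float, bytes_per_param: float,
                gpu_gb: float, headroom_fraction: float = 0.2) -> bool:
    """True if the weights fit while leaving headroom on the GPU.

    headroom_fraction is an assumption (space for KV cache and activations);
    real requirements depend on context length and concurrency.
    """
    usable_gb = gpu_gb * (1.0 - headroom_fraction)
    return weight_memory_gb(params_billions, bytes_per_param) <= usable_gb


# 16-bit weights use 2 bytes per parameter; 8-bit quantized weights use 1.
print(weight_memory_gb(30, 2))   # 30B parameters at 16-bit
print(fits_on_gpu(30, 2, 80))    # 30B at 16-bit on an 80 GB GPU
print(fits_on_gpu(30, 2, 48))    # 30B at 16-bit on a 48 GB GPU
print(fits_on_gpu(30, 1, 48))    # 30B at 8-bit on a 48 GB GPU
```

Under these assumptions a 30B model at 16-bit precision needs the larger end of the 48–80 GB range, while quantizing to 8-bit brings it within reach of a 48 GB card.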
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 16 cores | 32 cores |
| Memory | 96 GiB | 128 GiB |
| Storage (NVMe SSD) | 1 TiB | 2 TiB |
| Inference accelerator | CPU-only (smaller / quantized models) | 1 × 48–80 GB GPU for larger models or low-latency serving |
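As a quick pre-flight check, a candidate machine can be compared against the single-node table above. This is a minimal sketch: the thresholds are copied from the table, while `HostSpec`, `meets`, and `classify` are hypothetical names introduced for illustration and are not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    cpu_cores: int
    memory_gib: int
    nvme_tib: float


# Thresholds taken from the single-node sizing table.
MINIMUM = HostSpec(cpu_cores=16, memory_gib=96, nvme_tib=1.0)
RECOMMENDED = HostSpec(cpu_cores=32, memory_gib=128, nvme_tib=2.0)


def meets(host: HostSpec, target: HostSpec) -> bool:
    """True if the host matches or exceeds the target on every resource."""
    return (host.cpu_cores >= target.cpu_cores
            and host.memory_gib >= target.memory_gib
            and host.nvme_tib >= target.nvme_tib)


def classify(host: HostSpec) -> str:
    """Label a host as 'recommended', 'minimum', or 'below minimum'."""
    if meets(host, RECOMMENDED):
        return "recommended"
    if meets(host, MINIMUM):
        return "minimum"
    return "below minimum"


print(classify(HostSpec(cpu_cores=24, memory_gib=96, nvme_tib=1.0)))
```

The check deliberately ignores the accelerator row, since that choice depends on the model being served rather than on a fixed threshold.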
Storage planning
Storage has a fixed baseline plus a growth component that you control through retention settings.
Fixed baseline
Independent of traffic, a deployment provisions storage for its databases, message queue, object storage, and your model files. Plan for roughly 1 TiB to start, with the model registry (where downloaded model files live) being the largest and most variable part; size it to your model catalog (commonly 300 GiB–1 TiB+).
Growth and retention
Usage data grows with traffic but is bounded by retention windows, so it reaches a steady state rather than growing forever:

| Data | What it is | Default retention |
|---|---|---|
| Inference analytics | Per-request metadata: tokens, latency, model, status | 90 days |
| Observability | Traces, logs, and metrics across the platform | 30 days (configurable) |
| Usage metrics | Aggregated dashboards and billing rollups | 90 days |

| Configuration | Steady-state (3 HA copies) |
|---|---|
| Metadata only (default) | ~6 TB, plateaus at the retention window |
| With full request/response logging enabled | ~19 TB, plateaus at the retention window |
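The plateau behaviour can be illustrated with a simple model: data on disk grows with daily ingest until the retention window is full, then stays flat. The sketch below assumes a constant daily ingest rate and ignores compression and indexing overhead. The 10 GiB/day figure is purely hypothetical and is not derived from the table above; substitute a rate measured on your own deployment.

```python
def stored_gib(day: int, daily_ingest_gib: float,
               retention_days: int, copies: int = 1) -> float:
    """Data on disk at the end of `day`, with older data already expired."""
    retained_days = min(day, retention_days)
    return daily_ingest_gib * retained_days * copies


def steady_state_gib(daily_ingest_gib: float,
                     retention_days: int, copies: int = 1) -> float:
    """Plateau size once the retention window is full."""
    return daily_ingest_gib * retention_days * copies


# Hypothetical workload: 10 GiB/day, 90-day retention, 3 HA copies.
for day in (30, 90, 180, 365):
    print(day, stored_gib(day, daily_ingest_gib=10, retention_days=90, copies=3))
```

In this model, halving the retention window halves the plateau, which is why retention is the main lever for keeping storage bounded.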
Keeping storage predictable
- Tune the observability retention window to your needs (shorter = less storage).
- Keep full request/response logging off unless you need it, and bound it with a retention window when you do.
- Use premium SSD/NVMe for databases and the model registry; standard SSD is fine for general application data.
- For very large datasets, scale the analytics database horizontally rather than relying on replication alone.
Scaling
Platform services
- The platform layer is memory-led; provision memory generously and treat CPU as elastic.
- Stateless services support horizontal autoscaling (enable it per service for high availability and burst handling).
- Run three copies of stateful services for high availability in the clustered and large-scale tiers.
Inference nodes
- Add inference nodes to increase serving capacity. Choose the accelerator — GPU, CPU, or HPU — based on each model and its SLO; the platform recommends the node type and configuration at deployment time.
- Scale out by adding model replicas across inference nodes as concurrency grows.
- Keep inference nodes in a separate node pool from the platform services so the two scale independently, and mix node types to match each model.
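For a first estimate of how many model replicas a concurrency target implies, a ceiling division is usually enough. This is a rough planning sketch, not the platform's recommendation logic: `per_replica_concurrency` is a hypothetical figure you would measure for your own model and hardware at the target SLO, and the headroom parameter is an assumption.

```python
import math


def replicas_needed(peak_concurrency: int, per_replica_concurrency: int,
                    headroom: float = 0.0) -> int:
    """Replicas required to serve peak concurrency with optional headroom.

    headroom=0.25 plans for 25% more load than the stated peak.
    """
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    target = peak_concurrency * (1.0 + headroom)
    return max(1, math.ceil(target / per_replica_concurrency))


# Hypothetical: each replica sustains 32 concurrent requests within the SLO.
print(replicas_needed(100, 32))
print(replicas_needed(130, 32, headroom=0.25))
```

Treat the result as a starting point and validate it under load, since sustainable concurrency per replica varies with model size, prompt length, and accelerator type.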
Networking
| Traffic | Minimum | Recommended |
|---|---|---|
| Between nodes | 5 Gbps | 10–40 Gbps (higher for inference pools) |
| Internet ingress/egress | 1 Gbps | 5 Gbps |
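One practical way to read the bandwidth figures is the time needed to move model files, for example when first populating the model registry. The sketch below computes an idealized transfer time and assumes the link is the only bottleneck. The `efficiency` parameter is an assumption standing in for protocol overhead and contention.

```python
def transfer_seconds(size_gib: float, link_gbps: float,
                     efficiency: float = 1.0) -> float:
    """Idealized time to move `size_gib` over a link of `link_gbps`.

    GiB are binary (2**30 bytes); Gbps are decimal (1e9 bits per second).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = size_gib * 2**30 * 8
    return bits / (link_gbps * 1e9 * efficiency)


# A 300 GiB model catalog over the minimum and recommended internet links.
for gbps in (1, 5):
    minutes = transfer_seconds(300, gbps) / 60
    print(f"{gbps} Gbps: {minutes:.1f} min")
```

Real transfers will be slower than this lower bound, so the gap between the minimum and recommended links matters most for sites that pull large model catalogs frequently.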
Next steps
- Installation Guide: Deploy the platform on Kubernetes
- Helm Configuration: Configure resources, retention, and services
- Deployment: Deployment options and workflows
- Cluster Setup: Add and configure clusters