Overview

A self-hosted Bud deployment has two resource layers, and sizing each one separately is the key to right-sizing your infrastructure:

Platform services

The control plane — dashboard, APIs, gateway, databases, queue, object storage, and monitoring. Memory-led and largely fixed. CPU stays low at rest and rises only with request throughput.

Inference nodes

Where your models actually run, and the dominant cost at scale. Models are provisioned on dedicated inference nodes whose hardware (GPU, CPU, or HPU) is chosen per model according to the use case and latency/throughput (SLO) targets. The platform recommends the node type and configuration when you deploy each model.

Plan the platform services first (a relatively fixed footprint), then add inference nodes of the appropriate type (CPU, GPU, or HPU) for the models you intend to serve. The two layers scale independently.

Choosing inference hardware

You add inference capacity by attaching nodes with the accelerator that best fits each model and its service-level objectives. The platform’s optimizer analyses the model and target SLO and recommends the node type and count.

| Hardware | Typical fit |
|---|---|
| GPU (CUDA) | Largest models, lowest latency, highest concurrency |
| CPU | Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites |
| HPU (Intel Gaudi) | High-throughput serving where Gaudi accelerators are available |

A single deployment can mix node types: for example, GPU nodes for latency-critical models and CPU nodes for smaller or background workloads.

Choosing a deployment size

Single-machine tier: one machine running the full platform plus one or more models locally. It suits evaluations, edge sites, and small teams (up to ~100 concurrent users).

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 16 cores | 32 cores |
| Memory | 96 GiB | 128 GiB |
| Storage (NVMe SSD) | 1 TiB | 2 TiB |
| Inference accelerator | CPU-only (smaller / quantized models) | 1 × 48–80 GB GPU for larger models or low-latency serving |

The platform services alone use roughly 50–60 GiB of memory, so 64 GiB is not enough once a model and the operating system are added. Likewise, the fixed storage footprint already exceeds 200 GiB before model files, so start at 1 TiB.

Smaller or quantized models can be served on CPU. A single 48–80 GB GPU comfortably serves one model in the ~20–30B-parameter range at low latency. Larger models or higher concurrency call for the clustered tier.
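
The memory reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the platform: the 60 GiB platform figure is the upper end of the 50–60 GiB range quoted above, while `OS_RESERVE_GIB` and the function name are assumptions made for illustration.

```python
# Platform services use roughly 50-60 GiB of memory (figure from this guide).
PLATFORM_MEMORY_GIB = 60  # upper end of the quoted range
OS_RESERVE_GIB = 8        # assumption: headroom for the OS and system daemons


def model_memory_headroom(total_gib: float,
                          platform_gib: float = PLATFORM_MEMORY_GIB,
                          os_reserve_gib: float = OS_RESERVE_GIB) -> float:
    """Return host memory (GiB) left for models; a negative value is a shortfall."""
    return total_gib - platform_gib - os_reserve_gib


for total in (64, 96, 128):
    headroom = model_memory_headroom(total)
    print(f"{total} GiB host -> {headroom:.0f} GiB left for models")
```

Under these assumptions a 64 GiB host is already short before any model is loaded, while the 96 GiB minimum leaves about 28 GiB. The headroom matters most for CPU-served models, which live in host memory; on a GPU node the weights sit mainly in accelerator memory.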

Storage planning

Storage has a fixed baseline plus a growth component that you control through retention settings.

Fixed baseline

Independent of traffic, a deployment provisions storage for its databases, message queue, object storage, and your model files. Plan for roughly 1 TiB to start, with the model registry (where downloaded model files live) being the largest and most variable part — size it to your model catalog (commonly 300 GiB–1 TiB+).

Growth and retention

Usage data grows with traffic but is bounded by retention windows, so it reaches a steady state rather than growing forever:

| Data | What it is | Default retention |
|---|---|---|
| Inference analytics | Per-request metadata: tokens, latency, model, status | 90 days |
| Observability | Traces, logs, and metrics across the platform | 30 days (configurable) |
| Usage metrics | Aggregated dashboards and billing rollups | 90 days |

Estimate the growth component with:

    steady-state size ≈ bytes-per-request × requests-per-second × 86,400 × retention-days × replication

The default deployment stores per-request metadata only (not full prompt and response text), which keeps this small. As a guide, at a sustained 1,000 requests/second:

| Configuration | Steady-state (3 HA copies) |
|---|---|
| Metadata only (default) | ~6 TB, plateaus at the retention window |
| With full request/response logging enabled | ~19 TB, plateaus at the retention window |

Scale linearly for other rates: for example, 100 requests/second is roughly one tenth.

Enabling full request/response content logging substantially increases storage. If you turn it on, set an appropriate retention window first and provision accordingly.
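
The steady-state formula can be sketched in Python to test other request rates or retention windows. The function names are hypothetical. The bytes-per-request value is not a published figure: it is back-solved from the ~6 TB row, assuming decimal terabytes and a single 90-day window across all data types.

```python
SECONDS_PER_DAY = 86_400


def steady_state_bytes(bytes_per_request: float,
                       requests_per_second: float,
                       retention_days: float,
                       replication: int = 3) -> float:
    """Steady-state size in bytes, following the formula in this guide."""
    return (bytes_per_request * requests_per_second
            * SECONDS_PER_DAY * retention_days * replication)


def implied_bytes_per_request(total_bytes: float,
                              requests_per_second: float,
                              retention_days: float,
                              replication: int = 3) -> float:
    """Invert the formula to recover the per-request size behind a known total."""
    return total_bytes / (requests_per_second * SECONDS_PER_DAY
                          * retention_days * replication)


# Assumption: the "~6 TB" guide figure is decimal TB with a 90-day window.
per_request = implied_bytes_per_request(6e12, 1_000, 90)
print(f"implied metadata size: {per_request:.0f} bytes/request")

# Re-apply the formula at 100 requests/second with the same settings.
size_tb = steady_state_bytes(per_request, 100, 90) / 1e12
print(f"100 req/s, 90 days, 3 copies: {size_tb:.2f} TB")
```

Halving `retention_days` halves the result, which is why retention is the main lever for the growth component.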

Keeping storage predictable

  • Tune the observability retention window to your needs (shorter = less storage).
  • Keep full request/response logging off unless you need it, and bound it with a retention window when you do.
  • Use premium SSD/NVMe for databases and the model registry; standard SSD is fine for general application data.
  • For very large datasets, scale the analytics database horizontally rather than relying on replication alone.

Scaling

Platform services

  • The platform layer is memory-led; provision memory generously and treat CPU as elastic.
  • Stateless services support horizontal autoscaling (enable it per service for high availability and burst handling).
  • Run three copies of stateful services for high availability in the clustered and large-scale tiers.

Inference nodes

  • Add inference nodes to increase serving capacity. Choose the accelerator — GPU, CPU, or HPU — based on each model and its SLO; the platform recommends the node type and configuration at deployment time.
  • Scale out by adding model replicas across inference nodes as concurrency grows.
  • Keep inference nodes in a separate node pool from the platform services so the two scale independently, and mix node types to match each model.
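
For rough planning before a model is deployed, scaling out by replicas reduces to a ceiling division. This is a minimal sketch under stated assumptions: `per_replica_concurrency` is a hypothetical value you would measure for your own model and SLO, and `ha_minimum` is an illustrative floor. It does not reproduce the platform's optimizer, whose recommendation takes precedence.

```python
import math


def replicas_needed(peak_concurrency: int,
                    per_replica_concurrency: int,
                    ha_minimum: int = 1) -> int:
    """Estimate how many model replicas cover a given peak concurrency.

    per_replica_concurrency: concurrent requests one replica sustains within
    the SLO (hypothetical input; measure it for your model and hardware).
    ha_minimum: floor on the replica count, e.g. 2 to survive a node failure.
    """
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    return max(ha_minimum, math.ceil(peak_concurrency / per_replica_concurrency))


# Example: 100 concurrent requests, each replica assumed to sustain 32.
print(replicas_needed(100, 32))
```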

Networking

| Traffic | Minimum | Recommended |
|---|---|---|
| Between nodes | 5 Gbps | 10–40 Gbps (higher for inference pools) |
| Internet ingress/egress | 1 Gbps | 5 Gbps |
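
One way to judge these bandwidth figures is to estimate how long it takes to move model files between nodes. The sketch below is illustrative only: the function name and the `efficiency` parameter are assumptions, and the 300 GiB size is borrowed from the lower end of the model registry range in the storage section.

```python
GIB = 2**30  # bytes per GiB


def transfer_seconds(size_gib: float, link_gbps: float,
                     efficiency: float = 1.0) -> float:
    """Time to move size_gib over a link of link_gbps (decimal gigabits/s).

    efficiency: fraction of line rate actually achieved (assumption; real
    transfers fall below 1.0 because of protocol and storage overhead).
    """
    bits = size_gib * GIB * 8
    return bits / (link_gbps * 1e9 * efficiency)


for gbps in (5, 10, 40):
    minutes = transfer_seconds(300, gbps) / 60
    print(f"300 GiB over {gbps} Gbps: {minutes:.1f} min at line rate")
```

At ideal line rate, 300 GiB takes roughly 8.6 minutes at the 5 Gbps minimum and about 4.3 minutes at 10 Gbps. Real transfers will be slower.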

Next steps

  • Installation Guide: deploy the platform on Kubernetes.
  • Helm Configuration: configure resources, retention, and services.
  • Deployment: deployment options and workflows.
  • Cluster Setup: add and configure clusters.