Hardware Requirements - Bud Stack Documentation

Overview

A self-hosted Bud deployment has two resource layers, and sizing each one separately is the key to right-sizing your infrastructure:

Platform services

The control plane: dashboard, APIs, gateway, databases, queue, object storage, and monitoring. Memory is a largely fixed baseline; CPU scales with request throughput and is the binding constraint for high-throughput agent/event traffic. In load testing, CPU runs out well before memory.

Inference nodes

Where your models actually run, and the dominant cost at scale. Models are provisioned on dedicated inference nodes whose hardware (GPU, CPU, or HPU) is chosen per model based on its use case and latency/throughput (SLO) targets. The platform recommends the node type and configuration when you deploy each model.

Plan the platform services first (a relatively fixed footprint), then add inference nodes of the appropriate type — CPU, GPU, or HPU — for the models you intend to serve. The two layers scale independently.

Choosing inference hardware

You add inference capacity by attaching nodes with the accelerator that best fits each model and its service-level objectives. The platform’s optimizer analyses the model and target SLO and recommends the node type and count.

Hardware	Typical fit
GPU (CUDA)	Largest models, lowest latency, highest concurrency
CPU	Smaller / quantized models, batch or latency-tolerant workloads, cost-sensitive sites
HPU (Intel Gaudi)	High-throughput serving where Gaudi accelerators are available

A single deployment can mix node types — for example, GPU nodes for latency-critical models and CPU nodes for smaller or background workloads.

Choosing a deployment size

The tiers below are sized by concurrent requests (requests in flight at once), not named users. Convert between concurrency and throughput with concurrent ≈ requests/sec × average request latency: for example, 1,000 req/s of a 10-second agent turn is ~10,000 concurrent requests. Holding concurrency is cheap on memory (~1–2 MiB per in-flight request on the platform, measured on the event path). CPU is driven by throughput (req/s) and is the resource that runs out first for high-throughput agent/event workloads, so size CPU to your peak req/s, not to a user count.
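The conversion above is Little's law. A minimal sketch of it, assuming the ~1–2 MiB per in-flight request figure quoted above and using the upper end as the default; the function names are illustrative and not part of Bud:

```python
def concurrent_requests(requests_per_sec: float, avg_latency_sec: float) -> float:
    """Little's law: requests in flight = arrival rate x average time in the system."""
    return requests_per_sec * avg_latency_sec


def in_flight_memory_gib(concurrent: float, mib_per_request: float = 2.0) -> float:
    """Working-set memory for in-flight requests, defaulting to the upper ~2 MiB estimate."""
    return concurrent * mib_per_request / 1024


concurrent = concurrent_requests(requests_per_sec=1_000, avg_latency_sec=10)
print(f"{concurrent:,.0f} concurrent requests")
print(f"~{in_flight_memory_gib(concurrent):.1f} GiB of in-flight working set")
```

At 1,000 req/s and a 10-second turn this gives 10,000 concurrent requests and roughly 10–20 GiB of in-flight memory across the platform, which is small next to the tier memory figures and is why CPU tends to be the resource to size first.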

The three tiers are Single Node (AI-in-a-Box), Clustered (High Availability), and Large-Scale (Multi-Tenant).

Single Node (AI-in-a-Box)

One machine running the full platform plus one or more models locally. Suitable for evaluations, edge sites, and small teams (up to ~100 concurrent requests).

Resource	Minimum	Recommended
CPU	16 cores	32 cores
Memory	96 GiB	128 GiB
Storage (NVMe SSD)	1 TiB	2 TiB
Inference accelerator	CPU-only (smaller / quantized models)	1 × 48–80 GB GPU for larger models or low-latency serving

The platform services alone use roughly 50–60 GiB of memory, so 64 GiB is not enough once a model and the operating system are added. Likewise, the fixed storage footprint already exceeds 200 GiB before model files — start at 1 TiB.

Smaller or quantized models can serve on CPU. A single 48–80 GB GPU comfortably serves one model in the ~20–30B-parameter range at low latency. Larger models or higher concurrency call for the clustered tier.

Clustered (High Availability)

A highly available platform across three or more nodes, with a separate inference worker pool. Suitable for production workloads up to ~1,000 concurrent requests and several models served at once.

Platform nodes

Resource	Recommended
Node shape	3 × (24–48 vCPU / 64–128 GiB / 500 GiB–1 TiB NVMe)
Total CPU	72–144 vCPU — CPU leads for agent/event throughput; prefer the upper end and add headroom for bursts
Total memory	192–384 GiB (memory validated as ample at this tier)
Platform storage (node-local)	1.5–3 TiB aggregate (the per-node NVMe above) for databases, queue, and working set
Analytics & observability	Separate volumes sized to your retention window (grows with traffic) — see Storage planning
Model storage (registry)	Separate volume sized to your model catalog — see Storage planning

Inference worker nodes (separate pool)

Resource	Recommended
Accelerators	GPU, CPU, and/or HPU — chosen per model and SLO
Capacity	Sized to the models you serve; the platform recommends node type and count per model
Memory per GPU node	≥ 2× the node’s total GPU memory
Model volume	Shared ReadWriteMany storage class (NFS, AWS EFS, Azure Files, etc.) sized to the models the pool serves — one copy for the whole pool, not per node (see Storage planning)
Networking	10 Gbps between nodes

High availability comes from running three copies of the platform’s stateful services, which is the main increase over the single-node tier. Inference nodes can be a mix of types — for example, GPU nodes for latency-critical models alongside CPU nodes for smaller workloads.

Large-Scale (Multi-Tenant)

A multi-tenant deployment for 10,000+ concurrent requests. The platform layer’s CPU grows with throughput (memory stays modest); inference capacity and data retention dominate.

Platform

Resource	Recommended
Nodes	~6–8
Total CPU	128–192 vCPU — the binding resource at this scale; size to peak req/s (see the note under Choosing a deployment size)
Total memory	384–512 GiB (rarely the limit for the platform layer)

Inference capacity

Resource	Recommended
Inference nodes	Many, multi-node — a mix of GPU/CPU/HPU sized to your model catalog, traffic, and SLOs

Storage

Resource	Recommended
Platform + analytics	10–20 TiB+ — grows with request volume and retention
Model storage (registry)	Sized to your model catalog, separate from the above — see Storage planning

At this scale both the model registry and request-analytics/usage data are major storage drivers. Size each independently — see Storage planning below.

Storage planning

Storage falls into three independent components. They scale on completely different axes, so size each one separately rather than picking a single total:

Component	Scales with	Bounded?
Platform baseline	Fixed footprint — databases, queue, object-store metadata	Yes, roughly fixed
Model storage	Number and size of the models you onboard	No — grows with your catalog
Analytics & observability	Request rate × retention window	Yes — plateaus at the retention window

Platform baseline

Independent of traffic and of your model catalog, a deployment provisions storage for its databases, message queue, and object-store metadata. Plan for roughly 200–500 GiB for this layer before any model files or request data.

Model storage

The model registry — where downloaded model weights live — is usually the largest and least predictable part of total storage, and it scales with your model catalog, not with the deployment tier. Size it explicitly. Per model, registry size ≈ parameters × bytes-per-parameter × variants kept:

Precision	Bytes/param	8B	70B	405B
bf16 / fp16	2	~16 GB	~140 GB	~810 GB
fp8 / int8	1	~8 GB	~70 GB	~405 GB
int4 (quantized)	0.5	~4 GB	~35 GB	~200 GB
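The per-model formula can be applied to a whole catalog with a few lines of code. This is a rough sketch only: sizes are decimal GB of weights, it ignores tokenizer and config files, and the catalog shown is hypothetical.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def variant_size_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in decimal GB (1e9 params x 1 byte = 1 GB)."""
    return params_billion * BYTES_PER_PARAM[precision]


def registry_size_gb(catalog: list[tuple[float, list[str]]]) -> float:
    """Sum every retained variant of every model in the catalog."""
    return sum(
        variant_size_gb(params_billion, precision)
        for params_billion, precisions in catalog
        for precision in precisions
    )


# Hypothetical catalog: a 70B model kept as bf16 and int4, plus an 8B model in bf16.
catalog = [(70, ["bf16", "int4"]), (8, ["bf16"])]
print(f"~{registry_size_gb(catalog):.0f} GB before headroom")
```

Summing per variant, rather than multiplying one size by a variant count, keeps the estimate honest when the retained variants use different precisions.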

Bud often keeps more than one variant of a model (for example, the original weights plus a quantized copy), so multiply by the number of variants you retain.

Example registry sizes

Most catalogs fall into a few tiers. Use these as a starting point, then refine with the formula above for your exact model list:

Registry tier	Example catalog	Approx. catalog size	Suggested registry capacity
Evaluation	One 8B chat model (bf16) + a small embedding model	~20 GB	50 GiB
Small catalog	A few 7–32B models, one of them quantized, + embeddings and a reranker	~50–100 GB	200 GiB
Production	A 70B model kept as both bf16 and int4, + 8–32B models and embeddings	~250–500 GB	500 GiB – 1 TiB
Large / multi-tenant	A quantized 405B model + several 70B/32B variants + a multimodal model + embeddings	~1–3 TB	2–5 TiB+

The suggested registry capacity column is what to provision for the registry volume and set as MODEL_REGISTRY_MAX_SIZE; it adds headroom above the raw catalog size for variants you add later and for the local download cache.

Model weights occupy storage in up to three places; budget for all of them:

Registry (durable copy) — one copy per model variant in the object store (SeaweedFS by default; any S3-compatible store — Ceph / rustfs — works). This is what the registry budget, MODEL_REGISTRY_MAX_SIZE, governs.
Local download cache — staging on the model-registry volume while a model is fetched, before it is uploaded to the registry. Provision at least your largest model × the number of concurrent downloads.
Inference model volume — when a model is deployed, its weights are placed on a volume the inference pool mounts. Use a shared (ReadWriteMany) storage class — NFS, AWS EFS, Azure Files, or similar — so the pool keeps one copy of each model regardless of node count; size it to the models that pool serves. (A node-local ReadWriteOnce volume works for single-node serving, but then each node needs its own copy.) Note the trade-off: shared network-attached storage (NFS, AWS EFS) saves space but can slow model load times at pod startup/scale-up compared with node-local NVMe — back it with fast storage, or use a node-local cache for latency-sensitive cold starts.
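The three places can be budgeted together. The sketch below is illustrative rather than Bud's own accounting: the function and parameter names are hypothetical, sizes are raw decimal GB with no headroom, and the variant list is assumed to be non-empty.

```python
def storage_budget_gb(
    variant_sizes_gb: list[float],
    served_sizes_gb: list[float],
    concurrent_downloads: int = 1,
    inference_nodes: int = 1,
    shared_volume: bool = True,
) -> dict[str, float]:
    """Raw space (decimal GB, no headroom) for each place model weights are stored."""
    registry = sum(variant_sizes_gb)                      # one durable copy per variant
    cache = max(variant_sizes_gb) * concurrent_downloads  # staging while models are fetched
    copies = 1 if shared_volume else inference_nodes      # ReadWriteMany keeps a single copy
    inference = sum(served_sizes_gb) * copies
    return {"registry": registry, "download_cache": cache, "inference_volume": inference}


# Hypothetical pool: 70B bf16 (~140 GB) and 70B int4 (~35 GB) in the registry,
# only the int4 variant deployed, across 4 inference nodes.
shared = storage_budget_gb([140, 35], [35], concurrent_downloads=2, inference_nodes=4)
local = storage_budget_gb([140, 35], [35], concurrent_downloads=2, inference_nodes=4,
                          shared_volume=False)
print(shared)
print(local)
```

In this example the node-local layout needs four copies of the deployed variant (140 GB in total) where the shared volume needs one (35 GB), which is the space side of the trade-off against faster local model loads.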

The registry runs a pre-flight capacity check before every download, so an over-full registry fails fast instead of part-way through a multi-gigabyte upload. Set MODEL_REGISTRY_MAX_SIZE to your provisioned registry capacity and grow the backing volume as your catalog grows — see Helm Configuration.

Analytics & observability growth

Usage data grows with traffic but is bounded by retention windows, so it reaches a steady state rather than growing forever. The data that accrues falls into the categories below, which scale very differently, so size them separately:

Data	What it is	Scales with	Default retention
Inference analytics	One metadata row per request (tokens, latency, model, status)	requests/sec	90 days
Observability traces	Raw spans — ~8 per LLM request, ~15 per agent invocation	spans/sec (≈ req/s × spans-per-request)	30 days (configurable)
Usage metrics	Aggregated dashboards and billing rollups	requests/sec	90 days

Traces are normally the dominant driver. A request isn’t one record; it’s a tree of spans (gateway + model + agent orchestration). At a measured ~385 bytes per span, an agent invocation’s ~15 spans come to ~5–6 KB of trace data, versus ~0.1–0.2 KB for its metadata row. Size the two terms separately:

fact_bytes/day   ≈ bytes_per_request × requests/sec × 86,400
trace_bytes/day  ≈ spans_per_request × bytes_per_span × requests/sec × 86,400
steady-state     ≈ (fact_bytes/day × fact-retention-days + trace_bytes/day × trace-retention-days) × replication (3) × ~1.3 merge headroom

As a guide, at a sustained 1,000 requests/second (metadata-level — no content capture; provisioned across 3 HA copies):

Component	Steady-state (3 copies)
Inference analytics — metadata rows (90-day)	~5 TB
Observability traces — bare LLM requests, ~8 spans (30-day)	~30 TB
Observability traces — agent workloads, ~15 spans (30-day)	~55–60 TB
+ full request/response content logging	adds ~2–3 KB per content-bearing span on top of the figures above

Traces carry a 30-day window and scale with spans-per-request, so the observability retention window and span volume — not the metadata rows — are your primary storage levers. Scale linearly for other rates (100 req/s ≈ one tenth).
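The guide figures can be approximated with a short calculation. This is a sketch under the numbers quoted in this section (385 bytes per span, 126 bytes per metadata row, 3 copies, ~1.3 merge headroom); the function name is illustrative, and each data type is computed with its own retention window.

```python
SECONDS_PER_DAY = 86_400


def steady_state_tb(
    bytes_per_request: float,
    requests_per_sec: float,
    retention_days: int,
    replication: int = 3,
    merge_headroom: float = 1.3,
) -> float:
    """Steady-state storage in decimal TB for one data type at a sustained request rate."""
    bytes_per_day = bytes_per_request * requests_per_sec * SECONDS_PER_DAY
    return bytes_per_day * retention_days * replication * merge_headroom / 1e12


RATE = 1_000          # sustained requests/sec
BYTES_PER_SPAN = 385  # measured span size quoted in this section
BYTES_PER_ROW = 126   # measured metadata row size quoted in this section

facts = steady_state_tb(BYTES_PER_ROW, RATE, retention_days=90)
llm_traces = steady_state_tb(8 * BYTES_PER_SPAN, RATE, retention_days=30)
agent_traces = steady_state_tb(15 * BYTES_PER_SPAN, RATE, retention_days=30)
print(f"metadata rows: ~{facts:.1f} TB")
print(f"LLM traces:    ~{llm_traces:.1f} TB")
print(f"agent traces:  ~{agent_traces:.1f} TB")
```

With these inputs the trace terms come to about 31 TB and 58 TB, in line with the table. The metadata term comes to about 3.8 TB with the 126-byte row, somewhat under the table's ~5 TB, so treat the table value as the more conservative planning number.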

Measured on a 100M-request reference dataset: the per-request metadata fact is ~126 bytes/row (11.8 GiB compressed per replica per 100M requests), while raw trace spans are ~385 bytes each and agent invocations average ~15 spans — so traces are ~90% of per-request storage. The biggest levers are the observability retention window, span sampling, and keeping content capture off unless you need it.

Enabling full request/response content logging substantially increases storage. If you turn it on, set an appropriate retention window first and provision accordingly.

Keeping storage predictable

Prune unused models and stale quantized variants from the registry — for most deployments the model catalog, not request volume, is the largest storage driver.
Tune the observability retention window to your needs (shorter = less storage).
Keep full request/response logging off unless you need it, and bound it with a retention window when you do.
Use premium SSD/NVMe for databases and the model registry; standard SSD is fine for general application data.
For very large datasets, scale the analytics database horizontally rather than relying on replication alone.

Scaling

Platform services

Memory is a largely fixed baseline (databases, queue, monitoring) plus a small per-request working set (~1–2 MiB per concurrent request, measured on the event path). Provision it generously, but it is rarely the binding constraint.
CPU scales with request throughput and is usually the binding constraint for high-throughput agent/event paths — size it to peak req/s with headroom, not as an afterthought. Every service pod also runs a Dapr sidecar, so per-pod CPU (and the total across many replicas) is higher than the app container alone.
Stateless services support horizontal autoscaling (enable it per service for high availability and burst handling). The public event edge (budevent) holds one in-flight turn per concurrent request (~1 MiB each) and caps the number it holds per pod, so ~10,000 concurrent requests need ~20 edge replicas (about 500 per pod). Scale the edge out alongside throughput, and keep a warm minReplicas floor so a sudden spike lands on enough pods before autoscaling catches up (~60–90 s).
Run three copies of stateful services for high availability in the clustered and large-scale tiers. Note that the shared databases (the Postgres connection limit and the analytics store) are a stateful ceiling that scaling stateless replicas does not lift; scale the datastore itself for high req/s.
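The edge replica count follows from the per-pod cap. A minimal sketch, assuming a cap of about 500 in-flight turns per pod (inferred from the 10,000 concurrent to 20 replicas ratio above, not a documented setting); the function name is hypothetical:

```python
import math


def edge_replicas(peak_concurrent: int, per_pod_cap: int = 500) -> int:
    """Replicas needed to hold peak in-flight turns; the 500/pod default is inferred."""
    return math.ceil(peak_concurrent / per_pod_cap)


print(edge_replicas(10_000))  # 20
print(edge_replicas(2_600))   # 6
```

The same calculation gives a starting point for the warm minReplicas floor: size it to the concurrency a spike could reach in the ~60–90 s before autoscaling catches up, rather than to average load.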

Inference nodes

Add inference nodes to increase serving capacity. Choose the accelerator — GPU, CPU, or HPU — based on each model and its SLO; the platform recommends the node type and configuration at deployment time.
Scale out by adding model replicas across inference nodes as concurrency grows.
Keep inference nodes in a separate node pool from the platform services so the two scale independently, and mix node types to match each model.

Networking

Traffic	Minimum	Recommended
Between nodes	5 Gbps	10–40 Gbps (higher for inference pools)
Internet ingress/egress	1 Gbps	5 Gbps

Next steps

Installation Guide

Deploy the platform on Kubernetes

Helm Configuration

Configure resources, retention, and services

Deployment

Deployment options and workflows