Infrastructure

Performance & Reliability

Production-grade inference with sub-300ms P50 latency, 99.95% uptime SLA, self-healing architecture, and zero-downtime spot termination handling. Measured, not marketed.

<300ms
P50 latency
for standard requests
99.95%
Uptime SLA
with service credits
<100ms
Specialist restart
via supervision trees
0
Dropped requests
during spot termination

Latency Benchmarks

All measurements are from production traffic on NVIDIA A10G GPUs running Mistral 7B v0.2 in 4-bit GPTQ quantization. These are real numbers from live deployments, not synthetic benchmarks. Latency varies by prompt length, output length, and concurrent load.

Latency by specialist type

| Specialist | Avg Input | Avg Output | P50 | P95 | P99 |
|---|---|---|---|---|---|
| Classification | 150 tokens | 15 tokens | 89ms | 145ms | 210ms |
| Entity extraction | 300 tokens | 80 tokens | 165ms | 280ms | 380ms |
| Summarization | 800 tokens | 200 tokens | 290ms | 480ms | 620ms |
| Q&A | 500 tokens | 150 tokens | 240ms | 390ms | 510ms |
| Chat (single turn) | 200 tokens | 300 tokens | 310ms | 520ms | 680ms |
| Code generation | 400 tokens | 500 tokens | 420ms | 650ms | 890ms |

Alveare vs OpenAI GPT-3.5 Turbo latency comparison

| Task | Alveare P50 | GPT-3.5 P50 | Difference |
|---|---|---|---|
| Classification (short output) | 89ms | 350-500ms | 3-5x faster |
| Entity extraction | 165ms | 500-800ms | 3-5x faster |
| Summarization | 290ms | 800-1200ms | 3-4x faster |
| Chat (single turn) | 310ms | 600-1000ms | 2-3x faster |

OpenAI latency varies significantly by load; these figures were measured during business hours from the US East Coast. Alveare latency is more consistent because dedicated infrastructure has no noisy neighbors.
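You can reproduce this kind of comparison against your own workload. The sketch below times repeated requests and reports nearest-rank percentiles; `send_request` is a placeholder for whatever client call you use, not an Alveare SDK function.

```python
import time

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

def benchmark(send_request, n=200):
    """Time n sequential requests and report P50/P95/P99 in milliseconds.

    `send_request` is any zero-argument callable that performs one
    inference call, e.g. a wrapper around your HTTP client.
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        samples.append((time.perf_counter() - start) * 1000)
    return {p: percentile(samples, p) for p in (50, 95, 99)}
```

Sequential timing isolates per-request latency; measure under concurrent load separately, since P95/P99 degrade first as utilization rises.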


Throughput

Throughput depends on your plan tier, model size, and the number of hives allocated. Each hive processes requests in parallel across its GPU allocation. Adding hives scales throughput linearly.

Throughput by plan tier

| Plan | Hives | Sustained req/s | Burst req/s | GPU Allocation |
|---|---|---|---|---|
| Starter | 1 | 12 req/s | 25 req/s | 1x A10G (24 GB) |
| Professional | 3 | 36 req/s | 75 req/s | 3x A10G (72 GB) |
| Scale | 10 | 120 req/s | 250 req/s | 10x A10G (240 GB) |
| Enterprise | Custom | Custom | Custom | A100 / H100 available |
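Because throughput scales linearly with hive count, capacity planning reduces to simple arithmetic. This sketch sizes a fleet from the published per-hive rates (12 req/s sustained, 25 req/s burst per A10G-backed hive); the function name is illustrative, not part of any Alveare tooling.

```python
import math

# Published per-hive rates for A10G-backed hives.
SUSTAINED_PER_HIVE = 12
BURST_PER_HIVE = 25

def hives_needed(sustained_rps, peak_rps):
    """Smallest hive count covering both sustained and peak demand,
    assuming the linear scaling described above."""
    return max(
        math.ceil(sustained_rps / SUSTAINED_PER_HIVE),
        math.ceil(peak_rps / BURST_PER_HIVE),
    )
```

For example, a workload averaging 36 req/s with 75 req/s peaks lands on 3 hives, matching the Professional tier.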

Uptime SLA

Alveare commits to 99.95% monthly uptime for Professional and Scale plans, and 99.9% for Starter plans. Enterprise plans can negotiate 99.99% SLAs. If we miss the target, you receive service credits automatically -- no support ticket required.

SLA credit schedule

| Monthly Uptime | Service Credit | Max Downtime |
|---|---|---|
| 99.95% - 99.99% | Within SLA (no credit) | ~22 minutes |
| 99.0% - 99.95% | 10% credit | ~7 hours |
| 95.0% - 99.0% | 25% credit | ~36 hours |
| Below 95.0% | 50% credit | 36+ hours |
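The "Max Downtime" column follows directly from the uptime percentage. A quick way to check it, assuming a 30-day month:

```python
def max_downtime_minutes(uptime_pct, days=30):
    """Maximum monthly downtime (in minutes) implied by an uptime percentage."""
    return (100 - uptime_pct) / 100 * days * 24 * 60

# 99.95% over a 30-day month allows about 21.6 minutes of downtime,
# matching the ~22 minutes in the credit schedule above.
```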

Uptime History (Last 12 Months)

| Month | Uptime |
|---|---|
| Mar 2026 | 99.98% |
| Feb 2026 | 99.99% |
| Jan 2026 | 99.97% |
| Dec 2025 | 100.0% |
| Nov 2025 | 99.96% |
| Oct 2025 | 99.99% |
| Sep 2025 | 99.92% |
| Aug 2025 | 99.98% |
| Jul 2025 | 99.99% |
| Jun 2025 | 99.97% |
| May 2025 | 100.0% |
| Apr 2025 | 99.98% |

Live uptime data available at status.alveare.ai


Self-Healing Architecture

Alveare's supervision tree architecture is borrowed from Erlang/OTP, the system that powers telecom infrastructure running at 99.999% uptime. Every specialist process runs under a supervisor. If a specialist crashes -- due to OOM, malformed input, GPU error, or any runtime failure -- the supervisor restarts it automatically in under 100 milliseconds.

Supervision tree: a root supervisor oversees the API gateway, the hive pool (specialists such as classify, summarise, and extract), and the health monitor. Crashed specialists are auto-restarted in under 100ms.

Recovery timeline:
- 0ms: crash detected
- <100ms: specialist restarted
- <5s: model reloaded (if needed)
- <3min: instance failover (if needed)

99.7% of crashes recover in under 100ms. Restarts use exponential backoff, with a circuit breaker that opens at 5 failures per 60 seconds.
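The restart policy combines exponential backoff with a circuit breaker. The sketch below illustrates that pattern in Python; the class name, thresholds, and 50ms base delay are assumptions for illustration, not Alveare's actual implementation (which follows the Erlang/OTP model).

```python
import time

class SupervisorPolicy:
    """Illustrative restart policy: exponential backoff between restarts,
    circuit breaker opens at 5 failures within a 60-second window."""
    MAX_FAILURES = 5
    WINDOW_SECONDS = 60
    BASE_DELAY = 0.05  # 50 ms initial backoff (assumed value)

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.failures = []      # timestamps of recent crashes
        self.circuit_open = False

    def on_crash(self):
        """Record a crash. Returns the backoff delay in seconds before the
        next restart, or None once the circuit breaker has opened."""
        now = self.clock()
        # Keep only failures inside the sliding window.
        self.failures = [t for t in self.failures
                         if now - t < self.WINDOW_SECONDS]
        self.failures.append(now)
        if len(self.failures) >= self.MAX_FAILURES:
            self.circuit_open = True
            return None
        # Double the delay on each consecutive failure: 50ms, 100ms, 200ms...
        return self.BASE_DELAY * (2 ** (len(self.failures) - 1))
```

An open circuit stops the restart loop from thrashing a specialist that fails deterministically (e.g. on a poisoned input), escalating the failure to a higher-level supervisor instead.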

Health Monitoring

Beyond crash recovery, the health monitor continuously tracks output quality metrics for each specialist. It detects degradation before it affects your users and takes corrective action automatically.


Auto-Scaling

Alveare's orchestration layer monitors request queue depth, GPU utilization, and latency metrics to scale your hive capacity automatically. When demand increases, additional GPU instances are provisioned from the spot fleet. When demand decreases, excess capacity is released to keep costs low.

Scaling Triggers

Scale-up time is typically 45-90 seconds (instance provisioning + model loading + warmup). During scale-up, existing instances continue serving traffic. Scale-down happens gradually with a 10-minute cooldown to prevent oscillation.
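A simplified decision rule combining these signals might look like the following. The specific thresholds (queue depth, GPU utilization, P95 latency) are hypothetical placeholders, not Alveare's actual trigger values; only the 10-minute cooldown comes from the text above.

```python
def scaling_decision(queue_depth, gpu_util, p95_latency_ms,
                     seconds_since_last_scale_down):
    """Return 'scale_up', 'scale_down', or 'hold' for one evaluation tick.

    Thresholds below are illustrative assumptions; the 600-second
    cooldown matches the 10-minute scale-down cooldown described above.
    """
    COOLDOWN = 600
    # Any pressure signal triggers a scale-up immediately.
    if queue_depth > 50 or gpu_util > 0.85 or p95_latency_ms > 500:
        return "scale_up"
    # Scale down only when fully idle and outside the cooldown window,
    # to prevent oscillation.
    if (queue_depth == 0 and gpu_util < 0.30
            and seconds_since_last_scale_down >= COOLDOWN):
        return "scale_down"
    return "hold"
```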


Spot Termination Handling

Alveare runs inference on GPU spot instances to deliver 60-80% lower infrastructure costs compared to on-demand pricing. Spot instances can be reclaimed by AWS with a 2-minute warning. Alveare handles this transparently with zero dropped requests and zero downtime.

Spot termination handling timeline:
- T+0s: AWS issues the 2-minute termination notice; the Alveare orchestrator receives the interrupt signal.
- T+1s: New requests stop routing to the terminating instance and go to a standby instance (already warm, model loaded).
- T+2-5s: In-flight requests on the terminating instance complete (avg completion time: 300ms, max: 2s).
- T+5s: All traffic is serving from the standby instance; a new spot instance is requested for the standby pool.
- T+45-90s: The new standby instance is provisioned and warmed; the fleet is back to full capacity.

Result: 0 dropped requests, 0 downtime. User-visible impact: none.

The standby pool ensures there is always a warm instance ready to accept traffic. The default standby ratio is 1 standby per 2 active instances. For customers requiring the highest availability, the ratio can be increased to 1:1. Standby instances are also spot instances, so the cost overhead is minimal.
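Sizing the standby pool from the stated ratios is straightforward. This helper is an illustration of the arithmetic, not an Alveare API:

```python
import math

def standby_count(active_instances, ratio=(1, 2)):
    """Warm standby instances for a fleet, using the default
    1 standby per 2 active instances. Pass ratio=(1, 1) for the
    high-availability configuration described above."""
    standby, per_active = ratio
    return math.ceil(active_instances * standby / per_active)
```

So a 4-instance fleet carries 2 standbys by default, and 4 under the 1:1 configuration.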


Caching Layer

Alveare includes a built-in response cache that stores results for identical requests. The cache sits between the API gateway and the inference layer, returning cached responses in under 10ms without consuming GPU compute.

Cache performance by workload type

| Workload | Typical Hit Rate | Cache Hit Latency | Effective Savings |
|---|---|---|---|
| Classification (repetitive inputs) | 25-35% | <5ms | 25-35% fewer GPU requests |
| Entity extraction | 15-20% | <5ms | 15-20% fewer GPU requests |
| Summarization | 10-15% | <5ms | 10-15% fewer GPU requests |
| Chat (unique conversations) | 1-3% | <5ms | Minimal |

Cache TTL is configurable per specialist (default: 1 hour, range: 0 to 24 hours). The cache key is a SHA-256 hash of the specialist name, prompt text, and all generation parameters. Changing any parameter produces a cache miss, ensuring you always get fresh results when parameters change. Cached responses do not count against your monthly request allocation.
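The key derivation can be sketched as follows. The exact serialization Alveare uses is not documented here; sorted-key JSON is one reasonable canonical form, assumed for illustration, and it shows why changing any generation parameter yields a different key.

```python
import hashlib
import json

def cache_key(specialist, prompt, params):
    """SHA-256 cache key over the specialist name, prompt text, and all
    generation parameters. Sorted-key JSON gives a stable serialization
    regardless of parameter order."""
    payload = json.dumps(
        {"specialist": specialist, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```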


Global Infrastructure

Alveare currently operates in AWS us-east-1 (N. Virginia) with eu-west-1 (Ireland) planned for Q2 2026. Enterprise customers can request deployment in any AWS region with GPU instance availability, including ap-northeast-1 (Tokyo) and ap-southeast-1 (Singapore).

us-east-1 (N. Virginia)

Primary region. All plan tiers available. A10G, A100, and H100 GPU types. Full standby pool. This is the lowest-latency option for North American customers.

eu-west-1 (Ireland)

Planned Q2 2026. EU data residency for GDPR compliance. A10G and A100 GPU types. Professional and Scale plans. Contact sales for early access.

Custom Regions

Enterprise customers can deploy in any AWS region with GPU availability. Multi-region active-active or active-passive configurations with automatic failover.

For customers outside the US, latency from Europe to us-east-1 is typically 80-120ms round trip. The eu-west-1 region will reduce this to 10-30ms for European users. The API gateway uses anycast routing to direct traffic to the nearest edge location for TLS termination, minimizing connection overhead.

See the performance for yourself

Start a 7-day free trial and measure latency, throughput, and uptime against your own workload. No credit card required.

Get Started Free