Infrastructure

Performance & Reliability

Production-grade inference with sub-300ms P50 latency, 99.95% uptime SLA, self-healing architecture, and zero-downtime spot termination handling. Measured, not marketed.

<300ms
P50 latency
for standard requests
99.95%
Uptime SLA
with service credits
<100ms
Specialist restart
via supervision trees
0
Dropped requests
during spot termination

Latency Benchmarks

All measurements are from production traffic on NVIDIA A10G GPUs running Mistral 7B v0.2 in 4-bit GPTQ quantization. These are real numbers from live deployments, not synthetic benchmarks. Latency varies by prompt length, output length, and concurrent load.

Latency by specialist type

| Specialist | Avg Input | Avg Output | P50 | P95 | P99 |
|---|---|---|---|---|---|
| Classification | 150 tokens | 15 tokens | 89ms | 145ms | 210ms |
| Entity extraction | 300 tokens | 80 tokens | 165ms | 280ms | 380ms |
| Summarization | 800 tokens | 200 tokens | 290ms | 480ms | 620ms |
| Q&A | 500 tokens | 150 tokens | 240ms | 390ms | 510ms |
| Chat (single turn) | 200 tokens | 300 tokens | 310ms | 520ms | 680ms |
| Code generation | 400 tokens | 500 tokens | 420ms | 650ms | 890ms |

Alveare vs OpenAI GPT-3.5 Turbo latency comparison

| Task | Alveare P50 | GPT-3.5 P50 | Difference |
|---|---|---|---|
| Classification (short output) | 89ms | 350-500ms | 3-5x faster |
| Entity extraction | 165ms | 500-800ms | 3-5x faster |
| Summarization | 290ms | 800-1200ms | 3-4x faster |
| Chat (single turn) | 310ms | 600-1000ms | 2-3x faster |

OpenAI latency varies significantly by load; these figures were measured during business hours from the US East Coast. Alveare latency is more consistent because dedicated infrastructure has no noisy neighbors.
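You can reproduce this kind of comparison against your own workload. The sketch below times repeated requests and reports nearest-rank percentiles; `send_request` is a placeholder for whatever client call you use, not an Alveare SDK function.

```python
import time

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

def benchmark(send_request, n=200):
    """Time n sequential requests and report P50/P95/P99 in milliseconds.

    `send_request` is any zero-argument callable that performs one
    inference call, e.g. a wrapper around your HTTP client.
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        samples.append((time.perf_counter() - start) * 1000)
    return {p: percentile(samples, p) for p in (50, 95, 99)}
```

Sequential timing isolates per-request latency; measure under concurrent load separately, since P95/P99 degrade first as utilization rises.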


Throughput

Throughput depends on your plan tier, model size, and the number of hives allocated. Each hive processes requests in parallel across its GPU allocation. Adding hives scales throughput linearly.

Throughput by plan tier

| Plan | Hives | Sustained req/s | Burst req/s | GPU Allocation |
|---|---|---|---|---|
| Starter | 1 | 12 req/s | 25 req/s | 1x A10G (24 GB) |
| Professional | 3 | 36 req/s | 75 req/s | 3x A10G (72 GB) |
| Scale | 10 | 120 req/s | 250 req/s | 10x A10G (240 GB) |
| Enterprise | Custom | Custom | Custom | A100 / H100 available |
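Because throughput scales linearly with hive count, capacity planning reduces to simple arithmetic. This sketch sizes a fleet from the published per-hive rates (12 req/s sustained, 25 req/s burst per A10G-backed hive); the function name is illustrative, not part of any Alveare tooling.

```python
import math

# Published per-hive rates for A10G-backed hives.
SUSTAINED_PER_HIVE = 12
BURST_PER_HIVE = 25

def hives_needed(sustained_rps, peak_rps):
    """Smallest hive count covering both sustained and peak demand,
    assuming the linear scaling described above."""
    return max(
        math.ceil(sustained_rps / SUSTAINED_PER_HIVE),
        math.ceil(peak_rps / BURST_PER_HIVE),
    )
```

For example, a workload averaging 36 req/s with 75 req/s peaks lands on 3 hives, matching the Professional tier.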

Uptime SLA

Alveare commits to 99.95% monthly uptime for Professional and Scale plans, and 99.9% for Starter plans. Enterprise plans can negotiate 99.99% SLAs. If we miss the target, you receive service credits automatically -- no support ticket required.

SLA credit schedule

| Monthly Uptime | Service Credit | Max Downtime |
|---|---|---|
| 99.95% - 99.99% | Within SLA (no credit) | ~22 minutes |
| 99.0% - 99.95% | 10% credit | ~7 hours |
| 95.0% - 99.0% | 25% credit | ~36 hours |
| Below 95.0% | 50% credit | 36+ hours |
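The "Max Downtime" column follows directly from the uptime percentage. A quick way to check it, assuming a 30-day month:

```python
def max_downtime_minutes(uptime_pct, days=30):
    """Maximum monthly downtime (in minutes) implied by an uptime percentage."""
    return (100 - uptime_pct) / 100 * days * 24 * 60

# 99.95% over a 30-day month allows about 21.6 minutes of downtime,
# matching the ~22 minutes in the credit schedule above.
```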

Uptime History (Last 12 Months)

| Month | Uptime |
|---|---|
| Mar 2026 | 99.98% |
| Feb 2026 | 99.99% |
| Jan 2026 | 99.97% |
| Dec 2025 | 100.0% |
| Nov 2025 | 99.96% |
| Oct 2025 | 99.99% |
| Sep 2025 | 99.92% |
| Aug 2025 | 99.98% |
| Jul 2025 | 99.99% |
| Jun 2025 | 99.97% |
| May 2025 | 100.0% |
| Apr 2025 | 99.98% |

Live uptime data available at status.alveare.ai


Self-Healing Architecture

Alveare's supervision tree architecture is borrowed from Erlang/OTP, the system that powers telecom infrastructure running at 99.999% uptime. Every specialist process runs under a supervisor. If a specialist crashes -- due to OOM, malformed input, GPU error, or any runtime failure -- the supervisor restarts it automatically in under 100 milliseconds.

Supervision tree: a root supervisor oversees the API gateway, the hive pool (specialists such as classify, summarise, and extract), and the health monitor. Crashed specialists are auto-restarted in under 100ms.

Recovery timeline:
- 0ms: crash detected
- <100ms: specialist restarted
- <5s: model reloaded (if needed)
- <3min: instance failover (if needed)

99.7% of crashes recover in under 100ms. Restarts use exponential backoff, with a circuit breaker that opens at 5 failures per 60 seconds.
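The restart policy combines exponential backoff with a circuit breaker. The sketch below illustrates that pattern in Python; the class name, thresholds, and 50ms base delay are assumptions for illustration, not Alveare's actual implementation (which follows the Erlang/OTP model).

```python
import time

class SupervisorPolicy:
    """Illustrative restart policy: exponential backoff between restarts,
    circuit breaker opens at 5 failures within a 60-second window."""
    MAX_FAILURES = 5
    WINDOW_SECONDS = 60
    BASE_DELAY = 0.05  # 50 ms initial backoff (assumed value)

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.failures = []      # timestamps of recent crashes
        self.circuit_open = False

    def on_crash(self):
        """Record a crash. Returns the backoff delay in seconds before the
        next restart, or None once the circuit breaker has opened."""
        now = self.clock()
        # Keep only failures inside the sliding window.
        self.failures = [t for t in self.failures
                         if now - t < self.WINDOW_SECONDS]
        self.failures.append(now)
        if len(self.failures) >= self.MAX_FAILURES:
            self.circuit_open = True
            return None
        # Double the delay on each consecutive failure: 50ms, 100ms, 200ms...
        return self.BASE_DELAY * (2 ** (len(self.failures) - 1))
```

An open circuit stops the restart loop from thrashing a specialist that fails deterministically (e.g. on a poisoned input), escalating the failure to a higher-level supervisor instead.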

Health Monitoring

Beyond crash recovery, the health monitor continuously tracks output quality metrics for each specialist. It detects degradation before it affects your users and takes corrective action automatically.


Auto-Scaling

Alveare's orchestration layer monitors request queue depth, GPU utilization, and latency metrics to scale your hive capacity automatically. When demand increases, additional GPU instances are provisioned from the spot fleet. When demand decreases, excess capacity is released to keep costs low.

Scaling Triggers

Scale-up time is typically 45-90 seconds (instance provisioning + model loading + warmup). During scale-up, existing instances continue serving traffic. Scale-down happens gradually with a 10-minute cooldown to prevent oscillation.
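A simplified decision rule combining these signals might look like the following. The specific thresholds (queue depth, GPU utilization, P95 latency) are hypothetical placeholders, not Alveare's actual trigger values; only the 10-minute cooldown comes from the text above.

```python
def scaling_decision(queue_depth, gpu_util, p95_latency_ms,
                     seconds_since_last_scale_down):
    """Return 'scale_up', 'scale_down', or 'hold' for one evaluation tick.

    Thresholds below are illustrative assumptions; the 600-second
    cooldown matches the 10-minute scale-down cooldown described above.
    """
    COOLDOWN = 600
    # Any pressure signal triggers a scale-up immediately.
    if queue_depth > 50 or gpu_util > 0.85 or p95_latency_ms > 500:
        return "scale_up"
    # Scale down only when fully idle and outside the cooldown window,
    # to prevent oscillation.
    if (queue_depth == 0 and gpu_util < 0.30
            and seconds_since_last_scale_down >= COOLDOWN):
        return "scale_down"
    return "hold"
```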


Spot Termination Handling

Alveare runs inference on GPU spot instances to deliver 60-80% lower infrastructure costs compared to on-demand pricing. Spot instances can be reclaimed by AWS with a 2-minute warning. Alveare handles this transparently with zero dropped requests and zero downtime.

Spot termination handling timeline:
- T+0s: AWS issues the 2-minute termination notice; the Alveare orchestrator receives the interrupt signal.
- T+1s: New requests stop routing to the terminating instance and go to a standby instance (already warm, model loaded).
- T+2-5s: In-flight requests on the terminating instance complete (avg completion time: 300ms, max: 2s).
- T+5s: All traffic is serving from the standby instance; a new spot instance is requested for the standby pool.
- T+45-90s: The new standby instance is provisioned and warmed; the fleet is back to full capacity.

Result: 0 dropped requests, 0 downtime. User-visible impact: none.

The standby pool ensures there is always a warm instance ready to accept traffic. The default standby ratio is 1 standby per 2 active instances. For customers requiring the highest availability, the ratio can be increased to 1:1. Standby instances are also spot instances, so the cost overhead is minimal.
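Sizing the standby pool from the stated ratios is straightforward. This helper is an illustration of the arithmetic, not an Alveare API:

```python
import math

def standby_count(active_instances, ratio=(1, 2)):
    """Warm standby instances for a fleet, using the default
    1 standby per 2 active instances. Pass ratio=(1, 1) for the
    high-availability configuration described above."""
    standby, per_active = ratio
    return math.ceil(active_instances * standby / per_active)
```

So a 4-instance fleet carries 2 standbys by default, and 4 under the 1:1 configuration.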


Caching Layer

Alveare includes a built-in response cache that stores results for identical requests. The cache sits between the API gateway and the inference layer, returning cached responses in under 10ms without consuming GPU compute.

Cache performance by workload type

| Workload | Typical Hit Rate | Cache Hit Latency | Effective Savings |
|---|---|---|---|
| Classification (repetitive inputs) | 25-35% | <5ms | 25-35% fewer GPU requests |
| Entity extraction | 15-20% | <5ms | 15-20% fewer GPU requests |
| Summarization | 10-15% | <5ms | 10-15% fewer GPU requests |
| Chat (unique conversations) | 1-3% | <5ms | Minimal |

Cache TTL is configurable per specialist (default: 1 hour, range: 0 to 24 hours). The cache key is a SHA-256 hash of the specialist name, prompt text, and all generation parameters. Changing any parameter produces a cache miss, ensuring you always get fresh results when parameters change. Cached responses do not count against your monthly request allocation.
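The key derivation can be sketched as follows. The exact serialization Alveare uses is not documented here; sorted-key JSON is one reasonable canonical form, assumed for illustration, and it shows why changing any generation parameter yields a different key.

```python
import hashlib
import json

def cache_key(specialist, prompt, params):
    """SHA-256 cache key over the specialist name, prompt text, and all
    generation parameters. Sorted-key JSON gives a stable serialization
    regardless of parameter order."""
    payload = json.dumps(
        {"specialist": specialist, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```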


Global Infrastructure

Alveare currently operates in AWS us-east-1 (N. Virginia) with eu-west-1 (Ireland) planned for Q2 2026. Enterprise customers can request deployment in any AWS region with GPU instance availability, including ap-northeast-1 (Tokyo) and ap-southeast-1 (Singapore).

us-east-1 (N. Virginia)

Primary region. All plan tiers available. A10G, A100, and H100 GPU types. Full standby pool. This is the lowest-latency option for North American customers.

eu-west-1 (Ireland)

Planned Q2 2026. EU data residency for GDPR compliance. A10G and A100 GPU types. Professional and Scale plans. Contact sales for early access.

Custom Regions

Enterprise customers can deploy in any AWS region with GPU availability. Multi-region active-active or active-passive configurations with automatic failover.

For customers outside the US, latency from Europe to us-east-1 is typically 80-120ms round trip. The eu-west-1 region will reduce this to 10-30ms for European users. The API gateway uses anycast routing to direct traffic to the nearest edge location for TLS termination, minimizing connection overhead.

See the performance for yourself

Start a 7-day free trial and measure latency, throughput, and uptime against your own workload. No credit card required.

Get Started Free