Production-grade inference with sub-300ms P50 latency, 99.95% uptime SLA, self-healing architecture, and zero-downtime spot termination handling. Measured, not marketed.
All measurements come from production traffic on NVIDIA A10G GPUs running Mistral 7B v0.2 with 4-bit GPTQ quantization. These are real numbers from live deployments, not synthetic benchmarks; latency varies with prompt length, output length, and concurrent load.
| Specialist | Avg Input | Avg Output | P50 | P95 | P99 |
|---|---|---|---|---|---|
| Classification | 150 tokens | 15 tokens | 89ms | 145ms | 210ms |
| Entity extraction | 300 tokens | 80 tokens | 165ms | 280ms | 380ms |
| Summarization | 800 tokens | 200 tokens | 290ms | 480ms | 620ms |
| Q&A | 500 tokens | 150 tokens | 240ms | 390ms | 510ms |
| Chat (single turn) | 200 tokens | 300 tokens | 310ms | 520ms | 680ms |
| Code generation | 400 tokens | 500 tokens | 420ms | 650ms | 890ms |
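The P50/P95/P99 columns are percentiles over per-request latency samples. A minimal sketch of one common definition (nearest-rank percentile); the sample data below is illustrative, not from the table:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    ranked = sorted(samples)
    k = max(0, -(-len(ranked) * p // 100) - 1)  # ceil(n*p/100) - 1, clamped
    return ranked[int(k)]

# Illustrative per-request latencies in milliseconds
latencies_ms = [82, 85, 89, 90, 95, 110, 130, 145, 160, 210]
p50 = percentile(latencies_ms, 50)  # median request
p95 = percentile(latencies_ms, 95)  # tail latency
```

Tail percentiles (P95/P99) matter more than averages for user-facing latency, which is why the tables report them explicitly.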
| Task | Alveare P50 | GPT-3.5 P50 | Difference |
|---|---|---|---|
| Classification (short output) | 89ms | 350-500ms | 3-5x faster |
| Entity extraction | 165ms | 500-800ms | 3-5x faster |
| Summarization | 290ms | 800-1200ms | 3-4x faster |
| Chat (single turn) | 310ms | 600-1000ms | 2-3x faster |
OpenAI latency varies significantly with load; the GPT-3.5 figures above were measured during business hours from the US East Coast. Alveare latency is more consistent because requests run on dedicated infrastructure with no noisy neighbors.
Throughput depends on your plan tier, model size, and the number of hives allocated. Each hive processes requests in parallel across its GPU allocation. Adding hives scales throughput linearly.
| Plan | Hives | Sustained req/s | Burst req/s | GPU Allocation |
|---|---|---|---|---|
| Starter | 1 | 12 req/s | 25 req/s | 1x A10G (24 GB) |
| Professional | 3 | 36 req/s | 75 req/s | 3x A10G (72 GB) |
| Scale | 10 | 120 req/s | 250 req/s | 10x A10G (240 GB) |
| Enterprise | Custom | Custom | Custom | A100 / H100 available |
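Since throughput scales linearly with hive count, the table above can be used to size a plan. A minimal sketch, assuming the ~12 req/s sustained rate per hive implied by the Starter tier (the headroom factor is an assumption, not an Alveare default):

```python
import math

PER_HIVE_SUSTAINED_RPS = 12  # from the plan table: 1 hive sustains ~12 req/s

def hives_needed(target_rps: float, headroom: float = 0.2) -> int:
    """Hives required for a target sustained rate, with headroom for bursts.
    Throughput scales linearly with hive count."""
    return math.ceil(target_rps * (1 + headroom) / PER_HIVE_SUSTAINED_RPS)

hives_needed(100)  # 100 req/s with 20% headroom -> 10 hives (Scale plan)
```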
Alveare commits to 99.95% monthly uptime for Professional and Scale plans, and 99.9% for Starter plans. Enterprise plans can negotiate 99.99% SLAs. If we miss the target, you receive service credits automatically -- no support ticket required.
| Monthly Uptime | Service Credit | Max Monthly Downtime |
|---|---|---|
| 99.95% or above | Within SLA (no credit) | ~22 minutes |
| 99.0% - 99.95% | 10% credit | ~7 hours |
| 95.0% - 99.0% | 25% credit | ~36 hours |
| Below 95.0% | 50% credit | 36+ hours |
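The downtime thresholds in the table follow directly from the uptime percentages; a quick check, assuming a 30-day month:

```python
def max_monthly_downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month at a given uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - uptime_pct / 100)

max_monthly_downtime_minutes(99.95)  # ~21.6 minutes
max_monthly_downtime_minutes(99.0)   # ~432 minutes, i.e. ~7.2 hours
```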
Live uptime data is available at status.alveare.ai.
Alveare's supervision tree architecture is borrowed from Erlang/OTP, the framework behind telecom systems that run at 99.999% uptime. Every specialist process runs under a supervisor. If a specialist crashes -- due to OOM, malformed input, a GPU error, or any other runtime failure -- the supervisor restarts it automatically in under 100 milliseconds.
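The restart loop can be sketched in the style of an OTP one-for-one supervisor. This Python sketch is illustrative only (the restart-intensity limit and backoff are assumptions, not Alveare internals): restart a crashed specialist, but escalate if it crashes repeatedly within a short window:

```python
import time

def supervise(start_specialist, max_restarts=5, window_s=60):
    """One-for-one supervisor: restart a crashed specialist, escalate if it
    crashes more than max_restarts times within window_s (restart intensity,
    as in Erlang/OTP)."""
    restarts = []
    while True:
        try:
            start_specialist()          # blocks until the specialist exits
            return                      # clean exit: nothing to restart
        except Exception:
            now = time.monotonic()
            restarts = [t for t in restarts if now - t < window_s]
            restarts.append(now)
            if len(restarts) > max_restarts:
                raise RuntimeError("restart intensity exceeded; escalating")
            time.sleep(0.05)            # brief backoff before restart
```

The restart-intensity check mirrors OTP's design: a specialist that crash-loops is a different failure mode than a one-off crash, and gets escalated rather than restarted forever.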
Beyond crash recovery, the health monitor continuously tracks output quality metrics for each specialist. It detects degradation before it affects your users and takes corrective action automatically.
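One way to detect that kind of degradation is to compare a rolling quality score against a frozen baseline. The sketch below is an assumption about how such a monitor could work; the window size, threshold, and scoring scale are illustrative, not Alveare's actual metrics:

```python
from collections import deque

class HealthMonitor:
    """Track a rolling quality score per specialist; flag degradation when
    the recent average drops below a fraction of the warmup baseline."""
    def __init__(self, window=50, degraded_ratio=0.8):
        self.scores = deque(maxlen=window)
        self.baseline = None
        self.degraded_ratio = degraded_ratio

    def record(self, score: float) -> bool:
        """Record one output-quality score in [0, 1]; return True if the
        specialist now looks degraded and needs corrective action."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        if self.baseline is None and len(self.scores) == self.scores.maxlen:
            self.baseline = avg  # freeze the baseline once warmed up
        return (self.baseline is not None
                and avg < self.baseline * self.degraded_ratio)
```

Because the check runs on every recorded output, a slow drift in quality trips the threshold long before a human would notice it in production traffic.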
Alveare's orchestration layer monitors request queue depth, GPU utilization, and latency metrics to scale your hive capacity automatically. When demand increases, additional GPU instances are provisioned from the spot fleet. When demand decreases, excess capacity is released to keep costs low.
Scale-up time is typically 45-90 seconds (instance provisioning + model loading + warmup). During scale-up, existing instances continue serving traffic. Scale-down happens gradually with a 10-minute cooldown to prevent oscillation.
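The scaling rule can be sketched as a pure decision function over the signals listed above. The thresholds here are assumptions for illustration (Alveare's actual policy is internal); only the 10-minute cooldown comes from the text:

```python
import time

def scale_decision(queue_depth, gpu_util, active, last_scale_down,
                   now=None, cooldown_s=600):
    """Illustrative autoscaling rule: scale up when the queue backs up or
    GPUs saturate; scale down only after a 10-minute cooldown to prevent
    oscillation. Returns the desired instance count."""
    now = time.monotonic() if now is None else now
    if queue_depth > active * 4 or gpu_util > 0.85:
        return active + 1                       # provision one more instance
    if gpu_util < 0.30 and active > 1 and now - last_scale_down > cooldown_s:
        return active - 1                       # release excess capacity
    return active                               # steady state
```

Keeping the decision separate from the provisioning side effect makes the policy easy to test and tune without touching the fleet.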
Alveare runs inference on GPU spot instances to deliver 60-80% lower infrastructure costs compared to on-demand pricing. Spot instances can be reclaimed by AWS with a 2-minute warning. Alveare handles this transparently with zero dropped requests and zero downtime.
The standby pool ensures there is always a warm instance ready to accept traffic. The default standby ratio is 1 standby per 2 active instances. For customers requiring the highest availability, the ratio can be increased to 1:1. Standby instances are also spot instances, so the cost overhead is minimal.
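AWS delivers the 2-minute warning through the instance metadata service: the `spot/instance-action` endpoint returns 404 until a reclaim is scheduled. A sketch of the drain sequence, assuming IMDSv1-style metadata access (production code would use IMDSv2 session tokens) and with `drain`/`promote_standby` as hypothetical callbacks:

```python
import urllib.request

# Spot interruption notice endpoint (EC2 instance metadata).
# 404 until a reclaim is scheduled, then a JSON body with the action time.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    """True once AWS has scheduled this spot instance for reclaim."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1):
            return True          # 200: termination scheduled, ~2 min left
    except Exception:
        return False             # 404 / unreachable: no reclaim pending

def handle_interruption(drain, promote_standby):
    """Sketch of the drain sequence: stop accepting new work, finish
    in-flight requests, and shift traffic to a warm standby before the
    instance dies -- zero dropped requests."""
    if interruption_pending():
        drain()              # deregister from the load balancer, finish work
        promote_standby()    # warm standby instance takes over
```

The two-minute window is generous relative to typical request durations, which is what makes a clean drain (rather than request retries) feasible.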
Alveare includes a built-in response cache that stores results for identical requests. The cache sits between the API gateway and the inference layer, returning cached responses in under 10ms without consuming GPU compute.
| Workload | Typical Hit Rate | Cache Hit Latency | Effective Savings |
|---|---|---|---|
| Classification (repetitive inputs) | 25-35% | <5ms | 25-35% fewer GPU requests |
| Entity extraction | 15-20% | <5ms | 15-20% fewer GPU requests |
| Summarization | 10-15% | <5ms | 10-15% fewer GPU requests |
| Chat (unique conversations) | 1-3% | <5ms | Minimal |
Cache TTL is configurable per specialist (default: 1 hour, range: 0 to 24 hours). The cache key is a SHA-256 hash of the specialist name, prompt text, and all generation parameters. Changing any parameter produces a cache miss, ensuring you always get fresh results when parameters change. Cached responses do not count against your monthly request allocation.
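The key derivation described above can be sketched as follows; the exact payload field names and serialization are assumptions, but the principle -- hash the specialist name, prompt, and all generation parameters so any change produces a miss -- matches the text:

```python
import hashlib
import json

def cache_key(specialist: str, prompt: str, params: dict) -> str:
    """SHA-256 over specialist name, prompt text, and generation parameters.
    Keys are serialized in sorted order so dict ordering never changes the
    hash; changing any value produces a different key (a cache miss)."""
    payload = json.dumps(
        {"specialist": specialist, "prompt": prompt, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("classifier", "Is this spam?", {"temperature": 0.0, "max_tokens": 15})
k2 = cache_key("classifier", "Is this spam?", {"temperature": 0.2, "max_tokens": 15})
# k1 != k2: changing any generation parameter changes the key
```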
Alveare currently operates in AWS us-east-1 (N. Virginia) with eu-west-1 (Ireland) planned for Q2 2026. Enterprise customers can request deployment in any AWS region with GPU instance availability, including ap-northeast-1 (Tokyo) and ap-southeast-1 (Singapore).
Primary region. All plan tiers available. A10G, A100, and H100 GPU types. Full standby pool. This is the lowest-latency option for North American customers.
Planned Q2 2026. EU data residency for GDPR compliance. A10G and A100 GPU types. Professional and Scale plans. Contact sales for early access.
Enterprise customers can deploy in any AWS region with GPU availability. Multi-region active-active or active-passive configurations with automatic failover.
For customers outside the US, network distance adds up: round-trip latency from Europe to us-east-1 is typically 80-120ms, and the planned eu-west-1 region will cut this to 10-30ms for European users. The API gateway uses anycast routing to direct traffic to the nearest edge location for TLS termination, minimizing connection overhead.
Start a 7-day free trial and measure latency, throughput, and uptime against your own workload. No credit card required.
Get Started Free