
Vitess Database Cost Estimation

Overview

This document estimates Vitess database and Kubernetes cluster costs for storing metadata and indexing for 1 billion Wikibase entities at scale.

Important Note: All calculations use public Google Cloud (GKE) list pricing for estimates. Expect at least 10% overhead when running on Wikimedia Foundation (WMF) infrastructure due to additional operational costs, internal services, and compliance requirements.

Executive Summary

Vitess strategy: Store pointers and metadata only (entity content in S3)

Key findings:
- Total data: 5.05 TB (Year 1) → 12.6 TB (Year 10)
- Cluster: 16 nodes, 64 cores, 128 GB RAM
- Sharding: start with 24 shards, scale to 64 by Year 10
- Query load: 40-110 QPS at Vitess (driven by Wikidata's ~10 edits/sec)
- Annual cost: ~$24,096 (plus 10% WMF overhead = ~$26,506)
- 10-year cost: ~$240,960 (plus 10% WMF overhead = ~$265,060)

Comparison to S3 storage costs:
- Vitess: ~12% of S3 storage costs at Year 10 scale
- Highly cost-effective as a metadata/indexing layer

Data Model Analysis

Vitess Tables (from STORAGE-ARCHITECTURE.md)

Vitess stores pointers and metadata, never entity content (stored in S3).

| Table | Purpose | Rows | Row Size | Total Size |
|-------|---------|------|----------|------------|
| entity_head | Current revision pointer | 1B | 50 bytes | 50 GB |
| entity_revisions | Revision history | 20B (avg 20/entity) | 100 bytes | 2.0 TB |
| entity_revision_meta | Validation metadata | 20B | 150 bytes | 3.0 TB |
| Total | - | - | - | ~5.05 TB |
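The table totals can be reproduced with a back-of-envelope calculation. Row counts and row sizes come from the table above; sizes use decimal units (1 TB = 10^12 bytes) and ignore index and page overhead.

```python
# Sanity-check the Vitess table sizes: rows x row size, summed across tables.
TABLES = {
    # name: (rows, row_size_bytes)
    "entity_head": (1_000_000_000, 50),
    "entity_revisions": (20_000_000_000, 100),
    "entity_revision_meta": (20_000_000_000, 150),
}

def table_size_tb(rows: int, row_bytes: int) -> float:
    """Raw data size in TB (decimal), ignoring index/page overhead."""
    return rows * row_bytes / 1e12

total_tb = sum(table_size_tb(rows, size) for rows, size in TABLES.values())
print(f"Total: {total_tb:.2f} TB")  # Total: 5.05 TB
```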

Growth Model

Growth based on entity count:

| Year | Entities | Revisions | Total Data | Per Shard (24 shards) |
|------|----------|-----------|------------|-----------------------|
| Year 1 | 1B | 20B | 5.05 TB | ~210 GB |
| Year 2 | 1.1B | 22B | 5.56 TB | ~232 GB |
| Year 3 | 1.21B | 24.2B | 6.11 TB | ~255 GB |
| Year 5 | 1.5B | 30B | 7.58 TB | ~316 GB |
| Year 10 | 2.5B | 50B | 12.6 TB | ~525 GB |

Vitess recommendation: 250GB per MySQL server (shard) for 15-minute recovery time from backup.
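The growth rows above follow from a single ratio: ~5.05 TB per 1B entities (20 revisions each). A small sketch of that projection, with the 24-shard split:

```python
# Project total data and per-shard size from entity counts.
# Assumes ~5.05 TB per 1B entities (20 revisions/entity), per the data model above.
TB_PER_BILLION_ENTITIES = 5.05
SHARDS = 24

def projection(entities_billions: float) -> tuple[float, float]:
    """Return (total TB, GB per shard) for a given entity count in billions."""
    total_tb = entities_billions * TB_PER_BILLION_ENTITIES
    per_shard_gb = total_tb * 1000 / SHARDS
    return total_tb, per_shard_gb

for year, entities in [(1, 1.0), (5, 1.5), (10, 2.5)]:
    total, per_shard = projection(entities)
    print(f"Year {year}: {total:.2f} TB total, ~{per_shard:.0f} GB/shard")
```

Year 10 lands at ~526 GB/shard by this math (the table rounds to ~525 GB), well above the 250 GB target, which is why resharding is planned before then.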

Sharding Strategy

Vitess Sharding Recommendations

From Vitess official documentation:
- Recommended shard size: 250 GB per MySQL server
- Reasoning: fast recovery times (~15 minutes), manageable backup/restore operations
- Hard limit: 64 TB per InnoDB tablespace (not a practical constraint)

Initial Sharding (Year 1)

Starting configuration:
- Total data: 5.05 TB
- Shard size target: 250 GB
- Shards needed: 5.05 TB ÷ 250 GB = 20.2
- Deployment: 24 shards (for headroom and balanced distribution)

Per-shard capacity:
- Data per shard: 5.05 TB ÷ 24 = ~210 GB
- Headroom: 40 GB (16%)
- Status: comfortable for 1-2 years without resharding

Future Sharding (Year 10)

Growth to 2.5B entities:
- Total data: 12.6 TB
- Shards needed: 12.6 TB ÷ 250 GB = 50.4
- Deployment: 64 shards (power-of-two scaling)

Resharding strategy:
- Incremental: 24 → 32 → 48 → 64 shards
- Timeline: reshard every 2-3 years as data grows
- Method: Vitess built-in resharding tools
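The shard-count arithmetic above reduces to a ceiling division against the 250 GB target; the deployed count is then rounded up further for headroom (24 at launch) or to a power of two (64 at Year 10):

```python
# Minimum shard count to keep each shard under the 250 GB target.
import math

TARGET_SHARD_GB = 250

def min_shards(total_tb: float) -> int:
    """Ceiling of total data over the per-shard target."""
    return math.ceil(total_tb * 1000 / TARGET_SHARD_GB)

print(min_shards(5.05))   # 21 -> deployed as 24 for headroom
print(min_shards(12.6))   # 51 -> deployed as 64 (power-of-two scaling)
```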

Query/Write Load Estimation

Wikidata Edit Rate

Actual Wikidata measurements:
- Edits/sec: ~10 (measured)
- Edits/day: ~864,000
- Edits/year: ~315 million

Write Operations

Per edit:
1. Validate entity JSON
2. Assign next revision_id
3. Write snapshot to S3
4. Insert revision metadata into Vitess (1 row)
5. Update entity_head (1 row, CAS operation)
6. Emit change event
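The six steps can be sketched as a single function. This is a minimal in-memory illustration, not the actual implementation: `FakeS3`, `FakeVitess`, and the helper names are hypothetical stand-ins invented here.

```python
# In-memory sketch of the per-edit write path above. All classes and
# method names are hypothetical placeholders, not real Vitess/S3 APIs.
import json

class ConflictError(Exception):
    pass

class FakeS3:
    def __init__(self):
        self.objects = {}
    def put(self, key, body):
        self.objects[key] = body

class FakeVitess:
    def __init__(self):
        self.head = {}       # entity_id -> current revision_id (entity_head)
        self.revisions = []  # (entity_id, revision_id) metadata rows
    def next_revision_id(self, entity_id):
        return self.head.get(entity_id, 0) + 1
    def insert_revision_meta(self, entity_id, rev_id):
        self.revisions.append((entity_id, rev_id))
    def cas_update_head(self, entity_id, expected, new):
        if self.head.get(entity_id, 0) != expected:
            return False  # a concurrent writer advanced head first
        self.head[entity_id] = new
        return True

def save_revision(entity_id, entity_json, s3, vitess, emit):
    json.loads(entity_json)                          # 1. validate entity JSON
    rev = vitess.next_revision_id(entity_id)         # 2. assign next revision_id
    s3.put(f"{entity_id}/{rev}.json", entity_json)   # 3. write snapshot to S3
    vitess.insert_revision_meta(entity_id, rev)      # 4. insert revision metadata
    if not vitess.cas_update_head(entity_id, rev - 1, rev):  # 5. CAS head update
        raise ConflictError(entity_id)
    emit("entity.revision", entity_id, rev)          # 6. emit change event
    return rev

s3, vt, events = FakeS3(), FakeVitess(), []
rev = save_revision("Q42", '{"labels": {}}', s3, vt, lambda *e: events.append(e))
print(rev, vt.head["Q42"])  # 1 1
```

The CAS update in step 5 is what makes concurrent edits safe: only the writer whose `expected` matches the current head wins; the loser gets a conflict and must retry against the new head.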

Write QPS:
- Entity revision inserts: 10 QPS
- Entity head updates: 10 QPS
- Validation metadata: 10 QPS
- Total writes: 30 QPS
- Peak with 2x headroom: 60 QPS

Read Operations

API request patterns (from CACHING-STRATEGY.md):
- 80% cache hit rate (conservative)
- 20% of queries reach Vitess
- CDN cache reduces S3 GET operations by 80%

Read QPS:
- Total API requests: 50 QPS
- Cache hits (Valkey + application cache): 40 QPS (80%)
- Vitess queries: 10 QPS (20% cache miss)
- Worst case (no cache): 50 QPS to Vitess

Total Vitess QPS

| Scenario | Reads | Writes | Total |
|----------|-------|--------|-------|
| Normal (80% cache hit) | 10 | 30 | 40 |
| Peak (50% cache hit) | 25 | 60 | 85 |
| Worst case (0% cache hit) | 50 | 60 | 110 |
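The scenario table follows from two inputs: the read QPS that misses the cache, and the write QPS (normal or peak). A quick reproduction:

```python
# Reproduce the Vitess QPS scenarios: only cache misses reach Vitess for
# reads; writes always do (3 Vitess writes per edit at ~10 edits/sec).
API_READ_QPS = 50
WRITE_QPS = 30        # 10 edits/sec x 3 Vitess writes each
PEAK_WRITE_QPS = 60   # with 2x headroom

def vitess_qps(cache_hit_rate: float, writes: int) -> int:
    reads = API_READ_QPS * (1 - cache_hit_rate)
    return round(reads + writes)

print(vitess_qps(0.8, WRITE_QPS))        # normal: 40
print(vitess_qps(0.5, PEAK_WRITE_QPS))   # peak: 85
print(vitess_qps(0.0, PEAK_WRITE_QPS))   # worst case: 110
```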

Kubernetes Cluster Requirements

Vitess Component Sizing

From Vitess FAQ and best practices:

| Component | Purpose | Cores per Unit | Units | Total Cores |
|-----------|---------|----------------|-------|-------------|
| VTGate | Query routing | 2-4 | 2 | 4 |
| VTTablet + MySQL | Shard storage | 1-2 | 24 | 36 |
| vtctld | Schema management | 1 | 1 | 1 |
| vtorc | Tablet orchestration | 0.5 | 1 | 0.5 |
| Total | - | - | - | 41.5 |

Memory sizing (Vitess recommendation: 4 GB per 250GB shard):

| Component | RAM per Unit | Units | Total RAM |
|-----------|--------------|-------|-----------|
| VTGate | 4 GB | 2 | 8 GB |
| VTTablet + MySQL | 4 GB | 24 | 96 GB |
| vtctld | 2 GB | 1 | 2 GB |
| vtorc | 2 GB | 1 | 2 GB |
| Total | - | - | 108 GB |

Pod Configuration

Pod requirements:

| Component | Replicas | CPU | Memory | Storage per Pod |
|-----------|----------|-----|--------|-----------------|
| VTGate | 2 | 2 cores | 4 GB | - |
| VTTablet Primary | 12 | 1.5 cores | 4 GB | 600 GB |
| VTTablet Replica | 12 | 1.5 cores | 4 GB | 600 GB |
| vtctld | 1 | 1 core | 2 GB | - |
| vtorc | 1 | 0.5 cores | 2 GB | - |
| Total | 28 | 41.5 cores | 108 GB | - |

Node Requirements

Node sizing (conservative):

| Node Type | Count | CPU | Memory | Purpose |
|-----------|-------|-----|--------|---------|
| Database nodes | 12 | 4 cores × 12 = 48 cores | 8 GB × 12 = 96 GB | 1 primary + 1 replica per shard |
| Application nodes | 2 | 8 cores × 2 = 16 cores | 16 GB × 2 = 32 GB | VTGate, vtctld, vtorc |
| Total | 14 nodes | 64 cores | 128 GB | - |

Recommendation: Use 16 nodes (2 additional for maintenance/HA)
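The node-table totals are simple sums of count × per-node resources; the 2 extra nodes for maintenance/HA sit on top of the 14-node baseline:

```python
# Verify the node-table totals: 12 four-core database nodes plus
# 2 eight-core application nodes (the 2 HA spares are extra capacity).
NODES = [
    # (count, cores_per_node, ram_gb_per_node)
    (12, 4, 8),    # database nodes
    (2, 8, 16),    # application nodes
]

total_cores = sum(count * cores for count, cores, _ in NODES)
total_ram = sum(count * ram for count, _, ram in NODES)
print(total_cores, total_ram)  # 64 128
```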

Storage per Node

Database node storage:
- Shard size: 250 GB (recommended maximum)
- MySQL data directory: 250 GB
- Binlog/WAL: 50 GB
- Temporary space: 50 GB
- Total per node: ~600 GB

Total cluster storage: 12 nodes × 600 GB = 7.2 TB

Note: Backups stored separately (not included in node sizing)

Cost Estimation

Compute Costs (GKE, us-central1)

Pricing: n1-standard-4 (4 vCPU, 16 GB RAM) at $0.10/hour

| Node Type | Count | Hourly Cost | Monthly Cost |
|-----------|-------|-------------|--------------|
| Database nodes | 12 | 12 × $0.10 = $1.20 | $876 |
| Application nodes | 2 | 2 × $0.10 = $0.20 | $146 |
| Total | 14 | $1.40/hour | $1,022/month |

With 16 nodes (including 2 extra):
- Hourly: 16 × $0.10 = $1.60
- Monthly: $1.60 × 730 hours = $1,168/month

Storage Costs (GKE)

Pricing: pd-ssd at $0.10/GB/month

| Storage Type | Size | Monthly Cost |
|--------------|------|--------------|
| Database node storage | 12 nodes × 600 GB = 7.2 TB | $720 |
| Total | 7.2 TB | $720/month |

Network Costs

Estimated traffic:
- Internal cluster communication: ~10 GB/day
- External API traffic: ~5 GB/day
- S3 fetch traffic: ~20 GB/day
- Total: ~35 GB/day ≈ 1 TB/month

Network cost: egress at $0.12/GB × 1,000 GB = $120/month

Total Monthly Cost

| Component | Monthly Cost | Annual Cost |
|-----------|--------------|-------------|
| Compute (16 nodes) | $1,168 | $14,016 |
| Storage (7.2 TB) | $720 | $8,640 |
| Network | $120 | $1,440 |
| Total | $2,008 | $24,096 |
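The totals above can be recomputed directly from the unit prices used in this section (n1-standard-4 at $0.10/hour, pd-ssd at $0.10/GB/month, egress at $0.12/GB):

```python
# Recompute monthly/annual totals from the unit prices above.
HOURS_PER_MONTH = 730

compute = 16 * 0.10 * HOURS_PER_MONTH   # 16 nodes -> $1,168/month
storage = 12 * 600 * 0.10               # 7.2 TB pd-ssd -> $720/month
network = 1000 * 0.12                   # ~1 TB egress -> $120/month

monthly = compute + storage + network
annual = monthly * 12
with_wmf = annual * 1.10                # +10% WMF overhead
print(f"${monthly:,.0f}/month, ${annual:,.0f}/year, "
      f"${with_wmf:,.0f}/year with WMF overhead")
```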

WMF Infrastructure Overhead

Required overhead: Minimum 10%

Reasons for overhead:
- Internal network infrastructure
- Operational support and monitoring
- Compliance and security requirements
- Redundancy and backup systems
- Shared infrastructure costs

Total annual cost with 10% overhead:

$24,096 × 1.10 = $26,506/year

10-Year Cost Projection

Vitess costs (assumed constant each year):
- Annual cost: $24,096
- 10-year cost: $240,960

With 10% WMF overhead:
- Annual: $26,506
- 10-year: $265,060

Cost Comparison

Vitess vs. S3 Storage

From STORAGE-COST-ESTIMATIONS.md:

| Cost Component | Annual Cost | 10-Year Cost |
|----------------|-------------|--------------|
| S3 storage (Year 10 scale) | $227,832 | $2.28M |
| Vitess (GKE) | $24,096 | $240,960 |
| Vitess (WMF +10%) | $26,506 | $265,060 |
| Vitess % of S3 | 11.6% | 11.6% |

Conclusion: Vitess represents ~12% of S3 storage costs, making it highly efficient as a metadata/indexing layer.

Baseline: Single-Node Database

Comparison to non-sharded approach:

| Approach | Nodes | CPU | Annual Cost | 10-Year Cost |
|----------|-------|-----|-------------|--------------|
| Single MySQL (no sharding) | 1 | 16 cores | $1,920 | $19,200 |
| Vitess (24 shards) | 16 | 64 cores | $26,506 | $265,060 |

Analysis:
- Vitess costs 13.8x more than a single node
- But provides: horizontal scalability, fault isolation, resharding capability, built-in replication
- A single node cannot scale to 1B entities (physical limits)
- Conclusion: the Vitess cost premium is justified by its scalability benefits

Scaling Strategy

Phase 1: Start (Weeks 1-4)

Initial deployment:
- 12 shards
- 12 database nodes (6 primary + 6 replica)
- 2 application nodes
- Handles up to 30 QPS
- Supports 500M entities

Phase 2: Scale (Months 2-6)

Expand to 24 shards:
- 24 database nodes (12 primary + 12 replica)
- 4 application nodes
- Handles up to 85 QPS (peak)
- Supports 1B entities

Phase 3: Future (Years 2-10)

Incremental resharding:
- 24 → 32 shards (Year 2-3)
- 32 → 48 shards (Year 4-6)
- 48 → 64 shards (Year 7-10)
- Supports 2.5B entities by Year 10

Resharding Costs

Per resharding operation:
- Temporary nodes for migration: 4 nodes × 1 week = ~$384
- Downtime: near-zero (Vitess live resharding)
- Amortized cost: ~$384 per operation, every 2-3 years

Performance Characteristics

Query Latency

From CACHING-STRATEGY.md:

| Layer | Latency | Hit Rate |
|-------|---------|----------|
| Valkey (cache) | <1ms | 80% |
| Vitess (cache miss) | 100-200ms | 20% |
| S3 (entity fetch) | 200-500ms | 16% (after Vitess miss) |

Bottleneck Analysis

Capacity:
- VTGate: 4 cores = ~8,000 QPS capacity
- VTTablet: 36 cores = ~54,000 QPS capacity
- Required: 40-110 QPS
- Headroom: 73x (VTGate), 491x (VTTablet)

Conclusion: Significant headroom with 24 shards and 80% cache hit rate.
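The headroom multiples follow from dividing the capacity figures above by the worst-case load. The per-core throughput numbers (2,000 QPS/core for VTGate, 1,500 QPS/core for VTTablet) are the implied assumptions behind those capacities, not measured values:

```python
# Headroom check: assumed per-core capacities vs. worst-case Vitess load.
VTGATE_CAPACITY = 4 * 2000      # 4 cores x ~2,000 QPS/core = 8,000 QPS
VTTABLET_CAPACITY = 36 * 1500   # 36 cores x ~1,500 QPS/core = 54,000 QPS
WORST_CASE_QPS = 110

print(round(VTGATE_CAPACITY / WORST_CASE_QPS))    # ~73x headroom
print(round(VTTABLET_CAPACITY / WORST_CASE_QPS))  # ~491x headroom
```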

Monitoring Requirements

Key Metrics

Vitess-specific metrics:
- Shard health and replication lag (<5 seconds target)
- Query latency (p50 <100ms, p95 <200ms, p99 <500ms)
- Cache hit rates (>80% target)
- CPU/memory utilization per pod (<70% target)
- Disk I/O per shard
- Network throughput between pods

Kubernetes metrics:
- Pod restart rates
- Node resource utilization
- Pod pending/unschedulable count
- Network latency between nodes

Alerting Thresholds

Vitess alerts:
- Replication lag >10 seconds
- Query latency p99 >500ms for 5+ minutes
- CPU >70% for 10+ minutes
- Disk usage >80%

Kubernetes alerts:
- Pod crash loop (>3 restarts)
- Node not ready >5 minutes
- Resource quota exceeded

Risk Assessment

High-Risk Items

1. Resharding complexity
   - Risk: large-scale resharding operations can fail or cause issues
   - Probability: Low (mature Vitess tooling)
   - Impact: Medium (downtime, data inconsistency)
   - Mitigation: test thoroughly, reshard gradually, maintain backups

Medium-Risk Items

1. Cache hit rate lower than expected
   - Risk: actual cache hit rate <80%, increasing Vitess load
   - Probability: Medium (the 80% figure is a conservative estimate)
   - Impact: Medium (higher Vitess QPS, more resources needed)
   - Mitigation: monitor closely, scale VTTablets if needed

2. Replication lag
   - Risk: replicas fall behind the primary, serving stale reads
   - Probability: Low (Vitess built-in replication)
   - Impact: Medium (stale data, consistency issues)
   - Mitigation: monitor lag, use the primary for critical reads, scale replicas

Low-Risk Items

1. Cost overruns
   - Risk: actual costs higher than estimated
   - Probability: Low (conservative sizing)
   - Impact: Low (manageable cost scale)
   - Mitigation: monitor monthly spend, right-size resources

2. Resource exhaustion
   - Risk: insufficient CPU/RAM for the workload
   - Probability: Low (73x headroom calculated)
   - Impact: Low (scale horizontally, add shards)
   - Mitigation: autoscaling policies, capacity planning

Recommendations

Primary Recommendation

Deploy Vitess with 24 shards starting from day 1

Key benefits:
- Immediate support for 1B entities
- 16% headroom per shard (210 GB vs. 250 GB target)
- No immediate resharding required
- Manageable annual cost: ~$26,506
- ~12% of S3 storage costs (highly efficient)

Secondary Recommendations

  1. Start with conservative cache hit rate (80%) - Plan for higher Vitess load initially
  2. Monitor actual cache hit rates - Adjust resource allocation based on real data
  3. Plan for resharding by Year 2-3 - Don't wait until shards hit limits
  4. Single-region deployment initially - Add multi-region only if disaster recovery needs require it
  5. Use GKE autoscaling - Enable HPA for VTGate based on QPS

Implementation Priority

Phase 1 (Immediate):
- Deploy 12-shard Vitess cluster
- Configure replication (1 primary + 1 replica)
- Set up monitoring and alerting
- Configure query caching

Phase 2 (Months 1-3):
- Expand to 24 shards
- Monitor performance at scale
- Tune cache parameters
- Optimize query plans

Phase 3 (Months 3-12):
- Prepare resharding plan for Year 2-3
- Establish backup strategy
- Document operational procedures
- Train team on Vitess operations

Summary

- Total annual cost: ~$26,506 (GKE pricing + 10% WMF overhead)
- 10-year cost: ~$265,060
- Cost relative to S3: ~12% of S3 storage costs at Year 10 scale
- Cluster size: 16 nodes, 64 cores, 128 GB RAM, 24 shards
- Capacity: 40-110 QPS required (73-491x headroom)
- Scalability: horizontal scaling to 64 shards by Year 10

Conclusion: Vitess is highly cost-effective as the metadata/indexing layer, providing horizontal scalability and fault tolerance at a fraction of S3 storage costs. The 24-shard starting configuration provides immediate support for 1B entities with headroom for growth.


Document version: 1.0
Last updated: January 1, 2026
Author: Backend team
Status: Draft for review

Related documents:
- STORAGE-COST-ESTIMATIONS.md - S3 storage cost analysis
- STORAGE-ARCHITECTURE.md - Vitess database design
- CACHING-STRATEGY.md - Cache layer architecture
- SCALING-PROPERTIES.md - Scaling characteristics