Statement Deduplication
Executive Summary
First-class statement-level revision tracking at 1 trillion statement scale. MVP greenfield implementation with zero migration requirements.
Key Architectural Decisions
- Statement IDs: Hash-only (no S-IDs, no mapping table)
- Statement Granularity: Full statement block (mainsnak + qualifiers + references)
- Deletion Lifecycle: Statements live forever (accessible via entity revision history)
- Deduplication: Automatic via hash PRIMARY KEY (rapidhash)
- Entity Revisions: Store hash arrays + properties + property_counts
- Property-Based Loading: Intelligent frontend loading, demand-fetch only
- Most-Used Statements:
/statement/most_usedendpoint for scientific analysis - Hard Delete Cleanup: Ref-count tracking with 180-day grace period
- No Migration: Optimal architecture from day one (MVP greenfield)
Conceptual Model
Entity
Logical identifier only, no intrinsic state outside revisions.
entity_id(e.g., Q123)entity_type(item, property, lexeme, …)
Statement
First-class citizen with stable identifier (hash).
statement_hash(rapidhash of statement JSON): 64-bit integer- Immutable snapshot of claims data including:
- mainsnak
- type
- rank
- qualifiers
- references
Lifecycle
Creation
- Hash computed:
rapidhash(statement_json) - S3 object created:
statements/{hash}.json - Database record:
INSERT INTO statement_content (content_hash, created_at)
Deduplication
If same statement content appears:
1. Hash already exists in statement_content table
2. S3 object already exists
3. Only ref_count increment needed
Usage Tracking
When statement added to entity revision:
1. Entity revision stores hash: statements: [hash1, hash2, ...]
2. ref_count in statement_content incremented
Deletion (Entity Hard-Delete)
ref_countdecremented for all statements used by entity- If
ref_countreaches 0: cleanup job schedules deletion (180-day grace period) - Statement object deleted from S3 and database
- Statement remains accessible via historical entity revisions
Storage Architecture
S3 Storage
Entity Revisions
s3://wikibase-revisions/entity/Q42/rev100.json
{
"schema_version": "1.0.0",
"revision_id": 100,
"created_at": "2026-01-05T10:00:00Z",
"entity_id": "Q42",
"statements": [hash1, hash2, hash3],
"properties": ["P31", "P569", "P19"],
"property_counts": {"P31": 2, "P569": 1, "P19": 1},
"content_hash": 11223344556677889,
"metadata": {
"labels": {...},
"descriptions": {...}
}
}
Statements
s3://wikibase-statements/987654321012345678.json
{
"content_hash": 987654321012345678,
"statement": {
"mainsnak": {...},
"type": "statement",
"rank": "normal",
"qualifiers": {...},
"references": [...]
},
"created_at": "2026-01-05T09:00:00Z"
}
Storage Projections (1T statements)
- Unique statements: 800B
- Raw storage: 400 GB
- Compressed (6:1 Brotli): 67 GB
- Cost (S3 Standard): $1,541/month
- Cost (Intelligent-Tiering): ~$536/month
Vitess Schema
Statement Content
statement_content (
content_hash BIGINT PRIMARY KEY,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ref_count INT DEFAULT 1,
INDEX idx_ref_count (ref_count DESC)
)
Entity Revisions
entity_revisions (
entity_id BIGINT NOT NULL,
revision_id BIGINT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
statements JSON NOT NULL,
is_mass_edit BOOLEAN DEFAULT FALSE,
edit_type VARCHAR(100) DEFAULT '',
PRIMARY KEY (entity_id, revision_id),
INDEX idx_created (created_at)
)
Entity Head
entity_head (
entity_id BIGINT PRIMARY KEY,
head_revision_id BIGINT NOT NULL,
is_semi_protected BOOLEAN DEFAULT FALSE,
is_locked BOOLEAN DEFAULT FALSE,
is_archived BOOLEAN DEFAULT FALSE,
is_dangling BOOLEAN DEFAULT FALSE,
is_mass_edit_protected BOOLEAN DEFAULT FALSE,
is_deleted BOOLEAN DEFAULT FALSE,
is_redirect BOOLEAN DEFAULT FALSE,
redirects_to BIGINT NULL,
updated_at TIMESTAMP
)
Storage Projections (1T statements)
- statement_content: 80 GB
- entity_revisions: 1 TB
- entity_head: 50 GB
- Total: ~1.18 TB
API Surface
Entity Endpoints
GET /entity/{id}→ Metadata + statement hashes + properties + countsGET /entity/{id}/revision/{rev_id}→ Historical entity revisionGET /entity/{id}/properties→ Flat property IDs arrayGET /entity/{id}/properties/counts→ Statement counts per propertyGET /entity/{id}/properties/{prop_id}→ Hashes for specific property
Statement Endpoints
GET /statement/{hash}→ Full statement JSON from S3POST /statements/batch→ Batch fetch multiple statementsGET /statement/most_used→ Most referenced statements
Most-Used Query Parameters
limit=1000offset=0min_ref_count=10property_range=P0-P999sort_by=ref_count_desc
Orphaned Statement Cleanup
Hard Delete Flow
-- Step 1: Decrement ref_count
UPDATE statement_content
SET ref_count = ref_count - 1
WHERE content_hash IN (
SELECT content_hash
FROM entity_revisions
WHERE entity_id = Q42_internal_id
AND revision_id = Q42_rev_100
)
-- Step 2: Mark entity as deleted
UPDATE entity_head
SET is_deleted = TRUE
WHERE entity_id = Q42_internal_id
Implementation Phases
Phase 1: Database Schema
- Create
statement_contenttable - Modify
entity_revisionsto store hash arrays - Create S3 object models
Phase 2: Core Logic
- Implement hash computation and deduplication
- Implement entity revision creation with properties + counts
- Implement ref_count tracking
Phase 3: API Endpoints
- Entity endpoints with property-based filtering
- Statement endpoints with batch fetching
- Most-used statements query
Phase 4: Integration
- Frontend integration with demand-loading
- Hard delete with orphaned cleanup
- Performance testing and optimization
Success Criteria
- Database schema supports hash-based statements
- Entity revisions store hash arrays (not full statements)
- Statement deduplication working (same content = one S3 object)
- Property-based loading implemented
- Statement endpoints operational
- Most-used statements endpoint functional
- Hard delete with orphaned cleanup working
- Performance targets met (100ms reads, 500ms batch writes)
- Storage projections validated (1T statements achievable)
Performance Projections
Query Latency
GET /entity/Q42: 55-110msGET /statement/{hash}: 50-100msGET /statement/most_used: 110-210ms
Cost (1T statements)
- S3: $536-$1,541/month
- Vitess: ~$22,500/month
- Total: ~$23,000/month (Year 10)