Entity Model
Entity
An entity is a logical identifier representing a Wikibase item, property, lexeme, or entity schema.
entity_id(e.g. Q123, P42, L999, E100)entity_type(item, property, lexeme, entityschema)
Entities have no intrinsic state outside of revisions.
External Entity Identifier
The external entity ID (e.g., Q123, P42, L999, E100) is the permanent public-facing identifier used throughout the ecosystem:
- API endpoints:
/entities/Q123,/entities/items,/entities/properties - SPARQL queries:
SELECT ?item WHERE { ?item wdt:P31 wd:Q123 } - RDF/JSON data: Cross-entity references in claims use Q123
- RDF triples:
<http://www.wikidata.org/entity/Q42> a wikibase:Item - RDF change events:
entity_id: "Q42"in event schemas - S3 paths: Human-readable inspection (e.g.,
s3://wikibase-revisions/Q123/42.json)
Characteristics:
- Human-readable: Easy to communicate and debug
- Stable: Never changes, permanent contract with external ecosystem
- Compatible: Works with all Wikidata tools, SPARQL queries, existing datasets
- Semantic: Prefix indicates entity type (Q=item, P=property, L=lexeme, E=entityschema)
Range-Based ID Generation
Entity IDs are generated using a range-based allocation system for efficient, high-throughput ID generation.
ID Ranges Table
The id_ranges table manages ID allocation:
CREATE TABLE id_ranges (
entity_type VARCHAR(50) PRIMARY KEY,
current_range_start BIGINT UNSIGNED NOT NULL,
current_range_end BIGINT UNSIGNED NOT NULL,
next_id BIGINT UNSIGNED NOT NULL,
range_size BIGINT UNSIGNED NOT NULL DEFAULT 1000,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
entity_type: Entity type (item, property, lexeme, entityschema)current_range_start: Start of the current reserved rangecurrent_range_end: End of the current reserved rangenext_id: Next ID to allocaterange_size: Number of IDs in each reserved range
ID Generation Process
Initialization (via Dev Worker):
INSERT INTO id_ranges (entity_type, current_range_start, current_range_end, next_id, range_size)
VALUES
('item', 1, 1000, 1, 1000),
('property', 1, 1000, 1, 1000),
('lexeme', 1, 1000, 1, 1000),
('entityschema', 1, 1000, 1, 1000);
ID Allocation (via IdGenerationWorker):
1. Worker polls id_ranges table
2. If next_id >= current_range_end, reserve new range: UPDATE id_ranges SET current_range_start = current_range_end + 1, current_range_end = current_range_end + range_size, next_id = current_range_start WHERE entity_type = ?
3. Atomically increment and claim ID: UPDATE id_ranges SET next_id = next_id + 1 WHERE entity_type = ? AND next_id = ?
4. Format ID with prefix (Q/P/L/E)
5. Cache in memory for fast allocation
Atomic ID Claim (during entity creation): 1. Check cached ID availability 2. If available, claim and use 3. If exhausted, request new range from worker 4. Confirm ID usage after successful entity creation
EnumerationService
The EnumerationService provides thread-safe ID allocation:
class EnumerationService:
def allocate_id(self, entity_type: EntityType) -> str:
"""Allocate the next available ID for an entity type."""
# Returns formatted ID (e.g., "Q123")
Features: - Thread-safe ID allocation - In-memory caching for low-latency allocation - Automatic range reservation when cache exhausted - Confirmation mechanism to reclaim unused IDs on failure
Entity ID vs Internal ID
The current architecture uses entity_id strings throughout the system:
Vitess Tables
All tables use entity_id as the primary key or foreign key:
entity_head
- entity_id VARCHAR(50) PRIMARY KEY
- head_revision_id BIGINT UNSIGNED NOT NULL
- updated_at TIMESTAMP NOT NULL
entity_revisions
- entity_id VARCHAR(50) NOT NULL
- revision_id BIGINT UNSIGNED NOT NULL
- content_hash BIGINT UNSIGNED NOT NULL
- PRIMARY KEY (entity_id, revision_id)
entity_redirects
- entity_id VARCHAR(50) PRIMARY KEY
- redirects_to VARCHAR(50) NOT NULL
statement_content
- content_hash BIGINT UNSIGNED PRIMARY KEY
- ref_count BIGINT UNSIGNED NOT NULL DEFAULT 1
No entity_id_mapping table exists. Entity IDs are used directly.
Advantages of Current Design
- Simplicity: No translation layer needed
- Human-readable: All database queries use familiar Q123/P42 IDs
- Debugging: Easier to inspect database state
- Compatibility: Direct mapping to Wikidata IDs
Usage Examples
Entity Creation
Client Request
POST /entities/items
{
"type": "item",
"labels": {"en": {"language": "en", "value": "Douglas Adams"}},
"claims": {...}
}
↓
API Layer (EntityCreateHandler)
1. Validate JSON schema (Pydantic)
2. Create CreationTransaction
3. Allocate next ID via EnumerationService (returns "Q123")
4. Register entity in Vitess (entity_head)
5. Process statements: hash, deduplicate, store in S3/Vitess (increment ref_counts)
6. Create revision: store in S3/Vitess (entity_revisions)
7. Publish change event (optional)
8. Commit transaction (confirm ID usage to worker)
9. On failure: Rollback all operations (delete from Vitess/S3, decrement ref_counts)
↓
Client Response
{
"id": "Q123",
"revision_id": 1,
"created_at": "2025-01-15T10:30:00Z"
}
Entity Read
Client Request
GET /entities/Q123
↓
API Layer
1. Query entity_head:
SELECT head_revision_id, updated_at FROM entity_head WHERE entity_id = "Q123"
2. Query entity_revisions for content_hash:
SELECT content_hash FROM entity_revisions WHERE entity_id = "Q123" AND revision_id = ?
3. Fetch S3 snapshot:
GET s3://wikibase-revisions/123456789
4. Reconstruct entity from S3 data + hash references
↓
Client Response
{
"id": "Q123",
"revision_id": 42,
"entity": {...entity content...}
}
Property Creation
Client Request
POST /entities/properties
{
"type": "property",
"datatype": "string",
"labels": {"en": {"language": "en", "value": "name"}}
}
↓
API Layer (PropertyCreateHandler)
1. Validate request
2. Allocate property ID via EnumerationService (returns "P42")
3. Register property in Vitess
4. Process statements (no claims for properties)
5. Create revision
6. Commit transaction
↓
Client Response
{
"id": "P42",
"revision_id": 1,
"created_at": "2025-01-15T11:00:00Z"
}
Storage Example
S3 Object Path (content_hash-based):
s3://wikibase-revisions/123456789
Vitess Tables:
entity_head:
entity_id: "Q123" (PRIMARY KEY)
head_revision_id: 42
updated_at: "2025-01-15T10:30:00Z"
entity_revisions:
entity_id: "Q123"
revision_id: 42
content_hash: 123456789
created_at: "2025-01-15T10:30:00Z"
id_ranges:
entity_type: "item"
current_range_start: 1
current_range_end: 1000
next_id: 124
Key Principles
- ID Stability: Q123 never changes, maintains 100% ecosystem compatibility
- Range-Based Allocation: Efficient ID generation with minimal database contention
- Worker-Assisted: IdGenerationWorker pre-allocates ranges for low-latency allocation
- Direct Usage: Entity IDs used directly in all tables (no mapping layer)
- Prefix Semantics: Q/P/L/E prefixes indicate entity type throughout system
ID Generation Worker
File: src/models/workers/id_generation/id_generation_worker.py
Purpose: Background worker that reserves ID ranges to ensure high-throughput entity creation.
Process:
1. Monitors id_ranges table
2. When next_id approaches current_range_end, reserves new range
3. Continuously polls to ensure ranges are always available
4. Provides health checks for monitoring
Configuration:
- WORKER_ID: Unique worker identifier (default: auto-generated)
Historical Note
Previous documentation described a hybrid ID strategy with ulid-flake internal IDs and an entity_id_mapping table. This was never implemented. The current architecture uses entity_id strings directly with range-based allocation, which is simpler and more maintainable.
References
- STORAGE-ARCHITECTURE.md - S3 and Vitess storage design
- ARCHITECTURE.md - Overall system architecture
- WORKERS.md - Worker architecture including IdGenerationWorker