Skip to content

Entitybase Backend Architecture

Immutable Revision Architecture (Vitess + S3)

This document describes a clean-room, billion-scale Entitybase architecture based on immutable S3 snapshots, Vitess indexing, and a well-defined API boundary.

Core invariant

A revision is an immutable snapshot stored in S3. Once written, it never changes.

There are: - No mutable revisions - No diff storage - No page-based state - No MediaWiki-owned content

Everything else in the system derives from this rule.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Client Layer                             │
│  (Browser, Mobile Apps, SPARQL Queries, External Systems)        │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                    REST API Layer (FastAPI)                     │
│  - Entity CRUD endpoints                                        │
│  - Type-specific endpoints (items, properties, lexemes)        │
│  - Statement management                                          │
│  - User features (watchlist, thanks, endorsements)              │
│  - RDF export (Turtle, RDF XML, NTriples)                      │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                     Service Layer                               │
│  - Entity operations (create, update, delete, revert)        │
│  - Statement deduplication                                      │
│  - Lexeme term processing                                       │
│  - User activity tracking                                       │
│  - Statistics computation                                        │
│  - RDF generation and diffing                                   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                   Repository Layer                             │
│  - VitessRepository (metadata, indexing)                       │
│  - S3Repository (immutable content)                             │
│  - StreamRepository (Kafka events)                             │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│              Infrastructure Layer                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │    S3        │  │   Vitess     │  │   Kafka      │         │
│  │  (Content)   │  │ (Metadata)   │  │ (Streaming)  │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘

System Components

1. REST API Layer

Main Application: FastAPI application with async/await support

API Version: /v1/entitybase (configurable via api_prefix)

Endpoint Categories:

Entity CRUD

  • GET /entities/{entity_id} - Get entity (JSON/TTL/RDF)
  • GET /entities/{entity_id}/history - Get revision history
  • GET /entities/{entity_id}/revision/{revision_id} - Get specific revision
  • DELETE /entities/{entity_id} - Delete entity

Type-Specific Creation

  • GET /entities/items - Create empty item (headers only, returns entity_id)
  • GET /entities/properties - Create empty property (headers only, returns entity_id)
  • POST /entities/lexemes - Create lexeme (requires body with lemmas, language, lexical_category)
  • PUT /entities/items/{item_id} - Update item
  • PUT /entities/properties/{property_id} - Update property
  • PUT /entities/lexemes/{lexeme_id} - Update lexeme

Statement Management

  • GET /entities/{entity_id}/properties - List unique properties
  • GET /entities/{entity_id}/properties/{list} - Get property hashes
  • POST /entities/{entity_id}/properties/{property_id} - Add statements
  • DELETE /entities/{entity_id}/statements/{hash} - Remove statement
  • PATCH /entities/{entity_id}/statements/{hash} - Patch statement
  • GET /statements/{hash} - Get statement by hash
  • POST /statements/batch - Batch get statements
  • GET /statements/most_used - Get most used statements
  • POST /statements/cleanup-orphaned - Cleanup orphaned statements

Terms Management

  • GET /entities/items/{item_id}/labels/{lang} - Get item label
  • PUT /entities/items/{item_id}/labels/{lang} - Update item label
  • DELETE /entities/items/{item_id}/labels/{lang} - Delete item label
  • Similar for descriptions, aliases, and lexemes
  • GET /entities/{entity_id}/sitelinks/{site} - Get sitelink
  • POST /entities/{entity_id}/sitelinks/{site} - Add sitelink
  • PUT /entities/{entity_id}/sitelinks/{site} - Update sitelink
  • DELETE /entities/{entity_id}/sitelinks/{site} - Delete sitelink

Redirects

  • POST /redirects - Create redirect
  • POST /entities/{id}/revert-redirect - Revert redirect

Revert

  • POST /entities/{entity_id}/revert - Revert to previous revision

User Features

  • GET /users/{user_id} - Get user info
  • GET /users/{user_id}/activity - Get user activity
  • POST /users/{user_id}/watchlist - Add to watchlist
  • GET /users/{user_id}/watchlist - Get watchlist
  • POST /users/{user_id}/thank - Send thank

Endorsements

  • POST /statements/{hash}/endorse - Endorse statement
  • DELETE /statements/{hash}/endorse - Withdraw endorsement
  • GET /statements/{hash}/endorsements - Get endorsements

Statistics

  • GET /stats - Get general statistics
  • GET /health - Health check

Documentation: See ENDPOINTS.md for complete endpoint list.

2. Service Layer

Entity Services

  • EntityHandler: Base handler for all entity operations
  • EntityCreateHandler: Entity creation logic
  • EntityUpdateHandler: Entity update logic
  • EntityDeleteHandler: Entity deletion logic
  • RedirectHandler: Redirect management
  • RevertHandler: Revision revert logic

Statement Services

  • StatementService: Statement deduplication and storage
  • StatementHandler: Statement CRUD operations
  • SnakHandler: Snak deduplication

Lexeme Services

  • LexemeHandler: Lexeme CRUD operations
  • LexemeFormHandler: Form management
  • LexemeSenseHandler: Sense management

User Services

  • UserHandler: User operations
  • ThanksHandler: Thanks feature
  • EndorsementHandler: Endorsement feature
  • WatchlistHandler: Watchlist management
  • UserStatsService: User statistics computation
  • GeneralStatsService: General statistics computation

RDF Services

  • EntityConverter: Convert entities to RDF
  • EntityDiffWorker: Compute RDF diffs between revisions
  • RDFSerializer: Serialize entities to RDF formats

ID Generation

  • EnumerationService: Range-based ID allocation

3. Repository Layer

Vitess Repositories

  • EntityRepository: Entity metadata
  • HeadRepository: Head revision tracking
  • RevisionRepository: Revision metadata
  • StatementRepository: Statement content tracking
  • BacklinkRepository: Backlink tracking
  • RedirectRepository: Redirect management
  • UserRepository: User data
  • ThanksRepository: Thanks data
  • EndorsementRepository: Endorsement data
  • WatchlistRepository: Watchlist data
  • ListingRepository: Entity listings
  • TermsRepository: Term metadata
  • LexemeRepository: Lexeme-specific operations
  • MetadataRepository: Entity metadata (flags)

S3 Repositories

  • RevisionStorage: Revision snapshots
  • StatementStorage: Statement content
  • ReferenceStorage: Reference content
  • QualifierStorage: Qualifier content
  • SnakStorage: Snak content
  • MetadataStorage: Term and sitelink metadata
  • LexemeStorage: Lexeme forms and senses

Stream Repositories

  • EntityChangeStreamProducer: Publish entity change events
  • EntityDiffStreamProducer: Publish RDF diff events
  • Consumer: Kafka event consumer for watchlist

Documentation: See REPOSITORIES.md for detailed repository documentation.

4. Background Workers

ID Generation Worker

  • File: src/models/workers/id_generation/id_generation_worker.py
  • Purpose: Reserves ID ranges for high-throughput entity creation
  • Schedule: Continuous (no scheduled interval)

Entity Diff Worker

  • File: src/models/workers/entity_diff/entity_diff_worker.py
  • Purpose: Computes RDF diffs between entity revisions
  • Output: Streams to wikibase.entity_diff Kafka topic
  • File: src/models/workers/backlink_statistics/backlink_statistics_worker.py
  • Purpose: Computes backlink statistics for entities
  • Schedule: Daily at 2 AM (0 2 * * *)

User Stats Worker

  • File: src/models/workers/user_stats/user_stats_worker.py
  • Purpose: Computes daily user statistics
  • Schedule: Daily at 2 AM (0 2 * * *)

General Stats Worker

  • File: src/models/workers/general_stats/general_stats_worker.py
  • Purpose: Computes daily general wiki statistics
  • Schedule: Daily at 2 AM (0 2 * * *)

Watchlist Consumer Worker

  • File: src/models/workers/watchlist_consumer/main.py
  • Purpose: Consumes entity change events, creates watchlist notifications
  • Consumes: Kafka topic entitybase.entity_change

Notification Cleanup Worker

  • File: src/models/workers/notification_cleanup/main.py
  • Purpose: Cleans up old watchlist notifications
  • Schedule: Configurable

Dev Worker

  • File: src/models/workers/dev/__main__.py
  • Purpose: Development tools (bucket creation, table creation)
  • Commands: create_buckets, create_tables

Documentation: See WORKERS.md for detailed worker documentation.

5. Data Flow

Entity Creation Flow

Client
  ↓ POST /entities/items
API Handler (EntityCreateHandler)
  ↓ Validate JSON schema
EnumerationService
  ↓ Allocate next Q-ID (e.g., Q123)
CreationTransaction
  ↓ Process statements (hash, deduplicate, store to S3)
  ↓ Store terms (hash, store to S3)
  ↓ Store sitelinks (hash, store to S3)
  ↓ Create revision snapshot (hash, store to S3)
Vitess (within transaction)
  ↓ Insert entity_revisions record
  ↓ Update entity_head
  ↓ Update statement_content ref_counts
Stream (optional)
  ↓ Publish entity change event
Response
  ↓ Return entity_id and revision_id

Entity Read Flow

Client
  ↓ GET /entities/Q123
API Handler
  ↓ Query entity_head for head_revision_id
Vitess
  ↓ Get revision metadata (including content_hash)
S3
  ↓ Load revision by content_hash
  ↓ Load hash-referenced content:
    - Terms (labels, descriptions, aliases)
    - Sitelinks
    - Statements (with referenced snaks, qualifiers, references)
  ↓ Reconstruct full entity
Response
  ↓ Return entity JSON

Statement Deduplication Flow

Entity Create/Update
  ↓ Extract statements from request
StatementService.deduplicate_and_store_statements
  ↓ For each statement:
    - Hash statement content (mainsnak + qualifiers + references)
    - Check if exists in statement_content table
    - If new: store to S3, insert record (ref_count=1)
    - If exists: increment ref_count
  ↓ Replace statement with hash reference in entity
Revision Storage
  ↓ Store revision with statement_hashes array

6. Storage Architecture

S3 Storage

  • Revisions: Immutable snapshots stored by content_hash
  • Statements: Deduplicated, stored by hash
  • References: Deduplicated, stored by hash
  • Qualifiers: Deduplicated, stored by hash
  • Snaks: Deduplicated, stored by hash
  • Terms: Deduplicated, stored by hash (UTF-8 text)
  • Sitelinks: Deduplicated, stored by hash (UTF-8 text)
  • Lexeme Forms: Deduplicated, stored by hash
  • Lexeme Senses: Deduplicated, stored by hash

Schema Versions: - Entity: 2.0.0 (hash-based responses) - Revision: 4.0.0 (full deduplication) - Statement: 1.0.0 (hash-referenced snaks)

Documentation: See STORAGE-ARCHITECTURE.md for complete storage architecture.

Vitess Storage

  • Entity metadata: entity_head, entity_revisions, metadata
  • Statement tracking: statement_content (hash + ref_count)
  • ID allocation: id_ranges
  • User features: users, user_thanks, user_statement_endorsements, user_watchlist, watchlist_notifications
  • Statistics: user_daily_stats, general_daily_stats, backlink_statistics
  • Other: entity_redirects, entity_backlinks, lexeme_terms

Documentation: See STORAGE-ARCHITECTURE.md for complete Vitess schema.

7. Key Features

Content Deduplication

  • Statements: Deduplicated across all entities
  • References: Deduplicated across statements
  • Qualifiers: Deduplicated across statements
  • Snaks: Deduplicated across statements, qualifiers, references
  • Terms: Deduplicated across entities (labels, descriptions, aliases)
  • Sitelinks: Deduplicated across entities

Storage Savings: ~90% reduction compared to inline storage.

Documentation: See STATEMENT-DEDUPLICATION.md for details.

Range-Based ID Generation

  • Pre-allocates ID ranges for high-throughput creation
  • Worker ensures ranges always available
  • Atomic ID claims with confirmation
  • No coordination required for allocation

Documentation: See ENTITY-MODEL.md for complete ID generation details.

Social Features

  • Thanks: Thank users for specific revisions
  • Endorsements: Endorse statements to signal trust
  • Watchlist: Track entity changes with notifications

Documentation: See separate sections in ARCHITECTURE/ for each feature.

RDF Export

  • Formats: Turtle, RDF XML, NTriples
  • Diffing: RDF diffs between revisions
  • Streaming: Real-time RDF change events
  • Canonicalization: URDNA2015 for consistent blank node handling

Documentation: See RDF-BUILDER/ directory for RDF architecture.

Statistics

  • User Stats: Daily user activity statistics
  • General Stats: Daily wiki-wide statistics
  • Backlink Stats: Periodic backlink counts

Documentation: See STATISTICS.md for statistics architecture.

8. Configuration

All settings managed via environment variables:

Database: VITESS_HOST, VITESS_PORT, VITESS_DATABASE, VITESS_USER, VITESS_PASSWORD

Storage: S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY, S3_*_BUCKET

Streaming: KAFKA_BROKERS, STREAMING_ENABLED, KAFKA_*_TOPIC

API: API_PREFIX, ENTITY_VERSION

Workers: _STATS_ENABLED, _STATS_SCHEDULE

Documentation: See CONFIGURATION.md for complete configuration reference.

9. Transaction Model

Atomicity

  • All Vitess operations wrapped in database transactions
  • S3 operations tracked for rollback
  • Ref counts ensure consistency

Isolation

  • Read committed isolation level
  • No dirty reads, non-repeatable reads prevented

Durability

  • Vitess: Durable storage (MySQL)
  • S3: Durable object storage
  • Kafka: Durable event streaming

Consistency

  • Eventual consistency for stats workers
  • Strong consistency for entity CRUD operations
  • Ref counting ensures statement consistency

10. Performance Characteristics

Scalability

  • Horizontal scaling: S3 and Kafka scale horizontally
  • Vertical scaling: Vitess can be sharded
  • Throughput: Supports thousands of operations per second

Latency

  • Entity read: ~200-500ms (including hash content loading)
  • Entity create: ~300-600ms (including S3 writes)
  • Statement read: ~50-150ms
  • RDF export: ~500-1000ms per entity

Storage Efficiency

  • Deduplication: ~90% storage savings
  • Compression: Optional for S3 objects
  • CDN caching: Enabled for public buckets

11. Security Considerations

Authentication

  • User authentication via MediaWiki tokens (not implemented yet)
  • API keys for external access (not implemented yet)

Authorization

  • Edit rights via MediaWiki user permissions (not implemented yet)
  • Protected entities via semi_protected, locked, mass_edit_protected flags

Data Protection

  • No sensitive data in logs
  • Secure credential management via environment variables
  • S3 bucket access control

12. Monitoring and Observability

Health Checks

  • GET /health - API health check
  • Worker health endpoints
  • Database connection monitoring
  • S3 connectivity monitoring

Logging

  • Structured logging with request IDs
  • Log levels: DEBUG, INFO, WARNING, ERROR
  • Audit logging for entity operations (optional)

Metrics

  • Request latency histograms
  • Error rate tracking
  • Storage usage monitoring
  • Worker execution timing