RDF Endpoint Architecture and Caching Strategy

Overview

This document outlines the proposed architecture for a high-performance RDF endpoint that combines efficient caching with on-demand RDF generation, similar to MediaWiki Wikibase's approach.

Problem Statement

Building RDF for complex entities (e.g., Q42) requires: - Fetching multiple properties (293 for Q42) - Fetching referenced entities (hundreds) - Parsing entity data - Constructing complex RDF graphs - Serializing to Turtle format

Result: Response times that could be tens of seconds, which is not acceptable for an interactive API.

Solution: Hybrid Caching + On-Demand RDF Generation

Instead of pre-generating and caching RDF dumps, we use a 24-hour cache for property types, item labels, and entity metadata, combined with on-demand RDF generation for each request.

This approach: - ✅ Reduces response time dramatically (seconds → milliseconds) - ✅ Keeps data reasonably fresh (24h cache for metadata, live for RDF) - ✅ Controls infrastructure costs (cache hits are cheap, database queries are minimized) - ✅ Matches MediaWiki's approach (generate RDF on-demand, cache auxiliary data)

Architecture

┌─────────────────────────────────────────────────────────┐
│                 Client Request                   │
│                (GET /entity/Q123)              │
└──────────────────────┬──────────────────────────────┘
                     ↓
         ┌─────────────────────────────┐
         │ Browser/Client Cache        │
         │ (HTTP caching, ETags)       │
         └──────────┬────────────────────┘
                    ↓
            ┌────────────────────────┐
            │ CDN Cache (Optional) │
            │ 70-85% hit rate    │
            │ 1 year TTL            │
            │ S3 immutable snapshots  │
            └──────────┬─────────────┘
                       ↓
              (cache miss)
                 ↓
         ┌────────────────────────────────┐
         │ Application Object Cache     │
         │ Valkey / Memcached          │
         │ - entity_id_mapping         │
         │ - entity_head (NEW)        │
         │ - entity_metadata (NEW)     │
         │ TTL: 1 hour                │
         └──────────┬─────────────────────┘
                    ↓
         ┌────────────────────────────────┐
         │ Vitess Database             │
         │ - entity_id_mapping table     │
         │ - entity_head table (NEW)     │
         │ - entity_revisions table (NEW) │
         │ Query cache enabled              │
         └──────────┬─────────────────────┘
                    ↓
         (cache hit)
            ↓
         ┌────────────────────────────────┐
         │ RDF Builder                │
         │ (generate on-demand)           │
         └──────────┬─────────────────────┘
                    ↓
              Turtle RDF
                    ↓
         ┌────────────────────────────────┐
         │ S3 Object Store (NEW)       │
         │ Store for immutable snapshots   │
         └─────────────────────────────┘
                    ↓
            ┌────────────────────────────────┐
         │ API Response                │
         └─────────────────────────────┘

Cache Layers (from CACHING-STRATEGY.md)

Layer 1: Client Cache

HTTP caching headers (ETag, Last-Modified)
Cache-Control: public, max-age=3600
Max-age directives
Conditional requests (If-None-Match, If-Modified-Since)
Expected hit rate: 40-60%
TTL: 1 hour

Layer 2: CDN Cache (Optional)

S3 snapshots (immutable) for 1 year
Cache-Control: public, max-age=31536000 (1 year)
CDN CloudFront/Cloudflare edge caching
S3 GET operations only
Expected hit rate: 70-85% for hot entities
Use case: Serve static content without application involvement

Layer 3: Application Object Cache (Valkey/Memcached)

3.1 Entity ID Mapping Cache (NEW for our use case)

Purpose: Fast lookup of external ID → internal UUID translation

Cache key format:

entity_id:{external_id}

Examples:

entity_id:Q123
entity_id:P42
entity_id:L999

Cache value:

{
  "internal_id": 1424675744195114,
  "entity_type": "item",
  "created_at": "2025-01-15T10:30:00Z"
}

TTL: 3600 seconds (1 hour) - Rationale: External IDs never change, can cache indefinitely - Mappings rarely change: Only if entity is deleted and recreated

3.2 Entity Head Cache (NEW - Required for MediaWiki compatibility)

Purpose: Track current head revision pointer for entity

Cache key format:

entity_head:{internal_id}

Cache value:

{
  "internal_id": 1424675744195114,
  "head_revision_id": 42,
  "updated_at": "2025-01-15T10:30:00Z"
}

TTL: 300 seconds (5 minutes) - Rationale: Head revisions update frequently on active entities - TTL fallback: Ensures eventual consistency

3.3 Entity Metadata Cache (NEW)

Purpose: Cache entity labels, descriptions, types for search/browse endpoints

Cache key format:

entity_meta:{internal_id}

Cache value:

{
  "internal_id": 1424675744195114,
  "external_id": "Q123",
  "entity_type": "item",
  "labels": {
    "en": "Douglas Adams",
    "de": "Douglas Adams"
  },
  "descriptions": {
    "en": "English author"
  }
}

TTL: 1800 seconds (30 minutes) - Rationale: Labels/descriptions change less frequently than entity data - Optimization: Store in compressed format (gzip) to reduce memory

3.4 Vitess Query Cache (NEW)

Purpose: Cache frequently executed SQL queries at database level

Configuration:

-- Enable Vitess query cache
SET GLOBAL query_cache_size = 256M;
SET GLOBAL query_cache_type = ON;

-- Cache common lookups
-- (entity_id_mapping queries, entity_head lookups)

Expected hit rate: 30-40% for repetitive queries - TTL: Managed by Vitess (query plan cache)

Layer 4: S3 Object Store (NEW)

Purpose: System of record for all entity snapshots

Characteristics: - Immutable: Never modified after write - Cache-friendly: CDN caches for 1 year - Global replication: Multi-region S3 for low latency - No caching strategy needed: Immutable by design

Use case: - Store immutable entity snapshots (JSON) - Serve via CDN with 1-year cache - Reduce S3 GET operations

Storage pattern:

revisions/{external_id}/r{revision_id}.json

Examples:

revisions/Q123/r42.json
revisions/P42/r1.json

Layer 5: Vitess Database Query

Purpose: Source of truth for entity data when caches miss

Tables: - entity_id_mapping - External ID → internal ID translation - entity_head - Current head revision pointer - entity_revisions - Revision history (NEW) - entity_metadata - Labels, descriptions (separate table or materialized view)

Data Flow

Request Flow (Cache Hit)

Client Request
  ↓
Browser Cache HIT
  ↓
CDN Cache HIT (if used)
  ↓
Application Cache HIT (entity_id_mapping)
  ↓
Lookup internal_id = 1424675744195114
  ↓
Query Vitess for entity data (using internal_id)
  ↓
RDF Builder generates Turtle RDF
  ↓
Return to client (latency: <100ms total)

Request Flow (Cache Miss - New Entity)

Client Request
  ↓
Browser Cache MISS
  ↓
Application Cache MISS (entity_id_mapping)
  ↓
Query Vitess for entity ID mapping
  ↓
Insert new entity_id_mapping record (Q123 → 1424675744195114)
  ↓
Query Vitess for entity data (using internal_id)
  ↓
RDF Builder generates Turtle RDF
  ↓
Update Application Cache (entity_id_mapping)
  ↓
Return to client (latency: 200-500ms for database + RDF generation)

Cache Invalidation

Immutable Data (Entity ID Mapping)

Never invalidated - External IDs never change

Mutable Data (Entity Head, Entity Metadata)

Invalidated on: - Entity update (successful write) - Revision creation - Metadata change (label/description update)

Strategy:

def update_entity(entity_data: dict):
    external_id = entity_data['external_id']  # e.g., "Q123"
    internal_id = entity_data['internal_id']  # e.g., 1424675744195114

    # 1. Invalidate entity head cache
    entity_head_cache_key = f"entity_head:{internal_id}"
    valkey_client.delete(entity_head_cache_key)

    # 2. Invalidate entity metadata cache
    entity_meta_cache_key = f"entity_meta:{internal_id}"
    valkey_client.delete(entity_meta_cache_key)

    # 3. Write new S3 snapshot
    revision_id = get_next_revision_id()  # Generate new revision ID
    s3.put(f"revisions/{external_id}/r{revision_id}.json", entity_data)

    # 4. Update entity_head in Vitess
    db.update_entity_head(internal_id, revision_id)

    # 5. Update Application Cache
    cache_value = {
        "internal_id": internal_id,
        "head_revision_id": revision_id,
        "updated_at": str(datetime.now(timezone.utc))
    }
    cache_key = f"entity_head:{internal_id}"
    valkey_client.setex(cache_key, 3600, json.dumps(cache_value))  # 1 hour TTL

    return {"status": "success"}

Performance Targets

Expected Hit Rates (from CACHING-STRATEGY.md)

Cache Layer	Expected Hit Rate	Target P50 Latency	Target P99 Latency
Client cache	40-60%	0ms (local)	0ms (local)
CDN cache	70-85%	50ms	200ms
entity_id_mapping cache	>95%	<1ms	<5ms
entity_head cache	80-90%	<2ms	<10ms
Entity metadata cache	70-80%	<5ms	<20ms
Vitess query cache	30-40%	<10ms	<50ms

Full Request Latency (Warm Cache Path)

GET /entity/Q123 (hot entity, cached)
  ↓
Client cache HIT: <1ms
  ↓
App cache HIT (entity_id_mapping): <1ms
  ↓
Vitess cache HIT (entity data): <10ms
  ↓
RDF Builder (on-demand): 50-100ms
  ↓
Response time: 60-120ms (P50), 100-200ms (P99)

Full Request Latency (Cold Cache Path)

GET /entity/Q123 (cold entity)
  ↓
Client cache MISS
  ↓
CDN cache MISS
  ↓
App cache MISS (entity_id_mapping)
  ↓
Vitess query (entity ID mapping): 100-200ms
  ↓
Insert new mapping: 50ms
  ↓
App cache populate
  ↓
Vitess query (entity data): 100-200ms
  ↓
RDF Builder (on-demand): 50-100ms
  ↓
Update caches
  ↓
Response time: 300-500ms

Cost Optimization

Strategy 1: Prefer Cache Over Database

Rule: Check all cache layers before querying database

Cost impact: - Valkey: $0.01/10,000 operations (1 cent per 10K ops) - Vitess: $0.10/10,000 operations (1 cent per 10K ops) - 10x cheaper to cache than query database

Example cost comparison: - Database query: 100ms = 1 cent - Valkey lookup: 1ms = 0.01 cent - 100x cheaper to use cache

Strategy 2: Long TTLs for Immutable Data

Strategy: - S3 snapshots: 1 year (never invalidated) - entity_id_mapping: 1 hour (never invalidated) - entity_head: 5 minutes (rarely invalidated)

Cost impact: - Reduces database queries by >90% for hot entities (Q42, Q5, etc.) - Estimated savings: $500-2000/month at scale

Strategy 3: CDN Over Direct S3 Access

Strategy: - All public reads go through CDN (CloudFront/Cloudflare) - CDN caches S3 snapshots for 1 year - Application only serves cache misses (via S3)

Cost impact: - CDN: $0.085/GB vs S3: $0.09/GB - Data transfer cost similar, but performance much better - S3 operations reduced significantly

Strategy 4: Optimize Cache Memory Usage

Implementation: - Compress cached values (gzip) - Set appropriate TTL to prevent memory bloat - Monitor cache size and adjust TTLs

Example:

import gzip

def compress_cache_value(value: dict) -> bytes:
    """Compress cache value to reduce memory usage"""
    json_str = json.dumps(value)
    return gzip.compress(json_str.encode('utf-8'))

def decompress_cache_value(compressed: bytes) -> dict:
    """Decompress cache value"""
    return json.loads(gzip.decompress(compressed).decode('utf-8'))

Integration with Existing RDF Builder

Cache Layer Responsibilities

The Application Object Cache is responsible for: 1. Entity ID lookup (Q123 → 1424675744195114) 2. Entity head tracking (current revision for entity) 3. Entity metadata (labels, descriptions for search/browse) 4. NOT for storing generated RDF (that's on-demand)

The RDF Builder is responsible for: 1. On-demand RDF generation (reading from Vitess using internal_id) 2. No caching of generated RDF (client/browser handles this) 3. Fast response times using cached metadata

Data Access Pattern

# In API endpoint handler
def get_entity_rdf(external_id: str) -> str:
    """Generate RDF for entity using cached metadata"""

    # Step 1: Lookup internal_id from cache
    cache_key = f"entity_id:{external_id}"
    cached = valkey_client.get(cache_key)

    if cached:
        # Cache hit - use cached internal_id
        internal_id = json.loads(cached)['internal_id']
    else:
        # Cache miss - query Vitess
        mapping = db.query_one(
            "SELECT internal_id FROM entity_id_mapping WHERE external_id = %s",
            (external_id,)
        )
        if not mapping:
            raise NotFoundError(f"Entity {external_id} not found")

        internal_id = mapping['internal_id']

        # Populate cache
        cache_value = json.dumps({"internal_id": internal_id})
        valkey_client.setex(cache_key, 3600, cache_value)  # 1 hour TTL

    # Step 2: Fetch entity data from Vitess (using internal_id)
    entity_data = db.query_entity_by_internal_id(internal_id)

    # Step 3: Generate RDF on-demand
    rdf_builder = EntityConverter(
        property_registry=get_property_registry(),  # From cache
    )
    turtle = rdf_builder.convert_to_string(entity_data)

    # Step 4: Add HTTP caching headers
    headers = {
        "ETag": generate_etag(entity_data),
        "Last-Modified": entity_data['modified_at'],
        "Cache-Control": "public, max-age=3600"
    }

    return turtle, headers

Cache Warming Strategies

Initial Warm-up on Deployment

Purpose: Pre-populate caches with frequently accessed entities

Implementation:

def warm_up_entity_id_cache():
    """Warm up entity_id_mapping cache for top entities"""
    # Query top 10,000 entities by access count (or all if small dataset)
    top_entities = db.query(
        "SELECT external_id, internal_id, entity_type, created_at FROM entity_id_mapping "
        "ORDER BY access_count DESC LIMIT 10000"
    )

    # Batch populate cache
    cache.warm_up([e['external_id'] for e in top_entities])

    logging.info(f"Warmed up cache for {len(top_entities)} entities")

Periodic Warm-up

Schedule: Every 6 hours

Purpose: Refresh cache for expired or high-value entities

# Cron job to warm up cache
0 */6 * * * warm-up-cache.sh

Monitoring

Key Metrics (from CACHING-STRATEGY.md)

Cache Hit Rates

entity_id_mapping_cache_hit_rate
  - labels: {cache: valkey}
  - gauge: 0.95-0.99 (target >95%)
  - alert: <90% for >5 minutes

entity_head_cache_hit_rate
  - labels: {cache: valkey}
  - gauge: 0.80-0.90
  - alert: <70% for >5 minutes

client_cache_hit_rate
  - labels: {source: api}
  - gauge: 0.40-0.60
  - alert: <30% for >1 hour

cdn_cache_hit_rate
  - labels: {cdn: cloudfront}
  - gauge: 0.70-0.85
  - alert: <60% for >15 minutes

Cache Performance

cache_lookup_latency_p50
  - labels: {cache: valkey, operation: entity_id_mapping}
  - gauge: <1ms
  - alert: >2ms for >5 minutes

cache_lookup_latency_p99
  - labels: {cache: valkey, operation: entity_id_mapping}
  - gauge: <5ms
  - alert: >10ms for >5 minutes

valkey_memory_usage_bytes
  - gauge: <50GB (for 10M entities, 100 bytes each)
  - alert: >40GB (90% capacity)
  - alert: >45GB (90% capacity)

Cost Metrics

s3_get_operations_total
  - counter: ~1M/week (baseline)
  - alert: >2M/week (cache not effective)

vitess_query_latency_p99
  - gauge: <50ms
  - alert: >100ms (cache not effective)

Implementation Roadmap

Phase 1: Infrastructure Setup (Week 1-2)

[ ] Configure Valkey/Memcached cluster
[ ] Set up S3 bucket with lifecycle policies
[ ] Configure CDN (CloudFront/Cloudflare)
[ ] Create Vitess tables (entity_id_mapping, entity_head, entity_revisions)
[ ] Create entity metadata tables (labels, descriptions)
[ ] Deploy Application Object Cache
[ ] Set up monitoring (Prometheus + Grafana)

Phase 2: Cache Implementation (Week 3-4)

[ ] Implement entity_id_mapping cache
[ ] Implement entity_head cache
[ ] Implement entity_metadata cache
[ ] Implement cache invalidation logic
[ ] Add Vitess query caching
[ ] Implement warm-up scripts
[ ] Write cache invalidation cron jobs

Phase 3: RDF Endpoint Integration (Week 5-6)

[ ] Create API endpoint for entity RDF generation
[ ] Integrate with Application Object Cache
[ ] Add HTTP caching headers
[ ] Implement compression (gzip, Brotli)
[ ] Add Prometheus metrics
[ ] Load testing and optimization

Phase 4: S3 Integration (Week 7-8)

[ ] Implement S3 snapshot writer
[ ] Configure CDN distribution
[ ] Set up lifecycle policies
[ ] Create revision storage pattern
[ ] Test CDN caching behavior

Phase 5: Optimization (Week 9-10)

[ ] Analyze cache hit rates and optimize TTLs
[ ] Implement cache warm-up scheduling
[ ] Optimize database queries (add indexes)
[ ] Implement CDN bypass for cache misses
[ ] Performance testing at scale

Trade-offs and Considerations

On-Demand vs Pre-Generated RDF

On-Demand (our approach): - ✅ Always fresh data - ✅ No storage costs for pre-generated dumps - ✅ Can handle very large entities efficiently - ✅ Matches MediaWiki's approach - ❌ Higher per-request latency (50-200ms) - ❌ Higher database load

Pre-Generated (Wikidata's approach): - ✅ Very fast (serve static file) - ✅ No database load for reads - ✅ Minimal application CPU - ❌ Data can be up to 1 week stale - ❌ High storage costs (petabytes of RDF) - ❌ Complex change propagation (must regenerate all dependent dumps)

Our Hybrid Strategy: - Use caching for fast lookups (metadata, labels) - Generate RDF on-demand (ensures freshness) - Cache only what's cheap (metadata, not expensive full RDF) - Balance between freshness and cost

Cache Layer Selection

When to use which cache layer:

Data Type	Best Cache Layer	Rationale
External ID → Internal ID	entity_id_mapping	Never changes, cache indefinitely
Entity head	entity_head cache	Changes frequently, need short TTL (5 min)
Entity labels/descriptions	entity_metadata cache	Change occasionally, medium TTL (30 min)
Entity data	S3 (immutable)	Generate on-demand, cache only via CDN

Change Detection and Propagation

Challenge: How do we know when to invalidate caches?

Solution: 1. Entity metadata tables in Vitess: Track entity modified_at timestamp 2. Revision tracking: New revision ID on each update 3. Cache invalidation: Invalidate all related caches on entity update 4. Event streaming: Subscribe to Vitess change events for real-time updates

Simple approach (MVP): - Use modified_at timestamp from entity table - Invalidate entity_head and entity_metadata caches on every update - Let entity_id_mapping cache expire naturally (1 hour)

MVP Implementation Plan

Week 1: Core Caching (Minimal Viable Product)

Goal: 70-80% cache hit rate for hot entities, <200ms response time

Scope: 1. Deploy single Valkey instance (or Memcached cluster) 2. Create entity_id_mapping cache layer 3. Implement simple cache invalidation (1-hour TTL) 4. Add HTTP caching headers to API responses 5. Basic monitoring (cache hit rates, response times)

What's NOT in scope: - Entity head cache - Entity metadata cache - S3 snapshots - CDN distribution - Warm-up scripts

Week 2: Extended Caching

Scope: 1. Add entity_head cache 2. Add entity_metadata cache 3. Implement proper cache invalidation logic 4. Add warm-up scripts 5. Advanced monitoring (Grafana dashboards)

Week 3-4: Performance Optimization

Scope: 1. Database query optimization (add indexes) 2. Vitess query caching 3. API response compression 4. Load testing and optimization 5. CDN integration (optional)

Week 5+: S3 Integration (Optional)

Scope: 1. S3 snapshot writer 2. CDN configuration 3. Lifecycle policies 4. Global replication

Success Criteria

Performance Targets

[ ] P50 latency < 120ms for hot entities (cached)
[ ] P99 latency < 200ms for hot entities (cached)
[ ] P50 latency < 400ms for cold entities (not cached)
[ ] entity_id_mapping cache hit rate > 95%
[ ] entity_head cache hit rate > 80%
[ ] entity_metadata cache hit rate > 70%

Cost Targets

[ ] < 100ms Vitess query latency (cache hit)
[ ] Valkey cost < $200/month at target scale
[ ] S3 cost < $100/month at target scale
[ ] CDN cost < $50/month (if used)

Reliability

[ ] 99.9% uptime
[ ] Graceful degradation (serve from cache if database unavailable)
< 1 minute MTTR (Mean Time To Recover)

Next Steps

Review this document with team to validate architecture
Estimate infrastructure requirements (Valkey size, S3 storage, bandwidth)
Create detailed implementation plan for MVP (Week 1)
Set up monitoring strategy (Prometheus, Grafana, alerting)
Prioritize features based on business value (latency vs. cost)

References

CACHING-STRATEGY.md - Detailed caching strategy
STORAGE-ARCHITECTURE.md - Vitess database design
ENTITY-MODEL.md - Entity data model
SCALING-PROPERTIES.md - System scaling characteristics