Event-Driven Architecture
Apache Kafka is the sole inter-service communication mechanism. There are no synchronous HTTP calls: every service owns its data and communicates exclusively through domain events.
Event envelope
Every event follows a standardized schema — consistent structure across all 49 event types, enabling deduplication, tracing, and multi-tenant isolation.
{
  "eventId": "550e8400-e29b-41d4-a716...",
  "eventType": "order.created",
  "version": "v1",
  "tenantId": "tenant-123",
  "correlationId": "corr-456",
  "causationId": "event-that-caused-this",
  "actor": { "id": "user-789", "type": "user" }
}
eventId: Unique identifier (UUID v4) for deduplication
eventType: Domain action, e.g. order.created, provisioning.completed
version: Schema version for backward compatibility
tenantId: Multi-tenant isolation; events scoped to organization
correlationId: Links all events in a single business flow
causationId: The event that triggered this one (causal chain)
actor: Who triggered it: user, service, system, or API key
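The envelope above can be sketched as a TypeScript type. The `payload` field and the `makeEvent` factory are illustrative additions, not part of the documented schema:

```typescript
import { randomUUID } from "node:crypto";

// Standard envelope shared by all event types (field names from the schema above).
interface EventEnvelope<T = unknown> {
  eventId: string;            // UUID v4, used for deduplication
  eventType: string;          // e.g. "order.created"
  version: string;            // schema version, e.g. "v1"
  tenantId: string;           // multi-tenant isolation
  correlationId: string;      // links all events in one business flow
  causationId: string | null; // eventId of the event that caused this one
  actor: { id: string; type: "user" | "service" | "system" | "apiKey" };
  payload: T;                 // assumed: domain-specific event body
}

// Illustrative factory: a new event gets a fresh eventId and inherits the
// flow's correlationId, so the whole business flow stays linked.
function makeEvent<T>(
  eventType: string,
  payload: T,
  flow: { tenantId: string; correlationId: string; causationId?: string },
  actor: EventEnvelope["actor"],
): EventEnvelope<T> {
  return {
    eventId: randomUUID(),
    eventType,
    version: "v1",
    tenantId: flow.tenantId,
    correlationId: flow.correlationId,
    causationId: flow.causationId ?? null,
    actor,
    payload,
  };
}
```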
49 event types, 7 domains
Every state change emits a typed event. Notification Service is the universal subscriber — it listens to all 24 customer-facing events across every domain.
Publish / Subscribe Matrix
Domain         Publishes   Subscribes
Order              11           5
Provisioning        8           2
Billing            10           4
Notification        2          24
Identity            5           0
Product             6           0
Support             6           0
Signed, partitioned, verified
Every event is cryptographically signed with HMAC-SHA256, and messages are partitioned by correlationId for strict ordering within business flows.
HMAC-SHA256 Signing
Tamper-proof event verification
Producer computes signature over serialized event body
Consumer verifies using timing-safe comparison
Production: invalid signature → rejected, sent to DLQ
Development: invalid signature → warning logged, processed
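A minimal sketch of the sign/verify pair using Node's crypto module; the function names and shared-secret scheme are assumptions, but the timing-safe comparison mirrors the behavior described above:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Producer side: compute an HMAC-SHA256 signature over the serialized event body.
function signEvent(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Consumer side: recompute the signature and compare with a timing-safe check,
// so attackers cannot learn the signature byte-by-byte from response timing.
function verifyEvent(body: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signEvent(body, secret), "hex");
  const actual = Buffer.from(signature, "hex");
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

In production a failed `verifyEvent` would route the message to the DLQ; in development it would only log a warning.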
correlationId Partitioning
Strict ordering within business flows
All events for the same order land on the same partition
Same consumer always processes the same order's events
Strict ordering guaranteed within a single business flow
No out-of-order issues for related events
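In practice this falls out of setting the Kafka message key to the correlationId and letting the broker's default partitioner (murmur2 over the key) pick the partition. A sketch of the deterministic mapping, with SHA-256 as an illustrative stand-in for the hash:

```typescript
import { createHash } from "node:crypto";

// Deterministic partition choice: the same correlationId always maps to the
// same partition, so one consumer sees all events of a business flow in order.
// (Kafka's default partitioner uses murmur2 on the message key; SHA-256 here
// is an illustrative stand-in with the same determinism property.)
function partitionFor(correlationId: string, numPartitions: number): number {
  const digest = createHash("sha256").update(correlationId).digest();
  return digest.readUInt32BE(0) % numPartitions;
}
```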
Retry profiles & error handling
Three retry profiles based on event criticality. Six error categories with deterministic routing — retry with exponential backoff or dead letter queue.
Profile        Retries   Initial delay   Max delay   Used for
Critical           5        500ms           60s      Payment, provisioning
Default            3        1s              30s      Order lifecycle
Non-Critical       2        2s              10s      Notifications, logging
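Assuming the backoff doubles on each attempt (the doubling factor is not stated above, only the initial and max delays), the three profiles can be sketched as:

```typescript
// Retry profiles from the table above: attempt n waits initial * 2^n, capped at max.
const PROFILES = {
  critical:    { retries: 5, initialMs: 500,   maxMs: 60_000 }, // payment, provisioning
  default:     { retries: 3, initialMs: 1_000, maxMs: 30_000 }, // order lifecycle
  nonCritical: { retries: 2, initialMs: 2_000, maxMs: 10_000 }, // notifications, logging
} as const;

type Profile = keyof typeof PROFILES;

// Delay before retry attempt `attempt` (0-based). Returns null once the retry
// budget is exhausted and the message should go to the dead letter queue.
function nextDelayMs(profile: Profile, attempt: number): number | null {
  const p = PROFILES[profile];
  if (attempt >= p.retries) return null;
  return Math.min(p.initialMs * 2 ** attempt, p.maxMs);
}
```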
Error Classification
Timeout, connection refused
Rate limited, resource locked
Invalid payload, missing field
Referenced entity doesn't exist
Unauthorized access
Version conflict
Zero message loss
Failed messages that exhaust all retries go to the Dead Letter Queue. If Kafka itself is unavailable, file fallback ensures zero message loss.
Retry: transient error, under retry limit → re-publish to original topic with backoff
Discard: corrupt data, invalid schema → log and abandon
Alert: threshold exceeded (10+/hr) → trigger monitoring alert
Manual Review: all DLQ retries exhausted → create incident for ops team
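A sketch of the deterministic routing; the six category names are assumptions inferred from the error descriptions above:

```typescript
// Assumed names for the six error categories listed earlier:
type ErrorCategory =
  | "transient"      // timeout, connection refused
  | "contention"     // rate limited, resource locked
  | "validation"     // invalid payload, missing field
  | "notFound"       // referenced entity doesn't exist
  | "authorization"  // unauthorized access
  | "conflict";      // version conflict

type Route = "retry" | "dlq" | "discard";

// Deterministic routing: the same error always takes the same path.
function routeError(category: ErrorCategory, attempt: number, maxRetries: number): Route {
  switch (category) {
    case "transient":
    case "contention":
      // Retryable: back off until the profile's retry budget is spent, then DLQ.
      return attempt < maxRetries ? "retry" : "dlq";
    case "validation":
      // Corrupt data / invalid schema: retrying cannot help, log and abandon.
      return "discard";
    default:
      // Non-retryable business errors go straight to the DLQ for review.
      return "dlq";
  }
}
```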
2-Layer Resilience
File fallback when Kafka is unavailable
Kafka down: messages written to local JSONL files
Kafka reconnects: automatic replay from fallback files
Replay fails: moved to .failed files for investigation
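A minimal sketch of the file fallback, with synchronous I/O and illustrative file names; a real implementation would publish asynchronously and guard against concurrent replays:

```typescript
import { appendFileSync, existsSync, readFileSync, writeFileSync } from "node:fs";

// Layer 2 of the resilience model: if publishing fails, append the event as a
// single JSON line to a local fallback file. (File names are illustrative.)
const FALLBACK = "kafka-fallback.jsonl";

function writeFallback(event: object): void {
  appendFileSync(FALLBACK, JSON.stringify(event) + "\n");
}

// On reconnect, replay every line; `publish` returns false when Kafka rejects
// the message again, in which case the line moves to a .failed file for
// manual investigation instead of being dropped.
function replayFallback(publish: (e: object) => boolean): void {
  if (!existsSync(FALLBACK)) return;
  const lines = readFileSync(FALLBACK, "utf8").split("\n").filter(Boolean);
  const failed = lines.filter((line) => !publish(JSON.parse(line)));
  if (failed.length > 0) {
    appendFileSync(FALLBACK + ".failed", failed.join("\n") + "\n");
  }
  writeFileSync(FALLBACK, ""); // replayed (or quarantined) lines are cleared
}
```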
Schema evolution & idempotency
Backward-compatible evolution via automatic upcasting. Duplicate detection using database-backed idempotency keys — no double processing, ever.
Automatic Upcasting
Idempotent Processing
1. Event arrives: compute hash(eventId + correlationId + handler)
2. Check database: key exists → return cached result (duplicate)
3. Process new: acquire lock, process, store result
4. Cache result: success for 24h, errors for 30s (to allow retry)
5. Cleanup: periodic job purges expired keys
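The steps above can be sketched as a wrapper around any event handler, with an in-memory map standing in for the database-backed store:

```typescript
import { createHash } from "node:crypto";

// In-memory stand-in for the database-backed idempotency store
// (production uses a database so the check survives restarts).
const seen = new Map<string, { ok: boolean; result: unknown; expiresAt: number }>();

function idempotencyKey(eventId: string, correlationId: string, handler: string): string {
  return createHash("sha256").update(eventId + correlationId + handler).digest("hex");
}

// Runs `fn` at most once per (eventId, correlationId, handler). Duplicates get
// the cached outcome. TTLs follow the text: success 24h, errors 30s to allow retry.
function processOnce<T>(
  eventId: string, correlationId: string, handler: string,
  fn: () => T, now: number = Date.now(),
): T {
  const key = idempotencyKey(eventId, correlationId, handler);
  const hit = seen.get(key);
  if (hit && hit.expiresAt > now) {
    if (hit.ok) return hit.result as T; // duplicate: return cached success
    throw new Error("cached failure");  // duplicate of a recent failure
  }
  try {
    const result = fn();
    seen.set(key, { ok: true, result, expiresAt: now + 24 * 60 * 60 * 1000 });
    return result;
  } catch (err) {
    seen.set(key, { ok: false, result: undefined, expiresAt: now + 30 * 1000 });
    throw err;
  }
}
```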
End-to-end distributed tracing
W3C Trace Context headers propagated through every Kafka message. Complete visibility from HTTP request through event processing to downstream spans.
Trace Propagation
Context Headers
traceparent: injected by the producer into every Kafka message header
tracestate: vendor-specific trace data propagated end-to-end
Trace ID: included in every log entry for correlation
correlationId: business-level linking across all related events
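A sketch of building a W3C `traceparent` value and injecting it into a Kafka message; the `x-correlation-id` header name and the message shape are assumptions:

```typescript
import { randomBytes } from "node:crypto";

// W3C Trace Context: `traceparent` is version-traceId-parentSpanId-flags.
// The producer injects it into Kafka message headers so every consumer can
// continue the same trace across async hops.
function makeTraceparent(traceId?: string, spanId?: string): string {
  return [
    "00",                                        // spec version
    traceId ?? randomBytes(16).toString("hex"),  // 16-byte trace id
    spanId ?? randomBytes(8).toString("hex"),    // 8-byte span id
    "01",                                        // flags: sampled
  ].join("-");
}

// Illustrative producer-side injection (kafkajs-style message shape assumed).
function withTraceHeaders(
  message: { key: string; value: string; headers?: Record<string, string> },
  traceparent: string,
  correlationId: string,
) {
  return {
    ...message,
    headers: { ...message.headers, traceparent, "x-correlation-id": correlationId },
  };
}
```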
Order fulfillment, event by event
From customer order to delivered credentials — every step is an event. Failure triggers automatic compensation with refund coordination.
Success Path
Customer places order
Payment confirmed
Infrastructure deployed
Order fulfilled
Credentials delivered
Failure & Compensation
Provisioning fails (non-retryable)
Refund completed
Order marked failed
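The failure path above can be sketched as a small saga runner; this is illustrative, the real compensation coordinates a refund through Billing events rather than local callbacks:

```typescript
// Run steps in order; on a non-retryable failure, undo the completed steps in
// reverse (e.g. refund the payment, then mark the order failed).
type SagaStep = { name: string; run: () => void; compensate: () => void };

function runSaga(steps: SagaStep[]): string[] {
  const completed: SagaStep[] = [];
  const log: string[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
      log.push(`${step.name}: ok`);
    } catch {
      log.push(`${step.name}: failed`);
      // Compensate in reverse order of completion.
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`${done.name}: compensated`);
      }
      break;
    }
  }
  return log;
}
```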
Platform guarantees
Seven guarantees that define how every event flows through the platform. At-least-once delivery, no duplicates, strict ordering, tamper-proof, and fully traceable.
At-least-once delivery: Kafka replication + consumer offset commits
No duplicates: idempotency keys in database
Ordering: correlationId-based partitioning
No message loss: DLQ + file fallback
Tamper-proof: HMAC-SHA256 event signing
Traceable: W3C Trace Context + correlation IDs
Backward-compatible: automatic schema upcasting
Every state change, an event
49 event types across 7 domains. Loose coupling, guaranteed delivery, and complete observability — built on Apache Kafka.
Common Questions
Why event-driven instead of synchronous HTTP?
Kafka provides loose coupling, independent deployability, and guaranteed message delivery. Services own their data and communicate through domain events; there are no synchronous HTTP calls between services. This makes the system resilient to individual service failures.
How are duplicate events handled?
Every consumer uses database-backed idempotency keys. When an event arrives, we compute hash(eventId + correlationId + handler) and check the database. Duplicates return the cached result. Success is cached for 24 hours, errors for 30 seconds to allow retry.
What happens if Kafka itself goes down?
We have a 2-layer resilience model. If Kafka is unavailable when sending to the DLQ, messages are written to local JSONL fallback files. When Kafka reconnects, automatic replay kicks in. Failed replays move to .failed files for manual investigation, guaranteeing zero message loss.
How is event ordering guaranteed?
Messages are partitioned by correlationId. All events for the same order land on the same partition, ensuring the same consumer always processes that order's events in strict sequence. There are no out-of-order issues for related events.
How do event schemas evolve without breaking consumers?
Events support backward-compatible evolution via automatic upcasting. Old producers keep sending v1 indefinitely. Consumers upcast on read through the v1 → v2 → v3 chain. New required fields get sensible defaults during upcast. No reprocessing required.
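A sketch of such an upcaster chain; the v1/v2/v3 field changes are invented for illustration, the real schemas differ:

```typescript
// Consumers read any version and upgrade it step-by-step to the latest schema.
type Upcaster = (event: any) => any;

const upcasters: Record<string, Upcaster> = {
  // Hypothetical: v2 added a required `currency` field, so v1 events get a default.
  v1: (e) => ({ ...e, version: "v2", payload: { ...e.payload, currency: "USD" } }),
  // Hypothetical: v3 renamed `total` to `amount`.
  v2: (e) => {
    const { total, ...rest } = e.payload;
    return { ...e, version: "v3", payload: { ...rest, amount: total } };
  },
};

const LATEST = "v3";

// Old producers can keep emitting v1 forever; consumers upcast on read.
function upcast(event: { version: string; payload: any }): any {
  let e = event;
  while (e.version !== LATEST) e = upcasters[e.version](e);
  return e;
}
```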