Platform Module

Event-Driven Architecture

Apache Kafka is the sole inter-service communication mechanism. There are no synchronous HTTP calls — every service owns its data and communicates exclusively through domain events.

Standardized Schema

Event envelope

Every event follows a standardized schema — consistent structure across all 49 event types, enabling deduplication, tracing, and multi-tenant isolation.

event-envelope.json
{
  "eventId": "550e8400-e29b-41d4-a716...",
  "eventType": "order.created",
  "version": "v1",
  "tenantId": "tenant-123",
  "correlationId": "corr-456",
  "causationId": "event-that-caused-this",
  "actor": { "id": "user-789", "type": "user" }
}
Field          Description
eventId        Unique identifier (UUID v4) for deduplication
eventType      Domain action — order.created, provisioning.completed
version        Schema version for backward compatibility
tenantId       Multi-tenant isolation — events scoped to organization
correlationId  Links all events in a single business flow
causationId    The event that triggered this one (causal chain)
actor          Who triggered the event: user, service, system, or API key
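As a sketch of how such an envelope might be constructed (the `make_envelope` helper is hypothetical; field names match the example above):

```python
import uuid

def make_envelope(event_type, tenant_id, correlation_id, causation_id, actor):
    """Build an event envelope following the standardized schema."""
    return {
        "eventId": str(uuid.uuid4()),    # UUID v4, used for deduplication
        "eventType": event_type,         # e.g. "order.created"
        "version": "v1",                 # schema version for upcasting
        "tenantId": tenant_id,           # multi-tenant isolation
        "correlationId": correlation_id, # links events in one business flow
        "causationId": causation_id,     # the event that caused this one
        "actor": actor,                  # who triggered: user/service/system
    }

env = make_envelope("order.created", "tenant-123", "corr-456",
                    "evt-000", {"id": "user-789", "type": "user"})
```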

Domain Events

49 event types, 7 domains

Every state change emits a typed event. Notification Service is the universal subscriber — it listens to all 24 customer-facing events across every domain.

Event                         Consumers
order.created                 Notification
order.payment-confirmed       Billing
order.provisioning-started    Provisioning
order.fulfilled               Billing, Notification
order.failed                  Notification
order.cancelled               Provisioning, Notification
order.service-delivered       Notification
order.compensation-started
order.compensation-completed
order.compensation-failed
billing.refund-requested      Billing

Publish / Subscribe Matrix

Service       Publishes  Subscribes
Order         11         5
Provisioning  8          2
Billing       10         4
Notification  2          24
Identity      5          0
Product       6          0
Support       6          0

Security & Ordering

Signed, partitioned, verified

Every event is cryptographically signed with HMAC-SHA256, and messages are partitioned by correlationId for strict ordering within business flows.

HMAC-SHA256 Signing

Tamper-proof event verification

01 Producer computes signature over serialized event body
02 Consumer verifies using timing-safe comparison
03 Production: invalid signature → rejected, sent to DLQ
04 Development: invalid signature → warning logged, event still processed
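The signing flow above can be sketched with the Python standard library (the shared key and the serialization details are assumptions):

```python
import hashlib
import hmac
import json

SECRET = b"shared-signing-key"  # hypothetical; the real key comes from config

def sign(event: dict) -> str:
    # Producer: compute HMAC-SHA256 over the serialized event body
    body = json.dumps(event, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(event: dict, signature: str) -> bool:
    # Consumer: timing-safe comparison prevents timing attacks
    return hmac.compare_digest(sign(event), signature)

event = {"eventId": "e-1", "eventType": "order.created"}
sig = sign(event)
tampered = {**event, "eventType": "order.cancelled"}
```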

correlationId Partitioning

Strict ordering within business flows

All events for the same order land on the same partition

Same consumer always processes the same order's events

Strict ordering guaranteed within a single business flow

No out-of-order issues for related events
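Key-based partitioning can be illustrated with any stable hash (Kafka's default partitioner actually uses murmur2; this sketch only demonstrates the guarantee that one key always maps to one partition):

```python
def partition_for(correlation_id: str, num_partitions: int = 4) -> int:
    """Map a correlationId to a partition via a stable hash."""
    h = 0
    for ch in correlation_id.encode():
        h = (h * 31 + ch) % 2**32
    return h % num_partitions

# Every event in the same business flow lands on the same partition
p1 = partition_for("corr-456")
p2 = partition_for("corr-456")
```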

Resilience

Retry profiles & error handling

Three retry profiles based on event criticality. Six error categories with deterministic routing — retry with exponential backoff or dead letter queue.

Profile       Retries  Initial delay  Max delay  Used for
Critical      5        500ms          60s        Payment, provisioning
Default       3        1s             30s        Order lifecycle
Non-Critical  2        2s             10s        Notifications, logging

delay = initialDelay × backoffMultiplier^(retryCount − 1), capped at maxDelay
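The formula translates directly to code (the backoff multiplier is not specified above; a multiplier of 2 is assumed here):

```python
def retry_delay(initial: float, multiplier: float, max_delay: float,
                retry_count: int) -> float:
    """delay = initial * multiplier**(retry_count - 1), capped at max_delay."""
    return min(initial * multiplier ** (retry_count - 1), max_delay)

# Critical profile (500ms initial, 60s cap), assuming multiplier 2:
delays = [retry_delay(0.5, 2, 60, n) for n in range(1, 6)]
# 0.5s, 1s, 2s, 4s, 8s — all under the 60s cap
```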

Error Classification

TransientRetry

Timeout, connection refused

Resource BusyRetry

Rate limited, resource locked

ValidationDLQ

Invalid payload, missing field

Not FoundDLQ

Referenced entity doesn't exist

PermissionDLQ

Unauthorized access

InconsistencyDLQ

Version conflict
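The deterministic routing reduces to a lookup table; a minimal sketch (category identifiers are condensed forms of the names above):

```python
# Each error category routes to exactly one of two outcomes
ROUTING = {
    "Transient": "RETRY",
    "ResourceBusy": "RETRY",
    "Validation": "DLQ",
    "NotFound": "DLQ",
    "Permission": "DLQ",
    "Inconsistency": "DLQ",
}

def route(category: str) -> str:
    # Unknown categories go to the DLQ rather than retrying forever
    return ROUTING.get(category, "DLQ")
```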

Dead Letter Queue

Zero message loss

Failed messages that exhaust all retries go to the Dead Letter Queue. If Kafka itself is unavailable, file fallback ensures zero message loss.

Topic Naming

Topic: cloud-factory.dev.order.fulfilled.v1
DLQ:   cloud-factory.dev.order.fulfilled.v1.dlq
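The naming pattern can be sketched as a helper (the segment order is inferred from the example above; the function itself is hypothetical):

```python
def topic_name(env: str, domain: str, event: str, version: str,
               dlq: bool = False) -> str:
    """Assumed pattern: cloud-factory.<env>.<domain>.<event>.<version>[.dlq]"""
    base = f"cloud-factory.{env}.{domain}.{event}.{version}"
    return base + ".dlq" if dlq else base
```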

Action         Trigger                             Result
Retry          Transient error, under retry limit  Re-publish to original topic with backoff
Discard        Corrupt data, invalid schema        Log and abandon
Alert          Threshold exceeded (10+/hr)         Trigger monitoring alert
Manual Review  All DLQ retries exhausted           Create incident for ops team

2-Layer Resilience

File fallback when Kafka is unavailable

01 Kafka down: messages written to local JSONL files
02 Kafka reconnects: automatic replay from fallback files
03 Replay fails: moved to .failed files for investigation

Example fallback file: /tmp/dlq-fallback/2026-03-14.jsonl
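A minimal sketch of the JSONL file fallback, assuming one dated file per day as in the example path (the helper is hypothetical):

```python
import json
import os
import tempfile
from datetime import date

def fallback_write(event: dict, directory: str) -> str:
    """Append the event as one JSON line to a dated fallback file."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{date.today().isoformat()}.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return path

# Demo against a temp dir; production would use the configured fallback dir
tmp = tempfile.mkdtemp()
path = fallback_write({"eventId": "e-1"}, tmp)
lines = open(path).read().splitlines()
```

On reconnect, a replay job would read each line back, re-publish it to Kafka, and move any line that fails again into a .failed file.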
Evolution & Safety

Schema evolution & idempotency

Backward-compatible evolution via automatic upcasting. Duplicate detection using database-backed idempotency keys — no double processing, ever.

Automatic Upcasting

v1 → v2 → v3
Old producers keep sending v1 indefinitely
Consumers upcast on read (v1 → v2 → v3 chain)
No reprocessing required when schema changes
New required fields get sensible defaults during upcast
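A sketch of the upcasting chain (the specific v2/v3 field additions are invented for illustration):

```python
# Hypothetical upcasters: each step handles exactly one version bump
def v1_to_v2(e):
    # assume v2 added tenantId; old events get a sensible default
    return {**e, "version": "v2", "tenantId": e.get("tenantId", "default")}

def v2_to_v3(e):
    # assume v3 added actor; default to a system actor
    return {**e, "version": "v3", "actor": e.get("actor", {"type": "system"})}

UPCASTERS = {"v1": v1_to_v2, "v2": v2_to_v3}

def upcast(event):
    # Chain upcasts on read until the event reaches the latest version
    while event["version"] in UPCASTERS:
        event = UPCASTERS[event["version"]](event)
    return event
```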

Idempotent Processing

01 Event arrives: compute hash(eventId + correlationId + handler)
02 Check database: key exists → return cached result (duplicate)
03 Process new: acquire lock, process, store result
04 Cache result: success for 24h, errors for 30s (allow retry)
05 Cleanup: periodic job purges expired keys

hash(eventId + correlationId + handler) → check → process → cache
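The check → process → cache loop can be sketched as follows (an in-memory dict stands in for the database; locking and TTL expiry are omitted):

```python
import hashlib

_store = {}  # stands in for the database-backed idempotency key table

def idempotency_key(event_id, correlation_id, handler):
    raw = f"{event_id}:{correlation_id}:{handler}".encode()
    return hashlib.sha256(raw).hexdigest()

def process_once(event_id, correlation_id, handler_name, fn):
    key = idempotency_key(event_id, correlation_id, handler_name)
    if key in _store:          # duplicate → return cached result
        return _store[key]
    result = fn()              # real code would also acquire a lock here
    _store[key] = result       # and cache with a TTL (24h success / 30s error)
    return result

calls = []
handler = lambda: calls.append(1) or "ok"
process_once("e-1", "c-1", "billing", handler)
process_once("e-1", "c-1", "billing", handler)  # duplicate, not re-run
```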
Observability

End-to-end distributed tracing

W3C Trace Context headers propagated through every Kafka message. Complete visibility from HTTP request through event processing to downstream spans.

Trace Propagation

HTTP Request → Kafka Event → Consumer Processing → Downstream Span

Context Headers

Header            Purpose
traceparent       Injected by producer into every Kafka header
tracestate        Vendor-specific trace data propagated end-to-end
traceId + spanId  Included in every log entry for correlation
correlationId     Business-level linking across all related events
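Building the traceparent header a producer injects might look like this (format per the W3C Trace Context spec: version-traceid-spanid-flags; the helper is hypothetical):

```python
import secrets

def make_traceparent(trace_id=None, span_id=None):
    """Build a W3C traceparent header; flags 01 = sampled."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

# Injected into Kafka message headers by the producer
headers = [("traceparent", make_traceparent().encode())]
```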

Complete Flow

Order fulfillment, event by event

From customer order to delivered credentials — every step is an event. Failure triggers automatic compensation with refund coordination.

Success Path

01 Customer places order
   order.created → Notification → Email + In-App

02 Payment confirmed
   order.provisioning-started → Provisioning Service

03 Infrastructure deployed
   provisioning.completed → Order, Billing, Notification

04 Order fulfilled
   order.fulfilled → Billing → Invoice, Notification → Email

05 Credentials delivered
   order.service-delivered → Customer receives VPS credentials

Failure & Compensation

01 Provisioning fails (non-retryable)
   billing.refund-requested → Billing → Stripe refund

02 Refund completed
   billing.refund-completed → Order Saga → coordination

03 Order marked failed
   order.failed → Notification → Failure email + refund confirmation

Compensation Guarantees

Compensation stack built in reverse — last provisioned resource rolled back first
Cancel provisioning — deprovision any already-created infrastructure
Refund payment — automatic refund request to billing service
Retry with exponential backoff — 2s → 4s → 8s → 16s → 30s cap
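The reverse-order rollback can be sketched as a simple stack walk (step names are hypothetical):

```python
def compensate(saga_steps):
    """Roll back completed steps in reverse (LIFO) order:
    the last provisioned resource is rolled back first."""
    rolled_back = []
    for step in reversed(saga_steps):
        rolled_back.append(f"undo:{step}")  # real code invokes the undo action
    return rolled_back

steps = ["reserve-payment", "create-vm", "assign-ip"]
plan = compensate(steps)
```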
Guarantees

Platform guarantees

Seven guarantees that define how every event flows through the platform. At-least-once delivery, no duplicates, strict ordering, tamper-proof, and fully traceable.

Guarantee               Mechanism
At-least-once delivery  Kafka replication + consumer offset commits
No duplicates           Idempotency keys in database
Ordering                correlationId-based partitioning
No message loss         DLQ + file fallback
Tamper-proof            HMAC-SHA256 event signing
Traceable               W3C Trace Context + correlation IDs
Backward-compatible     Automatic schema upcasting

Every state change, an event

49 event types across 7 domains. Loose coupling, guaranteed delivery, and complete observability — built on Apache Kafka.

FAQ

Common Questions

Why Kafka instead of synchronous HTTP?

Kafka provides loose coupling, independent deployability, and guaranteed message delivery. Services own their data and communicate through domain events — there are no synchronous HTTP calls between services. This makes the system resilient to individual service failures.

How are duplicate events prevented?

Every consumer uses database-backed idempotency keys. When an event arrives, we compute hash(eventId + correlationId + handler) and check the database. Duplicates return the cached result. Success is cached for 24 hours, errors for 30 seconds to allow retry.

What happens when Kafka is unavailable?

We have a 2-layer resilience model. If Kafka is unavailable when sending to DLQ, messages are written to local JSONL fallback files. When Kafka reconnects, automatic replay kicks in. Failed replays move to .failed files for manual investigation. Zero message loss guaranteed.

How is event ordering guaranteed?

Messages are partitioned by correlationId. All events for the same order land on the same partition, ensuring the same consumer always processes that order's events in strict sequence. No out-of-order issues for related events.

How do schemas evolve without breaking consumers?

Events support backward-compatible evolution via automatic upcasting. Old producers keep sending v1 indefinitely. Consumers upcast on read through the v1 → v2 → v3 chain. New required fields get sensible defaults during upcast. No reprocessing required.