Platform Module

Event-Driven Architecture

Apache Kafka is the sole inter-service communication mechanism. There are no synchronous HTTP calls — every service owns its data and communicates exclusively through domain events.

Standardized Schema

Event envelope

Every event follows a standardized schema — consistent structure across all 49 event types, enabling deduplication, tracing, and multi-tenant isolation.

event-envelope.json
{
  "eventId": "550e8400-e29b-41d4-a716...",
  "eventType": "order.created",
  "version": "v1",
  "tenantId": "tenant-123",
  "correlationId": "corr-456",
  "causationId": "event-that-caused-this",
  "actor": { "id": "user-789", "type": "user" }
}
Field          Description
eventId        Unique identifier (UUID v4) for deduplication
eventType      Domain action — order.created, provisioning.completed
version        Schema version for backward compatibility
tenantId       Multi-tenant isolation — events scoped to organization
correlationId  Links all events in a single business flow
causationId    The event that triggered this one (causal chain)
actor          Who triggered the event: user, service, system, or API key
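As a sketch of how such an envelope might be constructed (the `make_envelope` helper is hypothetical; field names match the example above):

```python
import uuid

def make_envelope(event_type, tenant_id, correlation_id, causation_id, actor):
    """Build an event envelope following the standardized schema."""
    return {
        "eventId": str(uuid.uuid4()),    # UUID v4, used for deduplication
        "eventType": event_type,         # e.g. "order.created"
        "version": "v1",                 # schema version for upcasting
        "tenantId": tenant_id,           # multi-tenant isolation
        "correlationId": correlation_id, # links events in one business flow
        "causationId": causation_id,     # the event that caused this one
        "actor": actor,                  # who triggered: user/service/system
    }

env = make_envelope("order.created", "tenant-123", "corr-456",
                    "evt-000", {"id": "user-789", "type": "user"})
```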

Domain Events

49 event types, 7 domains

Every state change emits a typed event. Notification Service is the universal subscriber — it listens to all 24 customer-facing events across every domain.

Event                         Consumers
order.created                 Notification
order.payment-confirmed       Billing
order.provisioning-started    Provisioning
order.fulfilled               Billing, Notification
order.failed                  Notification
order.cancelled               Provisioning, Notification
order.service-delivered       Notification
order.compensation-started
order.compensation-completed
order.compensation-failed
billing.refund-requested      Billing

Publish / Subscribe Matrix

Service       Publishes  Subscribes
Order         11         5
Provisioning  8          2
Billing       10         4
Notification  2          24
Identity      5          0
Product       6          0
Support       6          0

Security & Ordering

Signed, partitioned, verified

Every event is cryptographically signed with HMAC-SHA256, and messages are partitioned by correlationId for strict ordering within business flows.

HMAC-SHA256 Signing

Tamper-proof event verification

01 Producer computes signature over serialized event body
02 Consumer verifies using timing-safe comparison
03 Production: invalid signature → rejected, sent to DLQ
04 Development: invalid signature → warning logged, event still processed
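The signing flow above can be sketched with the Python standard library (the shared key and the serialization details are assumptions):

```python
import hashlib
import hmac
import json

SECRET = b"shared-signing-key"  # hypothetical; the real key comes from config

def sign(event: dict) -> str:
    # Producer: compute HMAC-SHA256 over the serialized event body
    body = json.dumps(event, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(event: dict, signature: str) -> bool:
    # Consumer: timing-safe comparison prevents timing attacks
    return hmac.compare_digest(sign(event), signature)

event = {"eventId": "e-1", "eventType": "order.created"}
sig = sign(event)
tampered = {**event, "eventType": "order.cancelled"}
```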

correlationId Partitioning

Strict ordering within business flows

All events for the same order land on the same partition

Same consumer always processes the same order's events

Strict ordering guaranteed within a single business flow

No out-of-order issues for related events
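Key-based partitioning can be illustrated with any stable hash (Kafka's default partitioner actually uses murmur2; this sketch only demonstrates the guarantee that one key always maps to one partition):

```python
def partition_for(correlation_id: str, num_partitions: int = 4) -> int:
    """Map a correlationId to a partition via a stable hash."""
    h = 0
    for ch in correlation_id.encode():
        h = (h * 31 + ch) % 2**32
    return h % num_partitions

# Every event in the same business flow lands on the same partition
p1 = partition_for("corr-456")
p2 = partition_for("corr-456")
```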

Resilience

Retry profiles & error handling

Three retry profiles based on event criticality. Six error categories with deterministic routing — retry with exponential backoff or dead letter queue.

Profile       Retries  Initial delay  Max delay  Used for
Critical      5        500ms          60s        Payment, provisioning
Default       3        1s             30s        Order lifecycle
Non-Critical  2        2s             10s        Notifications, logging

delay = initialDelay × backoffMultiplier^(retryCount − 1), capped at maxDelay
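The formula translates directly to code (the backoff multiplier is not specified above; a multiplier of 2 is assumed here):

```python
def retry_delay(initial: float, multiplier: float, max_delay: float,
                retry_count: int) -> float:
    """delay = initial * multiplier**(retry_count - 1), capped at max_delay."""
    return min(initial * multiplier ** (retry_count - 1), max_delay)

# Critical profile (500ms initial, 60s cap), assuming multiplier 2:
delays = [retry_delay(0.5, 2, 60, n) for n in range(1, 6)]
# 0.5s, 1s, 2s, 4s, 8s — all under the 60s cap
```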

Error Classification

TransientRetry

Timeout, connection refused

Resource BusyRetry

Rate limited, resource locked

ValidationDLQ

Invalid payload, missing field

Not FoundDLQ

Referenced entity doesn't exist

PermissionDLQ

Unauthorized access

InconsistencyDLQ

Version conflict
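The deterministic routing reduces to a lookup table; a minimal sketch (category identifiers are condensed forms of the names above):

```python
# Each error category routes to exactly one of two outcomes
ROUTING = {
    "Transient": "RETRY",
    "ResourceBusy": "RETRY",
    "Validation": "DLQ",
    "NotFound": "DLQ",
    "Permission": "DLQ",
    "Inconsistency": "DLQ",
}

def route(category: str) -> str:
    # Unknown categories go to the DLQ rather than retrying forever
    return ROUTING.get(category, "DLQ")
```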

Dead Letter Queue

Zero message loss

Failed messages that exhaust all retries go to the Dead Letter Queue. If Kafka itself is unavailable, file fallback ensures zero message loss.

Topic Naming

Topic: cloud-factory.dev.order.fulfilled.v1
DLQ:   cloud-factory.dev.order.fulfilled.v1.dlq
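The naming pattern can be sketched as a helper (the segment order is inferred from the example above; the function itself is hypothetical):

```python
def topic_name(env: str, domain: str, event: str, version: str,
               dlq: bool = False) -> str:
    """Assumed pattern: cloud-factory.<env>.<domain>.<event>.<version>[.dlq]"""
    base = f"cloud-factory.{env}.{domain}.{event}.{version}"
    return base + ".dlq" if dlq else base
```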

Action         Trigger                             Result
Retry          Transient error, under retry limit  Re-publish to original topic with backoff
Discard        Corrupt data, invalid schema        Log and abandon
Alert          Threshold exceeded (10+/hr)         Trigger monitoring alert
Manual Review  All DLQ retries exhausted           Create incident for ops team

2-Layer Resilience

File fallback when Kafka is unavailable

01 Kafka down: messages written to local JSONL files
02 Kafka reconnects: automatic replay from fallback files
03 Replay fails: moved to .failed files for investigation

Example fallback file: /tmp/dlq-fallback/2026-03-14.jsonl
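A minimal sketch of the JSONL file fallback, assuming one dated file per day as in the example path (the helper is hypothetical):

```python
import json
import os
import tempfile
from datetime import date

def fallback_write(event: dict, directory: str) -> str:
    """Append the event as one JSON line to a dated fallback file."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{date.today().isoformat()}.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return path

# Demo against a temp dir; production would use the configured fallback dir
tmp = tempfile.mkdtemp()
path = fallback_write({"eventId": "e-1"}, tmp)
lines = open(path).read().splitlines()
```

On reconnect, a replay job would read each line back, re-publish it to Kafka, and move any line that fails again into a .failed file.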
Evolution & Safety

Schema evolution & idempotency

Backward-compatible evolution via automatic upcasting. Duplicate detection using database-backed idempotency keys — no double processing, ever.

Automatic Upcasting

v1 → v2 → v3
Old producers keep sending v1 indefinitely
Consumers upcast on read (v1 → v2 → v3 chain)
No reprocessing required when schema changes
New required fields get sensible defaults during upcast
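A sketch of the upcasting chain (the specific v2/v3 field additions are invented for illustration):

```python
# Hypothetical upcasters: each step handles exactly one version bump
def v1_to_v2(e):
    # assume v2 added tenantId; old events get a sensible default
    return {**e, "version": "v2", "tenantId": e.get("tenantId", "default")}

def v2_to_v3(e):
    # assume v3 added actor; default to a system actor
    return {**e, "version": "v3", "actor": e.get("actor", {"type": "system"})}

UPCASTERS = {"v1": v1_to_v2, "v2": v2_to_v3}

def upcast(event):
    # Chain upcasts on read until the event reaches the latest version
    while event["version"] in UPCASTERS:
        event = UPCASTERS[event["version"]](event)
    return event
```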

Idempotent Processing

01 Event arrives: compute hash(eventId + correlationId + handler)
02 Check database: key exists → return cached result (duplicate)
03 Process new: acquire lock, process, store result
04 Cache result: success for 24h, errors for 30s (allow retry)
05 Cleanup: periodic job purges expired keys

hash(eventId + correlationId + handler) → check → process → cache
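The check → process → cache loop can be sketched as follows (an in-memory dict stands in for the database; locking and TTL expiry are omitted):

```python
import hashlib

_store = {}  # stands in for the database-backed idempotency key table

def idempotency_key(event_id, correlation_id, handler):
    raw = f"{event_id}:{correlation_id}:{handler}".encode()
    return hashlib.sha256(raw).hexdigest()

def process_once(event_id, correlation_id, handler_name, fn):
    key = idempotency_key(event_id, correlation_id, handler_name)
    if key in _store:          # duplicate → return cached result
        return _store[key]
    result = fn()              # real code would also acquire a lock here
    _store[key] = result       # and cache with a TTL (24h success / 30s error)
    return result

calls = []
handler = lambda: calls.append(1) or "ok"
process_once("e-1", "c-1", "billing", handler)
process_once("e-1", "c-1", "billing", handler)  # duplicate, not re-run
```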
Observability

End-to-end distributed tracing

W3C Trace Context headers propagated through every Kafka message. Complete visibility from HTTP request through event processing to downstream spans.

Trace Propagation

HTTP Request → Kafka Event → Consumer Processing → Downstream Span

Context Headers

Header            Purpose
traceparent       Injected by producer into every Kafka header
tracestate        Vendor-specific trace data propagated end-to-end
traceId + spanId  Included in every log entry for correlation
correlationId     Business-level linking across all related events
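Building the traceparent header a producer injects might look like this (format per the W3C Trace Context spec: version-traceid-spanid-flags; the helper is hypothetical):

```python
import secrets

def make_traceparent(trace_id=None, span_id=None):
    """Build a W3C traceparent header; flags 01 = sampled."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

# Injected into Kafka message headers by the producer
headers = [("traceparent", make_traceparent().encode())]
```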

Complete Flow

Order fulfillment, event by event

From customer order to delivered credentials — every step is an event. Failure triggers automatic compensation with refund coordination.

Success Path

01 Customer places order
   order.created → Notification → Email + In-App

02 Payment confirmed
   order.provisioning-started → Provisioning Service

03 Infrastructure deployed
   provisioning.completed → Order, Billing, Notification

04 Order fulfilled
   order.fulfilled → Billing → Invoice, Notification → Email

05 Credentials delivered
   order.service-delivered → Customer receives VPS credentials

Failure & Compensation

01 Provisioning fails (non-retryable)
   billing.refund-requested → Billing → Stripe refund

02 Refund completed
   billing.refund-completed → Order Saga → coordination

03 Order marked failed
   order.failed → Notification → Failure email + refund confirmation

Compensation Guarantees

Compensation stack built in reverse — last provisioned resource rolled back first
Cancel provisioning — deprovision any already-created infrastructure
Refund payment — automatic refund request to billing service
Retry with exponential backoff — 2s → 4s → 8s → 16s → 30s cap
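The reverse-order rollback can be sketched as a simple stack walk (step names are hypothetical):

```python
def compensate(saga_steps):
    """Roll back completed steps in reverse (LIFO) order:
    the last provisioned resource is rolled back first."""
    rolled_back = []
    for step in reversed(saga_steps):
        rolled_back.append(f"undo:{step}")  # real code invokes the undo action
    return rolled_back

steps = ["reserve-payment", "create-vm", "assign-ip"]
plan = compensate(steps)
```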
Guarantees

Platform guarantees

Seven guarantees that define how every event flows through the platform. At-least-once delivery, no duplicates, strict ordering, tamper-proof, and fully traceable.

Guarantee               Mechanism
At-least-once delivery  Kafka replication + consumer offset commits
No duplicates           Idempotency keys in database
Ordering                correlationId-based partitioning
No message loss         DLQ + file fallback
Tamper-proof            HMAC-SHA256 event signing
Traceable               W3C Trace Context + correlation IDs
Backward-compatible     Automatic schema upcasting

Every state change, an event

49 event types across 7 domains. Loose coupling, guaranteed delivery, and complete observability — built on Apache Kafka.

FAQ

Common Questions

Why Kafka instead of synchronous HTTP?

Kafka provides loose coupling, independent deployability, and guaranteed message delivery. Services own their data and communicate through domain events — there are no synchronous HTTP calls between services. This makes the system resilient to individual service failures.

How are duplicate events prevented?

Every consumer uses database-backed idempotency keys. When an event arrives, we compute hash(eventId + correlationId + handler) and check the database. Duplicates return the cached result. Success is cached for 24 hours, errors for 30 seconds to allow retry.

What happens when Kafka is unavailable?

We have a 2-layer resilience model. If Kafka is unavailable when sending to DLQ, messages are written to local JSONL fallback files. When Kafka reconnects, automatic replay kicks in. Failed replays move to .failed files for manual investigation. Zero message loss guaranteed.

How is event ordering guaranteed?

Messages are partitioned by correlationId. All events for the same order land on the same partition, ensuring the same consumer always processes that order's events in strict sequence. No out-of-order issues for related events.

How do schemas evolve without breaking consumers?

Events support backward-compatible evolution via automatic upcasting. Old producers keep sending v1 indefinitely. Consumers upcast on read through the v1 → v2 → v3 chain. New required fields get sensible defaults during upcast. No reprocessing required.