Event-Driven Architecture on AWS: Connecting Legacy and Modern Systems
One of the more satisfying architectural challenges I have worked on was connecting a collection of legacy on-premises healthcare systems with a growing set of modern cloud services on AWS. The legacy systems were not going away anytime soon. They processed real claims for real patients every day. But the new services we were building needed to react to events happening in those legacy systems in near real-time.
The answer was event-driven architecture, and AWS gave us a solid toolkit to build it with.
The Problem with Direct Integration
The naive approach would have been to have each new service call the legacy APIs directly. Service A needs claims data? Call the legacy claims API. Service B needs payment updates? Poll the legacy payment database. This path leads somewhere terrible: every new service becomes tightly coupled to the legacy system's data model, authentication scheme, and availability. If the legacy system goes down for maintenance, everything downstream breaks.
We had seen this play out before and did not want to repeat it.
EventBridge as the Central Nervous System
Amazon EventBridge became our event bus, the central routing layer that decoupled event producers from consumers. Legacy systems published events to EventBridge, and downstream services subscribed to the events they cared about. No service needed to know about any other service. They just needed to know about events.
{
  "source": "legacy.claims-system",
  "detail-type": "ClaimSubmitted",
  "detail": {
    "claimId": "CLM-2024-78432",
    "providerId": "PRV-1234",
    "payerCode": "MEDICARE",
    "totalAmount": 1247.50,
    "submittedAt": "2024-11-15T14:30:00Z",
    "correlationId": "550e8400-e29b-41d4-a716-446655440000"
  }
}

When we built the data pipeline service three months later, we did not need to modify the claims system at all. We just added a new EventBridge rule that routed ClaimSubmitted events to an SQS queue that the pipeline consumed. Zero changes to existing code.
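The routing itself is just an event pattern on a rule. A sketch of what that pattern might look like (the exact fields depend on how the legacy publisher populates its events; EventBridge patterns match values given as arrays):

```json
{
  "source": ["legacy.claims-system"],
  "detail-type": ["ClaimSubmitted"]
}
```

Any event matching this pattern is delivered to the rule's targets, such as the pipeline's SQS queue, without the publisher knowing or caring.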
SQS for Reliable Processing
EventBridge routes events, but SQS is where reliable processing happens. Every consumer that needed guaranteed delivery got its own SQS queue with a dead letter queue (DLQ) attached.
This was critical in healthcare. A lost claim event is not just a technical failure. It is revenue that does not get tracked, which means services that do not get reimbursed. SQS gave us message persistence, automatic retry with backoff, and a DLQ that captured messages we could not process after multiple attempts.
# Lambda consumer with built-in retry handling
import json
import logging

logger = logging.getLogger()

def handler(event, context):
    for record in event["Records"]:
        body = json.loads(record["body"])
        claim_id = body["detail"]["claimId"]
        try:
            process_claim_event(body["detail"])
            logger.info(f"Processed claim {claim_id}")
        except TransientError:
            # Let SQS retry via visibility timeout
            raise
        except PermanentError as e:
            # Log and let it go to DLQ after max retries
            logger.error(f"Permanent failure for {claim_id}: {e}")
            raise

We monitored DLQ depth as a primary health metric. A growing DLQ meant something was systemically wrong, not just a transient hiccup but a pattern that needed engineering attention.
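Attaching a DLQ is a small piece of queue configuration. A sketch in CloudFormation style (the queue name, ARN, account ID, and retry count are illustrative, not values from our system):

```json
{
  "ClaimsQueue": {
    "Type": "AWS::SQS::Queue",
    "Properties": {
      "QueueName": "claims-events",
      "RedrivePolicy": {
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:claims-events-dlq",
        "maxReceiveCount": 5
      }
    }
  }
}
```

After maxReceiveCount failed receives, SQS moves the message to the dead letter queue instead of retrying forever.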
SNS for Fan-Out
Some events needed to reach multiple consumers at the same time. When a claim was adjudicated (approved or denied), the analytics service needed to know, the notification service needed to alert the provider, and the reporting service needed to update its aggregations.
SNS handled the fan-out. One ClaimAdjudicated event published to an SNS topic, three SQS queues subscribed to that topic, three services processed the event independently. If one consumer was slow or temporarily down, the others were not affected.
ClaimAdjudicated → SNS Topic
├── SQS → Analytics Pipeline
├── SQS → Notification Service
└── SQS → Reporting Aggregator

This fan-out pattern was one of the most useful things we built. Adding a new consumer to an existing event stream took minutes: create a queue, subscribe it to the topic, deploy the consumer. No coordination with other teams, no changes to the publisher.
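One practical detail with SNS-to-SQS fan-out: unless raw message delivery is enabled on the subscription, consumers receive the event wrapped in an SNS notification envelope, with the original payload as a string in the "Message" field. A minimal sketch of unwrapping it (the helper name is ours, not an AWS API):

```python
import json

def unwrap_sns_record(sqs_body: str) -> dict:
    """Extract the original event from an SNS-wrapped SQS message body."""
    envelope = json.loads(sqs_body)
    if envelope.get("Type") == "Notification" and "Message" in envelope:
        # Standard SNS envelope: the published payload is a JSON string
        return json.loads(envelope["Message"])
    # Raw message delivery: the body is the event itself
    return envelope

# Example: a ClaimAdjudicated event as SQS would deliver it from SNS
wrapped = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({"detail-type": "ClaimAdjudicated",
                           "detail": {"claimId": "CLM-2024-78432"}}),
})
event = unwrap_sns_record(wrapped)
print(event["detail"]["claimId"])  # CLM-2024-78432
```

Forgetting this envelope is a classic first-week bug when a consumer is moved from direct SQS delivery to an SNS subscription.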
Lambda for Stateless Event Handlers
Not every event needed a full service to handle it. For lightweight, stateless event processing (enriching an event with additional data, forwarding it to an external system, or triggering a simple workflow) Lambda was the right fit.
We used Lambda for the bridge between legacy and modern systems. Legacy applications published events to EventBridge in their native format. Lambda functions subscribed to those events, translated them into the clean format our modern services expected, and forwarded them to the appropriate SQS queues.
This translation layer was essentially an anti-corruption layer, but for events. Legacy formats stayed in the legacy world. Modern services consumed clean, well-documented events. The Lambda functions in between were small, focused, and easy to test.
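A translation function of this kind can be a few lines of pure Python. A sketch, with the legacy field names (CLM_ID, PRV_NUM, AMT_CENTS, TRACE_ID) invented for illustration; real legacy formats vary:

```python
def translate_legacy_claim(legacy_event: dict) -> dict:
    """Translate a legacy-format claim record into the modern event schema."""
    return {
        "schemaVersion": "1.0",
        "detail-type": "ClaimSubmitted",
        "detail": {
            "claimId": legacy_event["CLM_ID"],
            "providerId": legacy_event["PRV_NUM"],
            # Legacy system stores cents; modern events use dollars
            "totalAmount": legacy_event["AMT_CENTS"] / 100,
            "correlationId": legacy_event.get("TRACE_ID"),
        },
    }

legacy = {"CLM_ID": "CLM-2024-78432", "PRV_NUM": "PRV-1234",
          "AMT_CENTS": 124750,
          "TRACE_ID": "550e8400-e29b-41d4-a716-446655440000"}
modern = translate_legacy_claim(legacy)
print(modern["detail"]["totalAmount"])  # 1247.5
```

Because the function is pure (dict in, dict out), it can be unit tested without any AWS infrastructure, which is exactly what made these Lambdas easy to keep correct.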
Designing Events That Last
After running this system for several months, the most valuable lesson was about event design, not event infrastructure. Infrastructure is relatively easy to change. Event schemas are hard to change because every consumer depends on them.
Events should be facts, not commands. ClaimSubmitted is a fact. It describes something that happened. ProcessClaim is a command. It tells someone what to do. Facts are stable. Commands create coupling.
Always include a correlation ID. When a single business process generates a chain of events (claim submitted, validated, adjudicated, paid) the correlation ID ties them together. Without it, debugging a multi-service flow across CloudWatch logs is detective work. With it, one search gives you the complete story.
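The payoff shows up when reconstructing a flow after the fact. A toy sketch of grouping a mixed event log by correlation ID (the IDs and event mix are invented):

```python
from collections import defaultdict

def group_by_correlation(events: list[dict]) -> dict[str, list[str]]:
    """Group event types by correlationId, preserving arrival order."""
    flows = defaultdict(list)
    for event in events:
        flows[event["detail"]["correlationId"]].append(event["detail-type"])
    return dict(flows)

log = [
    {"detail-type": "ClaimSubmitted",   "detail": {"correlationId": "corr-1"}},
    {"detail-type": "ClaimSubmitted",   "detail": {"correlationId": "corr-2"}},
    {"detail-type": "ClaimValidated",   "detail": {"correlationId": "corr-1"}},
    {"detail-type": "ClaimAdjudicated", "detail": {"correlationId": "corr-1"}},
]
print(group_by_correlation(log)["corr-1"])
# ['ClaimSubmitted', 'ClaimValidated', 'ClaimAdjudicated']
```

This is, in miniature, the one-search-gives-you-the-story property: every event carrying the same correlation ID belongs to the same business process.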
Version your event schemas from the start. We added a schemaVersion field to every event type from day one. When we needed to add fields later, consumers that had not been updated could check the version and handle it gracefully instead of breaking on unexpected fields.
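Graceful version handling on the consumer side can look like this sketch, where the v2 shape (amount split from currency) is a hypothetical evolution, not a schema we actually shipped:

```python
def extract_claim_amount(event: dict) -> float:
    """Read the claim amount, tolerating schema evolution.

    v1 events carry totalAmount as a single number; a hypothetical v2
    splits it into amount and currency. Events without a version field
    are treated as v1 rather than rejected.
    """
    version = event.get("schemaVersion", "1.0")
    detail = event["detail"]
    if version.startswith("2."):
        return detail["amount"]
    return detail["totalAmount"]

v1 = {"schemaVersion": "1.0", "detail": {"totalAmount": 1247.50}}
v2 = {"schemaVersion": "2.0",
      "detail": {"amount": 1247.50, "currency": "USD"}}
print(extract_claim_amount(v1), extract_claim_amount(v2))  # 1247.5 1247.5
```

The key design choice is that the consumer branches on the version it finds instead of assuming one shape, so old consumers keep working while new ones opt into the richer format.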
Dead letter queues are not optional. Every SQS queue had a DLQ. Every DLQ had a CloudWatch alarm. We reviewed DLQ contents weekly and used them to identify patterns: serialization bugs, upstream data quality issues, configuration drift. The DLQ was not just error handling. It was our early warning system.
Was It Worth the Complexity?
Event-driven architecture adds indirection. Events are harder to trace than direct API calls. Eventual consistency requires different mental models than request-response. These are real costs.
But for our use case, connecting legacy systems to modern services without coupling them together, the benefits were clear. We added new consumers to existing event streams without touching the producers. We replaced legacy components without disrupting downstream services. We scaled individual services independently based on their event processing load.
If you are building a system where components need to communicate but should not depend on each other, especially if some of those components are legacy systems you cannot easily modify, event-driven architecture on AWS is a solid approach. The tooling is mature, the managed services handle the hard operational problems, and the patterns are well understood.
Start with EventBridge, add SQS for reliable processing, use SNS when you need fan-out, and let Lambda handle the glue. It is a combination that works well and one I would reach for again.
Alvin Almodal
Cloud & Data Engineering Consultant. Your partner for cloud-native builds and data pipelines.