Dagster Data Pipeline Platform
The Challenge
Med-Metrix processes millions of healthcare claims across multiple provider networks. Their existing data flows were a patchwork of scheduled SQL jobs, manual CSV imports, and brittle stored procedures — making it nearly impossible to trace data lineage or catch quality issues before they impacted revenue reports.
What I Built
I designed and deployed a production data platform using Dagster as the orchestration layer with a Medallion Architecture (Bronze → Silver → Gold) that brought structure and reliability to the entire data lifecycle.
Bronze Layer — Raw ingestion from legacy SQL Server databases, flat files from vendor FTP drops, and API feeds. Everything lands as-is, timestamped and immutable, into S3-backed storage.
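To make "timestamped and immutable" concrete, here is a minimal sketch of how a write-once object key could be derived for each raw payload. The key layout, the `bronze_key` function, and the `vendor_ftp` source name are hypothetical illustrations, not the platform's actual naming scheme.

```python
import hashlib
from datetime import datetime, timezone


def bronze_key(source: str, filename: str, payload: bytes, ingested_at: datetime) -> str:
    """Build an object key for a raw payload (hypothetical layout).

    The ingestion timestamp and a content hash are both part of the key, so a
    re-delivered or corrected file lands under a new key instead of
    overwriting an earlier one.
    """
    utc = ingested_at.astimezone(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()[:12]
    day = utc.strftime("%Y-%m-%d")
    stamp = utc.strftime("%Y%m%dT%H%M%SZ")
    return f"bronze/{source}/ingest_date={day}/{stamp}_{digest}_{filename}"


if __name__ == "__main__":
    when = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
    print(bronze_key("vendor_ftp", "claims.csv", b"claim_id,amount\n", when))
```

Partitioning by ingestion date also keeps later scans cheap, since a query can skip whole prefixes.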
Silver Layer — Most of the transformation work happens here. dbt models clean, validate, and conform the data: standardizing claim formats across providers, deduplicating records, and flagging anomalies. Data quality checks run at every layer boundary.
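The production logic lived in dbt SQL models, but the three Silver steps (standardize, deduplicate, flag) can be sketched in plain Python. The field names (`claim_id`, `provider`, `billed_amount`) and the "non-positive amount is an anomaly" rule are assumptions chosen for illustration only.

```python
def conform_claims(raw_claims):
    """Standardize, deduplicate, and flag raw claim rows (illustrative rules)."""
    seen_ids = set()
    silver = []
    for claim in raw_claims:
        # Standardize: providers format IDs differently, so normalize first.
        claim_id = claim["claim_id"].strip().upper()
        # Deduplicate: keep only the first row seen for each normalized ID.
        if claim_id in seen_ids:
            continue
        seen_ids.add(claim_id)
        amount = float(claim["billed_amount"])
        silver.append(
            {
                "claim_id": claim_id,
                "provider": claim["provider"].strip().lower(),
                "billed_amount": amount,
                # Flag rather than drop, so reviewers can still see the row.
                "is_anomaly": amount <= 0,
            }
        )
    return silver


if __name__ == "__main__":
    rows = [
        {"claim_id": " c-1 ", "provider": "Acme Health ", "billed_amount": "120.50"},
        {"claim_id": "C-1", "provider": "acme health", "billed_amount": "120.50"},
        {"claim_id": "c-2", "provider": "acme health", "billed_amount": "-5"},
    ]
    for row in conform_claims(rows):
        print(row)
```

Normalizing before deduplicating matters: the first two sample rows only collide once their IDs share a format.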
Gold Layer — Business-ready aggregations powering executive dashboards: revenue by payer, denial rates by procedure code, aging reports, and provider performance metrics.
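As an illustration of one Gold aggregation, the sketch below computes denial rate by procedure code from Silver-style rows. The `procedure_code` and `status` fields, the `"denied"` status value, and the sample codes are assumed for the example; the real models would express this in SQL.

```python
from collections import defaultdict


def denial_rate_by_procedure(claims):
    """Return {procedure_code: denied_claims / total_claims}."""
    totals = defaultdict(int)
    denied = defaultdict(int)
    for claim in claims:
        code = claim["procedure_code"]
        totals[code] += 1
        if claim["status"] == "denied":
            denied[code] += 1
    # Every code in totals has at least one claim, so division is safe.
    return {code: denied[code] / totals[code] for code in totals}


if __name__ == "__main__":
    sample = [
        {"procedure_code": "99213", "status": "paid"},
        {"procedure_code": "99213", "status": "denied"},
        {"procedure_code": "70450", "status": "denied"},
    ]
    print(denial_rate_by_procedure(sample))
```

Pre-computing tables like this one is what lets a dashboard read a small result set instead of scanning claim-level data on every load.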
Key Technical Decisions
- Dagster over Airflow: Software-defined assets gave us explicit dependency graphs and built-in data lineage tracking. The development experience was significantly better: assets are plain Python functions that can be unit-tested directly, rather than tasks wired together inside a DAG definition.
- AWS Glue + Athena for ad-hoc analytics: Instead of loading everything into a data warehouse upfront, we built a queryable data lake. Analysts could explore Silver/Gold data with SQL without waiting for ETL to finish.
- Terraform + GitHub Actions for infrastructure as code and CI/CD: Every piece of infrastructure (ECS tasks, S3 buckets, Glue jobs, IAM roles) was version-controlled and deployed through CI/CD.
Impact
- Replaced fragile nightly SQL jobs with observable, retry-capable pipelines
- Cut report generation time from hours to minutes by pre-computing Gold aggregations
- Gave the data team self-service access to clean data via Athena — no more ad-hoc requests to engineering
- Migrated the entire pipeline from on-premises to AWS ECS Fargate, eliminating server maintenance overhead
What I Learned
Building data platforms in healthcare taught me that data quality is not optional — it is the product. A pipeline that runs successfully but produces bad numbers is worse than one that fails loudly. Every boundary between layers became a checkpoint, and that discipline made all the difference.
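A minimal sketch of the "fail loudly" idea, assuming rows are dictionaries and that the only rules are "at least one row" and "required fields are present". `DataQualityError` and `check_boundary` are hypothetical names, not part of Dagster or dbt.

```python
class DataQualityError(Exception):
    """Raised when data crossing a layer boundary breaks a quality rule."""


def check_boundary(rows, required_fields, layer):
    """Return rows unchanged if they pass; otherwise raise with every problem found."""
    problems = []
    if not rows:
        problems.append("no rows received")
    for index, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {index} is missing {missing}")
    if problems:
        # Stop the run here so bad data never reaches the next layer.
        raise DataQualityError(f"{layer}: " + "; ".join(problems))
    return rows


if __name__ == "__main__":
    good = [{"claim_id": "C-1", "payer": "acme"}]
    check_boundary(good, ["claim_id", "payer"], "silver->gold")
    try:
        check_boundary([{"claim_id": "C-2", "payer": ""}], ["claim_id", "payer"], "silver->gold")
    except DataQualityError as err:
        print(f"blocked: {err}")
```

Collecting all problems before raising gives whoever investigates the failure a full picture in one run.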