Building Data Pipelines with Dagster and the Medallion Architecture


I spent about two years building data pipelines for a healthcare company. We processed claims data from multiple provider networks, ran it through a medallion architecture, and served it up for analytics that fed directly into revenue decisions. Dagster was the orchestrator at the center of it, and I think it is the best tool for this kind of work right now.

Here is why, and how we built it.

Why Not Airflow?

Airflow is the default answer when someone says "data orchestration." I have used it. It works. But when we evaluated our options, Dagster won on three points that mattered in practice.

Software-defined assets are the big one. In Airflow, you define tasks and dependencies between tasks. In Dagster, you define assets, the actual data artifacts your pipeline produces, and the framework figures out the execution order. Instead of "run task A then task B," you think "this dataset depends on that dataset." The dependency graph becomes the data lineage.

Testing as a first-class citizen. Each Dagster asset is a Python function. You can unit test it. You can pass it mock inputs. You can run a single asset in isolation. Try doing that with an Airflow DAG. It is possible, but it feels like fighting the framework.

The development experience. Dagster's local UI (Dagit) lets you materialize assets, inspect their metadata, see the full lineage graph, and debug failures, all without deploying anything. That tight feedback loop during development was invaluable.

The Medallion Architecture in Practice

If you have not encountered the medallion pattern before, here is the idea: organize your data lake into three layers based on data quality and transformation level.

Bronze: Raw Ingestion

Everything lands here exactly as it arrives from the source. No transformations, no filtering. We timestamped every ingestion and stored raw data in S3 as Parquet files. The principle is simple: you can always re-derive Silver and Gold from Bronze, but you cannot recover data you threw away at ingestion.

import os
from datetime import datetime

import pandas as pd
from dagster import AssetExecutionContext, AssetKey, asset
from sqlalchemy import create_engine, text


@asset(group_name="bronze", compute_kind="python")
def bronze_claims(context: AssetExecutionContext) -> pd.DataFrame:
    """Ingest raw claims from the source database."""
    engine = create_engine(os.environ["SOURCE_DB_URL"])
    query = text("SELECT * FROM dbo.Claims WHERE ModifiedDate > :last_run")
    # The latest materialization event carries a Unix timestamp; fall back
    # to a full load when the asset has never been materialized.
    event = context.instance.get_latest_materialization_event(
        AssetKey("bronze_claims")
    )
    last_run = datetime.fromtimestamp(event.timestamp) if event else datetime(1900, 1, 1)
    df = pd.read_sql(query, engine, params={"last_run": last_run})
    context.log.info(f"Ingested {len(df)} new/modified claims")
    return df

In practice, we had dozens of Bronze assets: claims, payments, provider rosters, eligibility files, denial codes. Each one was a separate asset with its own schedule and its own data quality expectations.

Silver: Clean and Conform

This is where dbt earned its keep. Silver layer transformations clean the data, standardize formats, resolve duplicates, and validate business rules. We ran our dbt models through Dagster's dbt integration, so dbt runs were visible in the same lineage graph as our Python assets.

The key discipline at the Silver layer: every transformation must be documented and testable. When a revenue report shows unexpected numbers, the first question is always "what changed in the data?" Silver layer documentation and dbt tests gave us the answer.

# dbt properties file: silver_claims.yml
version: 2

models:
  - name: silver_claims
    description: "Cleaned and deduplicated claims with standardized payer codes"
    columns:
      - name: claim_id
        tests:
          - unique
          - not_null
      - name: payer_code
        tests:
          - not_null
          - accepted_values:
              values: ['MEDICARE', 'MEDICAID', 'BCBS', 'AETNA', 'CIGNA', 'UHC', 'OTHER']
    tests:
      - dbt_utils.recency:
          datepart: day
          field: modified_date
          interval: 2

Gold: Business-Ready Analytics

Gold assets are the aggregations and metrics that business users actually consume. Revenue by payer. Denial rates by procedure code. Aging reports. Provider performance scorecards.

These assets were materialized on schedule and served to dashboards via Athena queries. Because they are Dagster assets, the framework knows their upstream dependencies. If a Bronze ingestion fails, Dagster knows not to materialize the downstream Gold assets, and it tells you exactly which reports are affected.

Lessons from Production

After running this system for several months in production, here are the things I wish I had known from the start.

Partition from day one. We started without partitions and had to retrofit them later. Daily partitions on Bronze assets would have saved us significant reprocessing costs and made incremental runs straightforward from the beginning. If you are reading this before building your pipeline, partition everything.

Data quality is not a phase, it is continuous. We added quality checks at every layer boundary. Bronze-to-Silver checks caught format issues and missing fields. Silver-to-Gold checks caught business logic violations. Without these checks, bad data silently corrupts reports, and nobody notices until a stakeholder questions the numbers in a meeting.
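The boundary checks themselves do not need to be elaborate. A stdlib sketch of the Bronze-to-Silver shape, "reject rows that are structurally unusable" (field names invented):

```python
REQUIRED_FIELDS = {"claim_id", "payer_code", "service_date"}


def split_valid_rows(rows: list[dict], required: set[str] = REQUIRED_FIELDS):
    """Partition rows into (valid, rejected) based on required-field presence."""
    valid, rejected = [], []
    for row in rows:
        present = {k for k, v in row.items() if v is not None}
        (valid if required <= present else rejected).append(row)
    return valid, rejected
```

The rejected rows went to a quarantine location with the reason attached, so bad data was visible instead of silently dropped.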

Monitor data freshness, not just pipeline success. A pipeline can succeed and still produce stale data if the source system did not update. We built freshness monitors that alerted when a Gold asset had not been materialized within its expected window, regardless of whether the pipeline itself was green.

Use S3 plus Athena for the query layer. Instead of loading Gold data into a traditional data warehouse, we kept it in S3 and queried it with Athena. For our use case (analysts running ad-hoc queries and dashboards refreshing periodically) this was significantly cheaper and simpler than maintaining a warehouse. Glue crawlers kept the Athena catalog updated automatically.
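Querying the Gold layer then reduces to Athena API calls. A sketch of the request shape (database and bucket names invented; the keys match the parameters boto3's start_query_execution expects):

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build kwargs for boto3's athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Usage (not executed here): boto3.client("athena").start_query_execution(**request)
request = athena_query_request(
    "SELECT payer_code, total_paid FROM gold_revenue_by_payer",
    database="gold",
    output_location="s3://claims-lake/athena-results/",
)
```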

Should You Use Dagster?

If you are building data pipelines that produce multiple interdependent datasets, need clear data lineage, and want a development experience that does not feel like fighting your tools, yes. Dagster is the best option I have used, and it keeps getting better with every release.

The medallion architecture is not tied to Dagster specifically, but they complement each other well. The Bronze/Silver/Gold layers give you a clear organizational model, and Dagster's software-defined assets make that model executable, testable, and observable.

I went into this project expecting to use Airflow and came out a Dagster advocate. Sometimes the less popular tool is the better tool.

Alvin Almodal

Cloud & Data Engineering Consultant. Your partner for cloud-native builds and data pipelines.