2 min read · by auxin

Idempotency: the feature no one asks for

Idempotent pipelines aren't a nice-to-have — they're the only kind of pipeline that survives an on-call rotation.

  • #data-engineering
  • #pipelines
  • #reliability

Every data team eventually meets the 3 AM page where a Slack thread starts with "can we just re-run it?" The answer is yes, if and only if your pipelines are idempotent.

What idempotent actually means

Running the pipeline twice with the same input leaves the target in exactly the same state as running it once.

Not "approximately the same." Not "the same plus some duplicates we'll dedupe later." Exactly the same. This is harder than it sounds because most ingestion patterns are append-only by default, so a second run writes every row a second time.
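To make the difference concrete, here is a minimal in-memory sketch. The names `append_load` and `overwrite_load`, the `orders_api` source, and the sample rows are all made up for illustration; a real pipeline would write to a warehouse table rather than a Python list or dict.

```python
def append_load(table: list, rows: list) -> None:
    """Append-only load: every run adds the rows again."""
    table.extend(rows)


def overwrite_load(table: dict, partition_key: tuple, rows: list) -> None:
    """Partition overwrite: every run replaces the partition's contents."""
    table[partition_key] = list(rows)


# Hypothetical input batch for one source and one business date.
rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]

# Append-only target: the re-run doubles the data.
appended = []
append_load(appended, rows)
append_load(appended, rows)  # the 3 AM re-run

# Partitioned target: the re-run leaves the state unchanged.
partitioned = {}
key = ("orders_api", "2024-01-01")
overwrite_load(partitioned, key, rows)
after_first = {k: list(v) for k, v in partitioned.items()}
overwrite_load(partitioned, key, rows)  # the same re-run

print(len(appended))               # 4
print(partitioned == after_first)  # True
```

The append-only target ends up with four rows from a two-row input. The partitioned target is identical after one run and after two, which is the whole definition.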

Three patterns that get you most of the way

  1. Deterministic partition keys. Every row gets a (source, business_date) key. Re-runs overwrite the partition, never append.
  2. Stage-then-merge. Write to a staging.{table}__{run_id} table first. Merge into the final table in a single transaction.
  3. Watermarks, not "latest." Don't ingest "everything up to now." Ingest "everything where updated_at is between X and Y." X and Y are parameters of the run, not implicit globals such as the current time.
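The three patterns compose. Below is a rough sketch using Python's built-in `sqlite3` module, not a production implementation. The `orders` table, its columns, the `run_pipeline` function, and the sample rows are all assumptions for illustration. SQLite has no `staging` schema without attaching a second database, so the staging table is named `staging_orders__{run_id}` instead. The sketch also assumes the watermark window lines up with one `business_date` partition.

```python
import sqlite3


def run_pipeline(conn, source_rows, source, business_date,
                 window_start, window_end, run_id):
    """One run for one (source, business_date) partition. Hypothetical sketch."""
    # run_id ends up in a table name, so only accept simple identifiers.
    if not run_id.replace("_", "").isalnum():
        raise ValueError("run_id must contain only letters, digits, underscores")
    staging = f"staging_orders__{run_id}"

    # Pattern 3: the window [window_start, window_end) is an explicit
    # parameter. Nothing in here reads the current time.
    batch = [r for r in source_rows
             if window_start <= r["updated_at"] < window_end]

    # Pattern 2, part one: write the batch to a per-run staging table.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} "
                 "(source TEXT, business_date TEXT, order_id INTEGER, amount REAL)")
    conn.executemany(
        f"INSERT INTO {staging} VALUES (?, ?, ?, ?)",
        [(source, business_date, r["order_id"], r["amount"]) for r in batch],
    )
    conn.commit()

    # Patterns 1 and 2, part two: replace the whole partition in a single
    # transaction. The `with` block commits on success and rolls back on error.
    with conn:
        conn.execute(
            "DELETE FROM orders WHERE source = ? AND business_date = ?",
            (source, business_date),
        )
        conn.execute(f"INSERT INTO orders SELECT * FROM {staging}")

    conn.execute(f"DROP TABLE {staging}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(source TEXT, business_date TEXT, order_id INTEGER, amount REAL)")

# Made-up source data; the third row falls outside the window.
source_rows = [
    {"order_id": 1, "amount": 10.0, "updated_at": "2024-01-01T03:00:00"},
    {"order_id": 2, "amount": 25.0, "updated_at": "2024-01-01T17:30:00"},
    {"order_id": 3, "amount": 40.0, "updated_at": "2024-01-02T09:00:00"},
]
window = ("2024-01-01T00:00:00", "2024-01-02T00:00:00")

run_pipeline(conn, source_rows, "orders_api", "2024-01-01", *window, "run_a")
first = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

run_pipeline(conn, source_rows, "orders_api", "2024-01-01", *window, "run_b")
second = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

print(first == second)  # True
```

The delete-and-insert inside one transaction is what makes the re-run safe: readers see either the old partition or the new one, never both and never neither. Engines with a native `MERGE` or partition-overwrite statement would replace that step, but the shape stays the same.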

What this buys you

  • Backfills become normal operations, not heroic ones.
  • Late-arriving data has a defined recovery path.
  • The pipeline can be paused, resumed, or rewound without surgery.
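Once the window is a parameter, a backfill is just a loop over windows. Here is a small sketch of that idea. The `daily_windows` helper is hypothetical, and daily granularity with half-open windows is an assumption, not a requirement.

```python
from datetime import date, datetime, time, timedelta


def daily_windows(first_day: date, last_day: date):
    """Yield (business_date, window_start, window_end) for each day, inclusive.

    Windows are half-open, [window_start, window_end), so consecutive days
    neither overlap nor leave gaps.
    """
    day = first_day
    while day <= last_day:
        start = datetime.combine(day, time.min)
        yield day, start, start + timedelta(days=1)
        day += timedelta(days=1)


windows = list(daily_windows(date(2024, 1, 1), date(2024, 1, 3)))
for business_date, start, end in windows:
    # A real backfill would call the pipeline here with these parameters.
    print(business_date, start.isoformat(), end.isoformat())
```

Because each iteration overwrites its own partition, the loop can be stopped halfway and restarted from the beginning, or from any day, without leaving duplicates behind.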

If a pipeline can't be re-run safely, it's not a pipeline. It's a one-shot script with delusions of grandeur.

We'll come back to merge patterns in dbt and Iceberg in a later post.