Building Idempotent Feature Pipelines in Databricks

May 10, 2026

Feature pipelines are where most ML projects silently fail. Not at training time — at rerun time. A pipeline that works once but produces different results on a second run is dangerous because you won't know it happened until your model starts drifting.

This post documents the snapshot-based idempotency pattern I ended up using for production feature pipelines on Databricks. I'll describe the problem, the naive approach that doesn't work, and the version that does.

The problem: reruns that corrupt

Most feature engineering pipelines look like this:

  1. Read raw events from source tables
  2. Apply transformations (aggregations, joins, normalizations)
  3. Write the result to a feature table
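To make that shape concrete, here's a minimal PySpark sketch of the naive version (the table names and the aggregation are placeholders, not real feature logic):

from pyspark.sql import functions as F

# `spark` is predefined in Databricks notebooks
features = (
    spark.table("catalog.schema.raw_events")     # placeholder source table
    .groupBy("entity_id")
    .agg(F.count("*").alias("event_count"))      # stand-in aggregation
)

features.write.format("delta").mode("append").saveAsTable("catalog.schema.features")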

Simple. But what happens on a rerun?

If you append to the feature table without checking what's already there, you get duplicates. If you overwrite the whole table, you lose historical snapshots that training jobs might be in the middle of reading. If you use MERGE INTO without careful key design, you get partial updates that look correct but aren't.

The real issue is that these pipelines are not idempotent: running them twice with the same input produces a different output than running them once.

The snapshot pattern

The fix is to make snapshot identity explicit. Every run of the pipeline corresponds to exactly one snapshot, identified by a snapshot_date (or run_id if you're working with non-time-based cuts). The contract becomes:

For a given snapshot_date, the pipeline produces exactly the same rows every time it runs.

This means:

  1. Delete before write: before writing features for a given snapshot_date, delete all existing rows with that key.
  2. Write atomically: use a single INSERT INTO (or df.write.mode("append")) after the delete. Each operation is atomic on its own thanks to Delta's ACID guarantees, and snapshot isolation covers readers during the brief window between them (more on this below).
  3. Never update in place: if you need to fix a historical snapshot, you re-run for that date, which triggers the delete-then-write pattern.

In PySpark on Databricks:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

SNAPSHOT_DATE = "2026-04-01"
FEATURE_TABLE = "catalog.schema.features"

# 1. Delete existing snapshot (a no-op on the first run for this date)
delta_table = DeltaTable.forName(spark, FEATURE_TABLE)
delta_table.delete(f"snapshot_date = '{SNAPSHOT_DATE}'")

# 2. Compute features
features_df = (
    spark.table("catalog.schema.raw_events")
    .filter(f"event_date < '{SNAPSHOT_DATE}'")
    .groupBy("entity_id")
    .agg(...)  # your aggregation expressions
    .withColumn("snapshot_date", lit(SNAPSHOT_DATE))
)

# 3. Append (Delta's ACID guarantees the write is atomic)
features_df.write.format("delta").mode("append").saveAsTable(FEATURE_TABLE)
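As an aside, Delta can also collapse the delete and the append into a single commit with a selective overwrite via replaceWhere, which removes the brief window between the two operations. A sketch, using the same variables as above:

# Atomic alternative: one commit that replaces only the matching rows
(
    features_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"snapshot_date = '{SNAPSHOT_DATE}'")
    .saveAsTable(FEATURE_TABLE)
)

Delete-then-append keeps the two steps visible and easy to reason about after an incident; replaceWhere trades that visibility for atomicity. Either satisfies the snapshot contract.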

Why not MERGE?

MERGE INTO (upsert) is tempting because it handles the "insert if not exists, update if exists" case in one operation. The problem is that it requires a key that uniquely identifies a row, and in feature tables that key is usually (entity_id, snapshot_date). If a bug in your pipeline generates duplicate entity_id rows for a snapshot, MERGE fails you quietly or loudly depending on the path: duplicates that match an existing target row abort with a multiple-source-rows error, while duplicates that match nothing are silently inserted.
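For reference, the MERGE variant argued against here would look roughly like this (same table and keys as the example above):

# Upsert on (entity_id, snapshot_date)
(
    delta_table.alias("t")
    .merge(
        features_df.alias("s"),
        "t.entity_id = s.entity_id AND t.snapshot_date = s.snapshot_date",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)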

Delete-then-append is more explicit. Duplicates within a snapshot land in the table, where a simple count check catches them immediately instead of being masked by merge semantics. The invariant is easy to test: COUNT(*) WHERE snapshot_date = X should always equal the number of entities in that snapshot.
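A minimal version of that check, assuming entity_id is the only key within a snapshot:

# Invariant: exactly one row per entity per snapshot
snapshot_rows = spark.table(FEATURE_TABLE).filter(f"snapshot_date = '{SNAPSHOT_DATE}'")
total_rows = snapshot_rows.count()
distinct_entities = snapshot_rows.select("entity_id").distinct().count()
assert total_rows == distinct_entities, f"duplicate entities in snapshot {SNAPSHOT_DATE}"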

Partitioning by snapshot_date

For this pattern to be efficient, the feature table must be partitioned by snapshot_date. Without partitioning, the DELETE scans the entire table. With partitioning, it's a metadata-only operation on Delta Lake.

features_df.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("snapshot_date") \
    .saveAsTable(FEATURE_TABLE)

Add this when the table is created. If the table already exists without partitioning, you'll need to rebuild it: there is no in-place repartitioning for an existing Delta table. One way to do the rebuild is sketched below.
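A rebuild sketch, assuming managed tables that support RENAME and a maintenance window with no active writers (table names are placeholders):

# Rebuild into a partitioned copy, then swap names
spark.sql("""
    CREATE TABLE catalog.schema.features_partitioned
    USING DELTA
    PARTITIONED BY (snapshot_date)
    AS SELECT * FROM catalog.schema.features
""")
spark.sql("ALTER TABLE catalog.schema.features RENAME TO catalog.schema.features_old")
spark.sql("ALTER TABLE catalog.schema.features_partitioned RENAME TO catalog.schema.features")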

Making reruns safe for downstream training jobs

The last concern is concurrent reads. If a training job is reading snapshot_date = '2026-04-01' while you're running a delete-then-write for the same date, does it see a consistent view?

Yes — Delta Lake's MVCC (multi-version concurrency control) guarantees snapshot isolation. The training job reads from the version of the table at query start time. Your delete and rewrite create a new version; the training job's read is unaffected.

This is the main reason to use Delta Lake over plain Parquet for feature tables: you get ACID transactions and time-travel for free, which makes the delete-then-write pattern safe.
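Time travel also lets a training job pin its reads explicitly when it needs a stable view across multiple queries, not just within one. A sketch (the version number is illustrative):

# Pin reads to a specific table version rather than "latest"
pinned = spark.read.option("versionAsOf", 42).table(FEATURE_TABLE)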

Summary

  • Make snapshot_date (or equivalent) an explicit first-class column, not an implicit property of the run
  • Partition by snapshot_date for efficient deletes
  • Use delete-then-append instead of MERGE or blind overwrites
  • Delta Lake's snapshot isolation makes this safe for concurrent readers

This pattern adds a small amount of write overhead (the delete step) but eliminates an entire class of correctness bugs. For production feature pipelines where training jobs rely on historical snapshots, that trade-off is non-negotiable.