Data Pipelines
A data pipeline is the sequence of processes that transforms raw data into a form suitable for model training or inference. Reliable, scalable data pipelines are a prerequisite for production ML systems.
Pipeline Stages
Raw data sources
→ Ingestion
→ Validation
→ Transformation / Preprocessing
→ Splitting (train / val / test)
→ Feature computation
→ Storage (feature store / artifact store)
→ Consumption by training job
Data Ingestion
Batch ingestion: load data periodically (hourly, daily). Simple; data is not immediately fresh.
Streaming ingestion: process events as they arrive (Kafka, Kinesis, Pub/Sub). Near-real-time; more complex.
Change Data Capture (CDC): track row-level changes in a source database; emit them as a stream to the pipeline.
Common sources: databases (PostgreSQL, MySQL), object storage (S3, GCS), data warehouses (BigQuery, Snowflake), external APIs, web scrapes.
Data Validation
Catch data quality issues before they propagate to model training.
Schema validation: check that columns have the expected types, nullability, and value ranges.
Distribution checks: compare feature statistics to a reference baseline. Flag significant deviations:
- Mean / standard deviation shift.
- Null rate change.
- Categorical cardinality change.
Great Expectations: define data quality assertions as “expectations”; generate an HTML report of pass/fail. Widely used in production.
TFX Data Validation (TFDV): compute schema and statistics from training data; validate serving data against the schema.
Data Transformation
Normalization and standardization: see preprocessing for specific techniques.
Encoding: encode categorical variables (one-hot, label encoding, target encoding, embedding).
Imputation: handle missing values (mean fill, forward fill, model-based imputation).
Aggregation: roll up raw events to session-level, user-level, or time-window features.
Join enrichment: join raw events with reference tables (user profiles, item metadata) to add contextual features.
ETL vs. ELT
ETL (Extract-Transform-Load): transform data before loading into the target store. Traditional; transformation happens outside the data warehouse.
ELT (Extract-Load-Transform): load raw data into a data warehouse; transform using SQL. Modern approach; leverages the warehouse’s compute (dbt + BigQuery/Snowflake).
dbt (Data Build Tool)
SQL-based transformation framework. Transforms are modular SQL models with dependencies resolved as a DAG. Includes testing, documentation generation, and lineage visualization. Standard for analytics and feature engineering in data warehouses.
Streaming Pipelines
Apache Kafka: distributed message queue. Topics store ordered event streams. Consumers pull messages and process them.
Apache Flink / Spark Streaming: stateful stream processing with exactly-once semantics. Used for real-time feature computation (session aggregates, fraud signals).
Windowing: aggregate events over time windows (tumbling: non-overlapping fixed windows; sliding: overlapping; session: activity-gap-based).
Feature Computation
Point-in-time correctness: when computing features for training, only use information available before the label’s timestamp. Avoid data leakage.
Time-travel features: for each training example $(x, y, t)$, compute features using only data before time $t$. Standard in fraud detection, forecasting.
Backfilling: compute historical features for training data. Must exactly replicate the production feature computation logic.
Data Lineage
Track how each dataset was produced: what source data, what transformation code, what version.
Benefits: reproducibility, debugging, compliance (GDPR data deletion propagation), impact analysis (which models depend on this dataset?).
Tools: Apache Atlas, Marquez, OpenLineage, DataHub.
Pipeline Orchestration
Schedule and monitor pipeline execution.
Apache Airflow: Python-based DAG scheduler. Sensors, operators, XCom for data passing. Widely adopted.
Prefect / Dagster: modern alternatives with better Python-native APIs, dynamic workflows, and improved observability.
Handling failures: retry policies, alerting on failure, partial re-runs (only re-run failed tasks, not the whole pipeline).