ADR-0011: Data Quality Monitoring

Status

Accepted

Context

Data quality issues can significantly impact ML model performance. Several complex data problems needed to be addressed:

  • Data consistency: New training datasets can differ significantly from previous datasets, potentially requiring changes in model architecture.

  • Upstream pipeline bugs: Bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.

  • Training/serving skew: Distribution shift between training and serving data can decrease model performance.

Feast needed a mechanism to validate data at retrieval time to catch these issues before they affect model training or serving.

Decision

Introduce a Data Quality Monitoring (DQM) module that validates datasets against user-curated rules, initially targeting historical retrieval (training dataset generation).

Design

The validation process uses a reference dataset and a profiler pattern:

  1. User prepares a reference dataset (saved from a known-good historical retrieval).

  2. User defines a profiler function that produces a profile (set of expectations) from a dataset.

  3. Validation is performed by comparing the tested dataset against the reference profile.
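The three steps above can be sketched in plain Python. This is an illustration of the pattern only, not Feast's or Great Expectations' API: the names Expectation, range_profiler, and validate are hypothetical, and a "profile" is simplified here to a list of min/max range rules derived from the reference dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]
Dataset = List[Row]


@dataclass(frozen=True)
class Expectation:
    """Hypothetical rule: every value in `column` must lie within [low, high]."""
    column: str
    low: float
    high: float

    def failures(self, dataset: Dataset) -> List[float]:
        # Return the offending values; an empty list means the rule passed.
        return [row[self.column] for row in dataset
                if not (self.low <= row[self.column] <= self.high)]


Profile = List[Expectation]
Profiler = Callable[[Dataset], Profile]


def range_profiler(dataset: Dataset) -> Profile:
    """Step 2: derive a profile (here, the observed min/max of each column)."""
    columns = dataset[0].keys()
    return [
        Expectation(col,
                    min(row[col] for row in dataset),
                    max(row[col] for row in dataset))
        for col in columns
    ]


def validate(tested: Dataset, reference: Dataset, profiler: Profiler) -> List[Expectation]:
    """Step 3: profile the reference, then return the expectations `tested` violates."""
    reference_profile = profiler(reference)
    return [exp for exp in reference_profile if exp.failures(tested)]


# Step 1: a known-good reference dataset (made-up values for illustration).
reference = [{"trips": 10.0, "rating": 4.5}, {"trips": 40.0, "rating": 3.9}]
tested = [{"trips": 25.0, "rating": 4.1}, {"trips": 900.0, "rating": 4.0}]

failed = validate(tested, reference, range_profiler)
failed_columns = [exp.column for exp in failed]
```

In this sketch the profiler is the only place where rules are defined, so swapping in a different profiler changes what "valid" means without touching the validation step.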

Integration with Great Expectations

The initial implementation uses Great Expectations as the validation engine.

Usage

Validation is triggered during historical feature retrieval by passing a validation_reference parameter when the dataset is materialized.

If validation fails, a ValidationFailed exception is raised with details for all expectations that didn't pass. If validation succeeds, the materialized dataset is returned normally.
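The failure behavior described above (validate at materialization, report every failed expectation at once) can be sketched as follows. This is a minimal stand-in under stated assumptions, not Feast's implementation: RetrievalJob, to_rows, and the check functions are hypothetical, and the validation reference is simplified to a list of callables that return an error message or None.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]
# A check returns an error message, or None when the dataset passes.
Check = Callable[[List[Row]], Optional[str]]


class ValidationFailed(Exception):
    """Carries every failed check, not just the first one."""

    def __init__(self, errors: List[str]) -> None:
        super().__init__(f"{len(errors)} expectation(s) failed: " + "; ".join(errors))
        self.errors = errors


class RetrievalJob:
    """Hypothetical stand-in for a lazily evaluated retrieval job."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def to_rows(self, validation_reference: Optional[List[Check]] = None) -> List[Row]:
        rows = list(self._rows)  # materialization happens here
        if validation_reference is not None:
            # Run all checks before deciding, so the error lists every failure.
            results = [check(rows) for check in validation_reference]
            errors = [msg for msg in results if msg is not None]
            if errors:
                raise ValidationFailed(errors)
        return rows  # validation passed (or was not requested)


def no_negative_trips(rows: List[Row]) -> Optional[str]:
    bad = [r["trips"] for r in rows if r["trips"] < 0]
    return f"trips has negative values: {bad}" if bad else None


def rating_in_range(rows: List[Row]) -> Optional[str]:
    bad = [r["rating"] for r in rows if not 0 <= r["rating"] <= 5]
    return f"rating outside [0, 5]: {bad}" if bad else None


job = RetrievalJob([{"trips": -3.0, "rating": 7.2}, {"trips": 12.0, "rating": 4.4}])
try:
    job.to_rows(validation_reference=[no_negative_trips, rating_in_range])
    caught: List[str] = []
except ValidationFailed as exc:
    caught = exc.errors
```

Collecting all failures before raising means a single retrieval attempt surfaces every violated rule, rather than one per run.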

Key Decisions

  • Profiler-based approach: Users define their own validation rules via profiler functions rather than Feast prescribing fixed validation rules.

  • Great Expectations integration: Leverages an established data validation framework rather than building custom validation logic.

  • Validation at retrieval time: Validation is performed when datasets are materialized (.to_df() or .to_arrow()), not during ingestion.

  • ValidationReference as a registry object: Saved datasets and their validation references are stored in the Feast registry for reuse.

Consequences

Positive

  • Users can detect data quality issues before they affect model training.

  • Flexible profiler pattern allows custom validation rules per use case.

  • Integration with Great Expectations provides a rich set of built-in expectations.

  • Reference datasets provide a baseline for detecting data drift.

Negative

  • Currently limited to historical retrieval; online store write/read validation is planned but not yet implemented.

  • The dependency on Great Expectations adds to the install footprint, although it is optional and installed only via the feast[ge] extra.

  • Automatic profiling capabilities are limited; manual expectation crafting is recommended.

References

  • Original RFC: Feast RFC-027: Data Quality Monitoring

  • Implementation: sdk/python/feast/dqm/, sdk/python/feast/saved_dataset.py
