ADR-0011: Data Quality Monitoring
Status
Accepted
Context
Data quality issues can significantly impact ML model performance. Several complex data problems needed to be addressed:
Data consistency: New training datasets can differ significantly from previous datasets, potentially requiring changes in model architecture.
Upstream pipeline bugs: Bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.
Training/serving skew: Distribution shift between training and serving data can decrease model performance.
Feast needed a mechanism to validate data at retrieval time to catch these issues before they affect model training or serving.
Decision
Introduce a Data Quality Monitoring (DQM) module that validates datasets against user-curated rules, initially targeting historical retrieval (training dataset generation).
Design
The validation process uses a reference dataset and a profiler pattern:
User prepares a reference dataset (saved from a known-good historical retrieval).
User defines a profiler function that produces a profile (set of expectations) from a dataset.
Validation is performed by comparing the tested dataset against the reference profile.
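The three steps above can be sketched in plain Python. This is only an illustration of the reference-plus-profiler pattern, not Feast's implementation: `Expectation`, `range_profiler`, and `validate` are hypothetical names, and datasets are modeled as lists of dicts to keep the example free of dependencies.

```python
from typing import Callable, Dict, List, NamedTuple

Row = Dict[str, float]
Dataset = List[Row]


class Expectation(NamedTuple):
    """One rule in a profile: a column's values must lie in [low, high]."""
    column: str
    low: float
    high: float


# A profiler is any function that turns a dataset into a list of expectations.
Profiler = Callable[[Dataset], List[Expectation]]


def range_profiler(dataset: Dataset) -> List[Expectation]:
    """Hypothetical profiler: expect each column to stay within its observed range."""
    columns = sorted({name for row in dataset for name in row})
    profile = []
    for column in columns:
        values = [row[column] for row in dataset if column in row]
        profile.append(Expectation(column, min(values), max(values)))
    return profile


def validate(tested: Dataset, reference: Dataset, profiler: Profiler) -> List[str]:
    """Profile the reference dataset, then check the tested dataset against it.

    Returns one message per failed expectation; an empty list means success.
    """
    failures = []
    for exp in profiler(reference):
        bad = [row[exp.column] for row in tested
               if exp.column in row and not exp.low <= row[exp.column] <= exp.high]
        if bad:
            failures.append(
                f"{exp.column}: {len(bad)} value(s) outside [{exp.low}, {exp.high}]"
            )
    return failures


# Step 1: a known-good reference dataset.
reference = [{"trip_km": 1.0, "fare": 5.0}, {"trip_km": 12.0, "fare": 40.0}]
# Step 3: a new dataset containing one out-of-range value (e.g. an upstream bug).
tested = [{"trip_km": 3.0, "fare": 9.0}, {"trip_km": -4.0, "fare": 12.0}]

failures = validate(tested, reference, range_profiler)
print(failures)
```

Because the profiler is an ordinary function, swapping in stricter or looser rules only means passing a different callable; the validation step itself does not change.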
Integration with Great Expectations
The initial implementation uses Great Expectations as the validation engine: a profiler expresses its profile as a set of Great Expectations expectations.
Usage
Validation is triggered during historical feature retrieval by passing a validation_reference parameter when the dataset is materialized.
If validation fails, a ValidationFailed exception is raised with details for all expectations that didn't pass. If validation succeeds, the materialized dataset is returned normally.
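The retrieval-time behavior described above can be sketched as follows. `RetrievalJob`, `to_rows`, and the dict-of-checks reference are hypothetical stand-ins, not Feast classes; only the `validation_reference` parameter name and the `ValidationFailed` exception name come from the text. The sketch assumes every expectation is evaluated before the exception is raised, so the caller sees all failures at once.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]
# Hypothetical check: takes the materialized rows, returns True when they pass.
Check = Callable[[List[Row]], bool]


class ValidationFailed(Exception):
    """Raised when one or more expectations fail; carries every failure."""

    def __init__(self, failed: List[str]):
        super().__init__(f"{len(failed)} expectation(s) failed: " + "; ".join(failed))
        self.failed = failed


class RetrievalJob:
    """Hypothetical lazy retrieval job: nothing is loaded until to_rows() is called."""

    def __init__(self, load: Callable[[], List[Row]]):
        self._load = load

    def to_rows(
        self, validation_reference: Optional[Dict[str, Check]] = None
    ) -> List[Row]:
        rows = self._load()  # materialize the dataset
        if validation_reference is not None:
            # Evaluate every check; do not stop at the first failure.
            failed = [name for name, check in validation_reference.items()
                      if not check(rows)]
            if failed:
                raise ValidationFailed(failed)
        return rows  # validation passed, or was not requested


job = RetrievalJob(lambda: [{"fare": 9.0}, {"fare": -2.0}])
reference = {
    "fare is non-negative": lambda rows: all(r["fare"] >= 0 for r in rows),
    "at least one row": lambda rows: len(rows) > 0,
    "fare below 5": lambda rows: all(r["fare"] < 5 for r in rows),
}

try:
    job.to_rows(validation_reference=reference)
except ValidationFailed as exc:
    print(exc.failed)  # names of all failed checks, in definition order
```

Calling `job.to_rows()` without a reference returns the rows unvalidated, which mirrors validation being opt-in at retrieval time.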
Key Decisions
Profiler-based approach: Users define their own validation rules via profiler functions rather than Feast prescribing fixed validation rules.
Great Expectations integration: Leverages an established data validation framework rather than building custom validation logic.
Validation at retrieval time: Validation is performed when datasets are materialized (.to_df() or .to_arrow()), not during ingestion.
ValidationReference as a registry object: Saved datasets and their validation references are stored in the Feast registry for reuse.
Consequences
Positive
Users can detect data quality issues before they affect model training.
Flexible profiler pattern allows custom validation rules per use case.
Integration with Great Expectations provides a rich set of built-in expectations.
Reference datasets provide a baseline for detecting data drift.
Negative
Currently limited to historical retrieval; online store write/read validation is planned but not yet implemented.
Dependency on Great Expectations adds to the install footprint (optional via feast[ge]).
Automatic profiling capabilities are limited; manual expectation crafting is recommended.
References
Original RFC: Feast RFC-027: Data Quality Monitoring
Implementation: sdk/python/feast/dqm/, sdk/python/feast/saved_dataset.py
Documentation: Data Quality Monitoring