[Alpha] Data quality monitoring
Data Quality Monitoring (DQM) is a Feast module that helps users validate their data against a user-curated set of rules. Validation can be applied during:
Historical retrieval (training dataset generation)
[planned] Writing features into an online store
[planned] Reading features from an online store
Its goal is to address several complex data problems, namely:
Data consistency - new training datasets can be significantly different from previous datasets. This might require a change in model architecture.
Issues/bugs in the upstream pipeline - bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.
Training/serving skew - distribution shift could significantly decrease the performance of the model.
To monitor data quality, we check that the characteristics of the tested dataset (aka the tested dataset's profile) are "equivalent" to the characteristics of the reference dataset. How exactly profile equivalency should be measured is up to the user.
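To make the idea of profile equivalence concrete, here is a minimal sketch in plain Python. It does not use Feast or Great Expectations: profile, within_reference_range, and mean_within_tolerance are hypothetical names chosen for illustration, and the two checks are only examples of measures a user might pick.

```python
from statistics import mean


def profile(values):
    """Summarize a numeric column as a small dict of statistics (its 'profile')."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def within_reference_range(tested, reference):
    """Strict measure: tested values must stay inside the reference min/max."""
    return reference["min"] <= tested["min"] and tested["max"] <= reference["max"]


def mean_within_tolerance(tested, reference, rel_tol=0.1):
    """Looser measure: the tested mean may drift by at most rel_tol (relative)."""
    return abs(tested["mean"] - reference["mean"]) <= rel_tol * abs(reference["mean"])


# Illustrative values only; they do not come from a real feature view.
reference_profile = profile([0.10, 0.20, 0.30, 0.40])
tested_profile = profile([0.05, 0.25, 0.30, 0.42])

range_ok = within_reference_range(tested_profile, reference_profile)
mean_ok = mean_within_tolerance(tested_profile, reference_profile)
```

Here the same tested dataset fails the strict range check but passes the mean tolerance check, which illustrates why the choice of measure is left to the user.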
Overview
The validation process consists of the following steps:
The user prepares a reference dataset (currently, only saved datasets from historical retrieval are supported).
The user defines a profiler function, which produces a profile for a given dataset (currently, only profilers based on Great Expectations are allowed).
Validation of the tested dataset is performed with the reference dataset and the profiler provided as parameters.
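The three steps above can be sketched in plain Python as follows. This is a stand-alone illustration under simplifying assumptions, not the Feast API: range_profiler and validate are hypothetical helpers, datasets are modeled as lists of dicts, and the column names are made up for the example.

```python
def range_profiler(rows):
    """Hypothetical profiler: derive one (min, max) rule per column from a dataset."""
    columns = rows[0].keys()
    return {
        col: (min(r[col] for r in rows), max(r[col] for r in rows))
        for col in columns
    }


def validate(tested_rows, reference_rows, profiler):
    """Profile the reference dataset, then check every tested row against the rules."""
    rules = profiler(reference_rows)  # step 2 applied to the dataset from step 1
    failures = []
    for index, row in enumerate(tested_rows):  # step 3
        for col, (low, high) in rules.items():
            if not low <= row[col] <= high:
                failures.append((index, col, row[col]))
    return failures


# Step 1: a reference dataset (a stand-in for a saved dataset).
reference = [
    {"conv_rate": 0.2, "avg_daily_trips": 10},
    {"conv_rate": 0.8, "avg_daily_trips": 40},
]
tested = [
    {"conv_rate": 0.5, "avg_daily_trips": 25},
    {"conv_rate": 1.3, "avg_daily_trips": 30},
]

failures = validate(tested, reference, range_profiler)
```

Passing the profiler as a parameter keeps the validation routine generic, so swapping in a different set of rules does not require changing the validation code.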
Preparations
Feast with Great Expectations support can be installed with pip via the ge extra: pip install 'feast[ge]'
Dataset profile
Currently, Feast supports only the Great Expectations ExpectationSuite as a dataset's profile. Hence, the user needs to define a function (a profiler) that receives a dataset and returns an ExpectationSuite.
Great Expectations supports automatic profiling as well as manually specified expectations.
However, in our experience, the capabilities of the automatic profiler are quite limited, so we recommend crafting your own expectations.
Validating Training Dataset
During retrieval of historical features, validation_reference can be passed as a parameter to the .to_df(validation_reference=...) or .to_arrow(validation_reference=...) methods of RetrievalJob. If the parameter is provided, Feast runs validation once the dataset is materialized. If validation succeeds, the materialized dataset is returned. Otherwise, a feast.dqm.errors.ValidationFailed exception is raised, carrying the details of every expectation that did not pass.
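The return-or-raise behavior described above can be illustrated with a small plain-Python sketch. It deliberately avoids importing Feast: ValidationFailedSketch and materialize_with_validation are hypothetical stand-ins, and the failed_expectations attribute is an assumption made for this example rather than the layout of the real feast.dqm.errors.ValidationFailed.

```python
class ValidationFailedSketch(Exception):
    """Hypothetical stand-in for an error that carries failed-expectation details."""

    def __init__(self, failed_expectations):
        super().__init__(f"{len(failed_expectations)} expectation(s) failed")
        self.failed_expectations = failed_expectations


def materialize_with_validation(rows, expectations=None):
    """Return rows unchanged, validating them first when expectations are given."""
    if expectations is None:
        return rows  # no reference passed, so validation is skipped
    failed = [name for name, check in expectations.items() if not check(rows)]
    if failed:
        raise ValidationFailedSketch(failed)
    return rows


# Illustrative data and rules only.
rows = [{"conv_rate": 0.4}, {"conv_rate": 1.7}]
expectations = {
    "conv_rate_max_at_most_1": lambda data: max(r["conv_rate"] for r in data) <= 1.0,
    "conv_rate_min_at_least_0": lambda data: min(r["conv_rate"] for r in data) >= 0.0,
}

try:
    dataset = materialize_with_validation(rows, expectations)
except ValidationFailedSketch as error:
    dataset = None
    failed_names = error.failed_expectations
```

Catching the exception at the call site lets a training pipeline stop, or log the failed expectations, before a suspicious dataset reaches the model.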