# \[Alpha] Data quality monitoring

Data Quality Monitoring (DQM) is a Feast module aimed to help users to validate their data with the user-curated set of rules.\
Validation could be applied during:

* Historical retrieval (training dataset generation)
* \[planned] Writing features into an online store
* \[planned] Reading features from an online store

Its goal is to address several complex data problems, namely:

* Data consistency - new training datasets can be significantly different from previous datasets. This might require a change in model architecture.
* Issues/bugs in the upstream pipeline - bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.
* Training/serving skew - distribution shift could significantly decrease the performance of the model.

> To monitor data quality, we check that the characteristics of the tested dataset (aka the tested dataset's profile) are "equivalent" to the characteristics of the reference dataset.\
> How exactly profile equivalency should be measured is up to the user.

### Overview

The validation process consists of the following steps:

1. User prepares reference dataset (currently only [saved datasets](https://docs.feast.dev/v0.48-branch/getting-started/concepts/dataset) from historical retrieval are supported).
2. User defines profiler function, which should produce profile by given dataset (currently only profilers based on [Great Expectations](https://docs.greatexpectations.io) are allowed).
3. Validation of tested dataset is performed with reference dataset and profiler provided as parameters.

### Preparations

Feast with Great Expectations support can be installed via

```shell
pip install 'feast[ge]'
```

### Dataset profile

Currently, Feast supports only [Great Expectation's](https://greatexpectations.io/) [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite)\
as dataset's profile. Hence, the user needs to define a function (profiler) that would receive a dataset and return an [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite).

Great Expectations supports automatic profiling as well as manually specifying expectations:

```python
from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite

from feast.dqm.profilers.ge_profiler import ge_profiler

@ge_profiler
def automatic_profiler(dataset: Dataset) -> ExpectationSuite:
    from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

    return UserConfigurableProfiler(
        profile_dataset=dataset,
        ignored_columns=['conv_rate'],
        value_set_threshold='few'
    ).build_suite()
```

However, from our experience capabilities of automatic profiler are quite limited. So we would recommend crafting your own expectations:

```python
@ge_profiler
def manual_profiler(dataset: Dataset) -> ExpectationSuite:
    dataset.expect_column_max_to_be_between("column", 1, 2)
    return dataset.get_expectation_suite()
```

### Validating Training Dataset

During retrieval of historical features, `validation_reference` can be passed as a parameter to methods `.to_df(validation_reference=...)` or `.to_arrow(validation_reference=...)` of RetrievalJob.\
If parameter is provided Feast will run validation once dataset is materialized. In case if validation successful materialized dataset is returned.\
Otherwise, `feast.dqm.errors.ValidationFailed` exception would be raised. It will consist of all details for expectations that didn't pass.

```python
from feast import FeatureStore

fs = FeatureStore(".")

job = fs.get_historical_features(...)
job.to_df(
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=manual_profiler)
)
```
