[Alpha] Data quality monitoring

Data Quality Monitoring (DQM) is a Feast module that helps users validate their data against a user-curated set of rules. Validation can be applied during:

  • Historical retrieval (training dataset generation)

  • [planned] Writing features into an online store

  • [planned] Reading features from an online store

Its goal is to address several complex data problems, namely:

  • Data consistency - new training datasets can be significantly different from previous datasets. This might require a change in model architecture.

  • Issues/bugs in the upstream pipeline - bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.

  • Training/serving skew - distribution shift could significantly decrease the performance of the model.

To monitor data quality, we check that the characteristics of the tested dataset (aka the tested dataset's profile) are "equivalent" to the characteristics of the reference dataset. How exactly profile equivalency should be measured is up to the user.

Overview

The validation process consists of the following steps:

  1. User prepares a reference dataset (currently only saved datasets produced by historical retrieval are supported); a sketch of this step follows the list.

  2. User defines a profiler function, which produces a profile from a given dataset (currently only profilers based on Great Expectations are allowed).

  3. Validation of the tested dataset is performed with the reference dataset and profiler provided as parameters.
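
For step 1, a reference dataset can be persisted from a historical retrieval job as a saved dataset. Below is a minimal sketch, assuming a file-based offline store and a feature repository that defines a driver_hourly_stats feature view; the entity dataframe, the feature names, and the name my_reference_dataset are illustrative:

import pandas as pd

from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

fs = FeatureStore(".")

# Illustrative entity dataframe for the historical retrieval
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

job = fs.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)

# Persist the retrieved data as a saved dataset so it can later serve as a validation reference
fs.create_saved_dataset(
    from_=job,
    name="my_reference_dataset",
    storage=SavedDatasetFileStorage(path="my_reference_dataset.parquet"),
)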

Preparations

Feast with Great Expectations support can be installed via:

pip install 'feast[ge]'

Dataset profile

Great Expectations supports automatic profiling as well as manually specifying expectations:

from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite

from feast.dqm.profilers.ge_profiler import ge_profiler

@ge_profiler
def automatic_profiler(dataset: Dataset) -> ExpectationSuite:
    from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

    return UserConfigurableProfiler(
        profile_dataset=dataset,
        ignored_columns=['conv_rate'],
        value_set_threshold='few'
    ).build_suite()

However, in our experience the capabilities of the automatic profiler are quite limited, so we recommend crafting your own expectations:

@ge_profiler
def manual_profiler(dataset: Dataset) -> ExpectationSuite:
    dataset.expect_column_max_to_be_between("column", 1, 2)
    return dataset.get_expectation_suite()

Validating Training Dataset

During retrieval of historical features, validation_reference can be passed as a parameter to the .to_df(validation_reference=...) or .to_arrow(validation_reference=...) methods of RetrievalJob. If the parameter is provided, Feast will run validation once the dataset is materialized. If validation succeeds, the materialized dataset is returned. Otherwise, a feast.dqm.errors.ValidationFailed exception is raised, containing the details of all expectations that did not pass.

from feast import FeatureStore

fs = FeatureStore(".")

job = fs.get_historical_features(...)
job.to_df(
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=manual_profiler)
)
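
If any expectation in the suite fails, the call raises feast.dqm.errors.ValidationFailed instead of returning the dataframe. Below is a minimal sketch of handling that case, reusing the reference dataset and profiler from the examples above; exactly how the failed expectations are exposed on the exception may vary between Feast versions:

from feast.dqm.errors import ValidationFailed

try:
    df = job.to_df(
        validation_reference=fs
            .get_saved_dataset("my_reference_dataset")
            .as_reference(profiler=manual_profiler)
    )
except ValidationFailed as exc:
    # The exception carries the details of the expectations that did not pass
    print(exc)
    raise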

Note that Feast currently supports only Great Expectation's ExpectationSuite as a dataset's profile; hence, the profiler function must receive a dataset and return an ExpectationSuite.
