
v0.18-branch

Introduction

What is Feast?

Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production. Feast is able to serve feature data to models from a low-latency online store (for real-time prediction) or from an offline store (for scale-out batch scoring or model training).

Problems Feast Solves

Models need consistent access to data: Machine Learning (ML) systems built on traditional data infrastructure are often coupled to databases, object stores, streams, and files. A result of this coupling, however, is that any change in data infrastructure may break dependent ML systems. Another challenge is that dual implementations of data retrieval for training and serving can lead to inconsistencies in data, which in turn can lead to training-serving skew.

Feast decouples your models from your data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval. Feast also provides a consistent means of referencing feature data for retrieval, and therefore ensures that models remain portable when moving from training to serving.

Deploying new features into production is difficult: Many ML teams consist of members with different objectives. Data scientists, for example, aim to deploy features into production as soon as possible, while engineers want to ensure that production systems remain stable. These differing objectives can create an organizational friction that slows time-to-market for new features.

Feast addresses this friction by providing both a centralized registry to which data scientists can publish features and a battle-hardened serving layer. Together, these enable non-engineering teams to ship features into production with minimal oversight.

Models need point-in-time correct data: ML models in production require a view of data consistent with the one on which they are trained; otherwise, the accuracy of these models could be compromised. Despite this need, many data science projects suffer from inconsistencies introduced by future feature values being leaked to models during training.

Feast solves the challenge of data leakage by providing point-in-time correct feature retrieval when exporting feature datasets for model training.

Features aren't reused across projects: Different teams within an organization are often unable to reuse features across projects. The siloed nature of development and the monolithic design of end-to-end ML systems contribute to duplication of feature creation and usage across teams and projects.

Feast addresses this problem by introducing feature reuse through a centralized registry. This registry enables multiple teams working on different projects not only to contribute features, but also to reuse these same features. With Feast, data scientists can start new ML projects by selecting previously engineered features from a centralized registry, and are no longer required to develop new features for each project.

Problems Feast does not yet solve

Feature engineering: We aim for Feast to support light-weight feature engineering as part of our API.

Feature discovery: We also aim for Feast to include a first-class user interface for exploring and discovering entities and features.

Feature validation: We additionally aim for Feast to improve support for statistics generation of feature data and subsequent validation of these statistics. Current support is limited.

What Feast is not

An ETL or ELT system: Feast is not (and does not plan to become) a general purpose data transformation or pipelining system. Feast plans to include a light-weight feature engineering toolkit, but we encourage teams to integrate Feast with upstream ETL/ELT systems that are specialized in transformation.

Data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.

Data catalog: Feast is not a general purpose data catalog for your organization. Feast is purely focused on cataloging features for use in ML pipelines or systems, and only to the extent of facilitating the reuse of features.

How can I get started?

The best way to learn Feast is to use it. Head over to our Quickstart and try it out!

Explore the following resources to get started with Feast:

Quickstart is the fastest way to get started with Feast.
Concepts describes all important Feast API concepts.
Architecture describes Feast's overall architecture.

Community

Speak to us: Have a question, feature request, idea, or just looking to speak to a real person? Set up a meeting with a Feast maintainer over here!

Links & Resources

Slack: Feel free to ask questions or say hello!
Mailing list: We have both a user and developer mailing list.
- Feast users should join the user mailing list group.

How can I get help?

Slack: Need to speak to a human? Come ask a question in our Slack channel (link above).
GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.
StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to StackOverflow questions.

Community Calls

We have a user and contributor community call every two weeks (Asia & US friendly).

Please join the above Feast user groups in order to see calendar invites to the community calls.

Frequency (alternating times every 2 weeks)

Tuesday 18:00 to 18:30 (US, Asia)
Tuesday 10:00 to 10:30 (US, Europe)

Getting started

Concepts

Overview

The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.

Project

Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment.
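As a rough illustration of resource namespacing, the sketch below prefixes a table name with its project name. The function name and the underscore separator are assumptions made for this example and are not taken from Feast itself.

```python
def namespaced_table_name(project: str, feature_view: str) -> str:
    """Build a table name that is unique per project (hypothetical scheme)."""
    return f"{project}_{feature_view}"


# The same feature view name maps to different tables in different projects,
# so two projects never read or overwrite each other's data.
staging_table = namespaced_table_name("staging", "driver_hourly_stats")
prod_table = namespaced_table_name("prod", "driver_hourly_stats")
print(staging_table)
print(prod_table)
```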

Data source

The data source refers to raw underlying data (e.g. a table in BigQuery).

Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store.

Below is an example data source with a single entity (driver) and two features (trips_today and rating).

Entity

An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.

from feast import Entity, ValueType
driver = Entity(name='driver', value_type=ValueType.STRING, join_key='driver_id')

Entities are typically defined as part of feature views. Entities are used to identify the primary key on which feature values should be stored and retrieved. These keys are used during the lookup of feature values from the online store and the join process in point-in-time joins. It is possible to define composite entities (more than one entity object) in a feature view. It is also possible for feature views to have zero entities. See feature view for more details.

Entities should be reused across feature views.

Entity key

A related concept is an entity key. These are one or more entity values that uniquely describe a feature view record. In the case of an entity (like a driver) that only has a single entity field, the entity is an entity key. However, it is also possible for an entity key to consist of multiple entity values. For example, a feature view with the composite entity of (customer, country) might have an entity key of (1001, 5).

Entity keys act as primary keys. They are used during the lookup of features from the online store, and they are also used to match feature rows across feature views during point-in-time joins.
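To make the primary-key role concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for an online store. The feature name total_orders and the stored values are hypothetical; only the composite key (customer, country) and the example key (1001, 5) come from the text above.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for an online store: each record is keyed
# by an entity key, here the composite (customer, country).
EntityKey = Tuple[int, int]
online_rows: Dict[EntityKey, Dict[str, float]] = {
    (1001, 5): {"total_orders": 12.0},
    (1001, 7): {"total_orders": 3.0},
}


def lookup(entity_key: EntityKey) -> Dict[str, float]:
    """Return the feature values stored for one entity key."""
    return online_rows[entity_key]


print(lookup((1001, 5)))
```

Both entity values are needed to identify a record: customer 1001 has a separate row for each country.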

Feature service

A feature service is an object that represents a logical group of features from one or more feature views. Feature services allow features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model, allowing for tracking of the features used by models.

from driver_ratings_feature_view import driver_ratings_fv
from driver_trips_feature_view import driver_stats_fv

Feature services are used during:

The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features from the online store. The features retrieved from the online store may also belong to multiple feature views.

Applying a feature service does not result in an actual service being deployed.

Feature services can be retrieved from the feature store, and referenced when retrieving features from the online store.

Feature services can also be used when retrieving historical features from the offline store.

Feature retrieval

Dataset

A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.

Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.

Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.

Feature References

Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>
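The string form can be split mechanically. The sketch below is not part of the Feast SDK; the helper names are made up for illustration, and the driver_ratings feature view name is assumed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_feature_ref(ref: str) -> Tuple[str, str]:
    """Split '<feature_view>:<feature>' into its two parts."""
    feature_view, sep, feature = ref.partition(":")
    if not sep or not feature_view or not feature:
        raise ValueError(f"Invalid feature reference: {ref!r}")
    return feature_view, feature


def group_by_feature_view(refs: List[str]) -> Dict[str, List[str]]:
    """Group requested features by the feature view they belong to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ref in refs:
        feature_view, feature = parse_feature_ref(ref)
        grouped[feature_view].append(feature)
    return dict(grouped)


refs = [
    "driver_hourly_stats:trips_today",
    "driver_hourly_stats:earnings_today",
    "driver_ratings:lifetime_rating",
]
print(group_by_feature_view(refs))
```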

Feature references are used for the retrieval of features from Feast:

It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.

Note: if you're using feature views without entities, then those features can be added here without additional entity values in the entity_rows.

Event timestamp

The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.

Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.

Point-in-time joins

Feature values in Feast are modeled as time-series records. Below is an example of a driver feature view with two feature columns (trips_today and earnings_today):

The above table can be registered with Feast through the following feature view:

Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.
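The idea behind a point-in-time join can be sketched in plain Python. This is a simplified illustration rather than Feast's implementation: it ignores details such as time-to-live windows, and all rows and values below are invented for the example.

```python
from datetime import datetime
from typing import Dict, List, Optional

# Time-series feature rows for one feature view (illustrative values).
feature_rows: List[Dict] = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "trips_today": 2},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "trips_today": 5},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 12, 9), "trips_today": 1},
]

# Entity rows: the keys and timestamps we want feature values for.
entity_rows: List[Dict] = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 9)},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 11)},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 12, 8)},
]


def point_in_time_join(entities: List[Dict], features: List[Dict]) -> List[Dict]:
    """Attach the latest feature row at or before each entity timestamp."""
    joined = []
    for entity in entities:
        candidates = [
            row for row in features
            if row["driver_id"] == entity["driver_id"]
            and row["event_timestamp"] <= entity["event_timestamp"]
        ]
        latest: Optional[Dict] = max(
            candidates, key=lambda row: row["event_timestamp"], default=None
        )
        joined.append({**entity, "trips_today": latest["trips_today"] if latest else None})
    return joined


for row in point_in_time_join(entity_rows, feature_rows):
    print(row)
```

Because only rows at or before the entity timestamp are considered, a value observed at 10:00 is never joined onto an entity row stamped 09:00, which is how leakage of future values is avoided.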

Given the following entity dataframe, imagine a user would like to join the above driver_hourly_stats feature view onto it, while preserving the

Dataset

Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. was the primary motivation for creating dataset concept.

A dataset's metadata is stored in the Feast registry, and its raw data (features, entities, additional input keys, and timestamps) is stored in the offline store.

A dataset can be created from:

Results of historical retrieval

Architecture


Overview

Functionality

Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.

Feature repository

Feast users use Feast to manage two important sets of configuration:

Configuration about how to run Feast on your infrastructure
Feature definitions

With Feast, the above configuration can be written declaratively and stored as code in a central location. This central location is called a feature repository. The feature repository is the declarative source of truth for what the desired state of a feature store should be.

Registry

The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.

Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports three different backends:

Local: Used as a local backend for storing the registry during development

Offline store

Feast uses offline stores as storage and compute systems. Offline stores store historic time-series feature values. Feast does not generate these features, but instead uses the offline store as the interface for querying existing features in your organization.

Offline stores are used primarily for two reasons:

Building training datasets from time-series features.
Materializing (loading) features from the offline store into an online store in order to serve those features at low latency for prediction.

Online store

The Feast online store is used for low-latency online feature value lookups. Feature values are loaded into the online store from data sources in feature views using the materialize command.

The storage schema of features within the online store mirrors that of the data source used to populate the online store. One key difference between the online store and data sources is that only the latest feature values are stored per entity key. No historical values are stored.
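The "latest value per entity key" behavior can be sketched as a reduction over time-series rows. This is an illustration in plain Python with made-up rows, not the code Feast runs during materialization.

```python
from datetime import datetime
from typing import Dict, List

source_rows: List[Dict] = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "trips_today": 2},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "trips_today": 5},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 12, 9), "trips_today": 1},
]


def latest_per_entity_key(rows: List[Dict]) -> Dict[int, Dict]:
    """Keep only the most recent row for each entity key."""
    latest: Dict[int, Dict] = {}
    for row in rows:
        key = row["driver_id"]
        if key not in latest or row["event_timestamp"] > latest[key]["event_timestamp"]:
            latest[key] = row
    return latest


online_view = latest_per_entity_key(source_rows)
print(online_view[1001]["trips_today"])
```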

Example batch data source

Once the above data source is materialized into Feast (using feast materialize), the feature values will be stored as follows:

Provider

A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).

Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
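Conceptually, overriding a provider default is a merge of user settings over the provider's defaults. The sketch below models that with dictionaries. The key names and the resolve_components helper are assumptions for illustration and do not reflect how Feast resolves its configuration internally.

```python
from typing import Dict

# Default components of the gcp provider as described above.
GCP_DEFAULTS: Dict[str, str] = {"offline_store": "bigquery", "online_store": "datastore"}


def resolve_components(defaults: Dict[str, str], overrides: Dict[str, str]) -> Dict[str, str]:
    """Start from the provider defaults and apply user overrides on top."""
    return {**defaults, **overrides}


# Keep BigQuery as the offline store but swap Datastore for Redis.
components = resolve_components(GCP_DEFAULTS, {"online_store": "redis"})
print(components)
```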

If the built-in providers are not sufficient, you can create your own custom provider. Please see the guide on creating a custom provider for more details.

Please see feature_store.yaml for configuring providers.

Tutorials

Overview

These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.

Fraud detection on GCP
Driver ranking
Real-time credit scoring on AWS
Driver stats on Snowflake
Validating historical features with Great Expectations

Driver ranking

Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.

In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.

Driver Ranking Example

This tutorial guides you on how to use Feast with Scikit-learn. You will learn how to:

Train a model locally (on your laptop) using data from
Test the model for online inference using (for fast iteration)
Test the model for online inference using (for production use)

Try it and let us know what you think!

Fraud detection on GCP

Fraud detection is a common use case in machine learning. This tutorial builds an end-to-end, production-ready fraud prediction system that predicts in real time whether a transaction made by a user is fraudulent.

Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our end-to-end example will perform the following workflows:

Real-time credit scoring on AWS

Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.

In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.

This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.

How-to Guides

Running Feast with Snowflake/GCP/AWS

Install Feast

Install Feast using pip:

Install Feast with Snowflake dependencies (required when using Snowflake):

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):

Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See Feature repository for a detailed explanation of feature repositories.

The easiest way to create a new feature repository is to use the feast init command:

The init

Deploy a feature store

The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.

Here we'll be using the example repository we created in the previous guide, Create a feature repository. You can re-create it by running feast init in a new directory.

Deploying

To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:

Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.

At this point, no data has been materialized to your online store. feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See Load data into the online store for more details.

Cleaning up

If you need to clean up the infrastructure created by feast apply, use the teardown command.

Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!


Load data into the online store

Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.

Materializing features

1. Register feature views

Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.

2.a Materialize

The materialize command allows users to materialize features over a specific historical time range into the online store.

The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.

It is also possible to materialize for specific feature views by using the -v / --views argument.

The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.

2.b Materialize Incremental (Alternative)

For simplicity, Feast also provides a materialize-incremental command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.

The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).

The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.

Unlike materialize, materialize-incremental automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.

Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e., different feature views may have different periods that have been materialized into the online store.
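The start-time bookkeeping described above can be sketched as a small state tracker. The class and method names are hypothetical, and the state lives in memory here, whereas Feast keeps it in the feature registry.

```python
from datetime import datetime
from typing import Dict, Tuple


class IncrementalState:
    """Tracks the last materialized end time per feature view (hypothetical)."""

    def __init__(self) -> None:
        self._last_end: Dict[str, datetime] = {}

    def next_range(
        self, feature_view: str, oldest_source_ts: datetime, end: datetime
    ) -> Tuple[datetime, datetime]:
        """Return the (start, end) range to load and record the new end time."""
        # First run: start from the oldest timestamp in the data source.
        # Later runs: start from where the previous run ended.
        start = self._last_end.get(feature_view, oldest_source_ts)
        self._last_end[feature_view] = end
        return start, end


state = IncrementalState()
oldest = datetime(2021, 1, 1)
first = state.next_range("driver_hourly_stats", oldest, datetime(2021, 4, 7))
second = state.next_range("driver_hourly_stats", oldest, datetime(2021, 4, 8))
print(first)
print(second)
```

The second call starts at the first call's end time, and because state is kept per feature view, a feature view that has never been materialized still starts from its oldest timestamp.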

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Online stores only maintain the current state of features, i.e., the latest feature values. No historical data is stored or served.

Retrieving online features

1. Ensure that feature values have been loaded into the online store

Please ensure that you have materialized (loaded) your feature values into the online store before starting.

2. Define feature references

Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.

3. Read online features

Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.

Deploying a Java feature server on Kubernetes

This tutorial guides you on how to:

Define features and data sources in Feast using the Feast CLI
Materialize features to a Redis cluster deployed on Kubernetes.
Deploy a Feast Java feature server into a Kubernetes cluster using the Feast helm charts
Retrieve features using the gRPC API exposed by the Feast Java server

Try it and let us know what you think!

Reference

Data sources

Please see Data source for an explanation of data sources.

File

Description

File data sources allow for the retrieval of historical feature values from files on disk for building training datasets, as well as for materializing features into an online store.

FileSource is meant for development purposes only and is not optimized for production use.

Example

Configuration options are available .

Snowflake

Description

Snowflake data sources allow for the retrieval of historical feature values from Snowflake for building training datasets as well as materializing features into an online store.

Either a table reference or a SQL query can be provided.

BigQuery

Description

BigQuery data sources allow for the retrieval of historical feature values from BigQuery for building training datasets as well as materializing features into an online store.

Either a table reference or a SQL query can be provided.
No performance guarantees can be provided over SQL query-based sources. Please use table references where possible.

Examples

Using a table reference

Using a query

Configuration options are available .

Redshift

Description

Redshift data sources allow for the retrieval of historical feature values from Redshift for building training datasets as well as materializing features into an online store.

Either a table name or a SQL query can be provided.
No performance guarantees can be provided over SQL query-based sources. Please use table references where possible.

Examples

Using a table name

Using a query

Configuration options are available .

Offline stores

Please see Offline Store for an explanation of offline stores.

File
Snowflake
BigQuery
Redshift

File

Description

The File offline store provides support for reading FileSources.

Only Parquet files are currently supported.

Snowflake

Description

The Snowflake offline store provides support for reading SnowflakeSources.

Snowflake tables and views are allowed as sources.

BigQuery

Description

The BigQuery offline store provides support for reading BigQuerySources.

BigQuery tables and views are allowed as sources.
All joins happen within BigQuery.
Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be uploaded to BigQuery in order to complete join operations.
A BigQueryRetrievalJob is returned when calling get_historical_features().

Example

Configuration options are available .

Online stores

Please see Online Store for an explanation of online stores.

SQLite
Redis
Datastore
DynamoDB

SQLite

Description

The SQLite online store provides support for materializing feature values into an SQLite database for serving online features.

All feature values are stored in an on-disk SQLite database.

Redis

Description

The Redis online store provides support for materializing feature values into Redis.

Both Redis and Redis Cluster are supported.

Datastore

Description

The Datastore online store provides support for materializing feature values into Cloud Datastore. The data model used to store feature values in Datastore is described in more detail .

Providers

Please see Provider for an explanation of providers.

Local
Google Cloud Platform
Amazon Web Services

Local

Description

Offline Store: Uses the File offline store by default. Also supports BigQuery as the offline store.

Amazon Web Services

Description

Offline Store: Uses the Redshift offline store by default. Also supports File as the offline store.

.feastignore

Overview

.feastignore is a file that is placed at the root of the feature repository. This file contains paths that should be ignored when running feast apply. An example .feastignore is shown below:

.feastignore
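To show the effect of ignore paths, the sketch below filters a list of repository files against glob-style patterns using Python's fnmatch module. The patterns, file names, and the is_ignored helper are all hypothetical, and fnmatch matching is an assumption for this example, so it may differ from how Feast itself interprets .feastignore entries.

```python
from fnmatch import fnmatch
from typing import List

# Hypothetical ignore patterns, one per line of a .feastignore file.
ignore_patterns: List[str] = ["venv/*", "scripts/foo/*test.py"]


def is_ignored(path: str, patterns: List[str]) -> bool:
    """Return True if the repo-relative path matches any ignore pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


repo_files = ["features/driver.py", "venv/lib/site.py", "scripts/foo/a_test.py"]
kept = [path for path in repo_files if not is_ignored(path, ignore_patterns)]
print(kept)
```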

Feature servers

Feast users can choose to retrieve features from a feature server, as opposed to through the Python SDK.

[Alpha] On demand feature view

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

To enable this feature, run feast alpha enable on_demand_transforms

[Alpha] Stream ingestion

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

To enable this feature, run feast alpha enable direct_ingest_to_online_store

Usage

How Feast SDK usage is measured

The Feast project logs anonymous usage statistics and errors in order to inform our planning. Several client methods are tracked, beginning in Feast 0.9. Users are assigned a UUID which is sent along with the name of the method, the Feast version, the OS (using sys.platform), and the current time.
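For a sense of what such a record contains, here is a sketch that assembles the fields listed above into a dictionary. The field names, the build_usage_event helper, and the version string are illustrative assumptions. This is not the payload format Feast actually sends.

```python
import sys
import uuid
from datetime import datetime, timezone
from typing import Dict


def build_usage_event(method_name: str, feast_version: str, user_id: str) -> Dict[str, str]:
    """Assemble the fields described above into one event (hypothetical layout)."""
    return {
        "user_id": user_id,
        "method": method_name,
        "version": feast_version,
        "os": sys.platform,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


anonymous_id = str(uuid.uuid4())  # random UUID, carries no personal information
event = build_usage_event("get_online_features", "0.18.0", anonymous_id)
print(sorted(event))
```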

The is available here.

Project

Contribution process

We use RFCs and GitHub issues to communicate development ideas. The simplest way to contribute to Feast is to leave comments in our RFCs in the Feast Google Drive or our GitHub issues. You will need to join our Google Group in order to get access.

We follow a process of lazy consensus. If you believe you know what the project needs then just start development. If you are unsure about which direction to take with development then please communicate your ideas through a GitHub issue or through our Slack Channel before starting development.

Please submit a PR to the master branch of the Feast repository once you are ready to submit your contribution. Code submissions to Feast (including submissions from project maintainers) require review and approval from maintainers or code owners.

PRs that are submitted by the general public need to be identified as ok-to-test. Once enabled, will run a range of tests to verify the submission, after which community members will help to review the pull request.

Please sign the in order to have your code merged into the Feast repository.

Adding a new offline store

Overview

Feast makes adding support for a new offline store (database) easy. Developers can simply implement the OfflineStore interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and BigQuery).

In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative for adding support for any new offline store.

The full working code for this guide can be found at .

The process for using a custom offline store consists of 4 steps:

Defining an OfflineStore class.
Defining an OfflineStoreConfig class.
Defining a RetrievalJob class.
Defining a DataSource class for the offline store.

1. Defining an OfflineStore class

OfflineStore class names must end with the OfflineStore suffix!

The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.

There are two methods that deal with reading data from the offline store: get_historical_features and pull_latest_from_table_or_query.

pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.
get_historical_features

2. Defining an OfflineStoreConfig class

Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.

To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined .

The FeastConfigBaseModel is a pydantic class, which parses YAML configuration into Python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.

Additionally, the name of the config class must be the same as the OfflineStore class, with the Config suffix.

An example of the config class for the custom file offline store:

This configuration can be specified in the feature_store.yaml as follows:

This configuration information is available to the methods of the OfflineStore via the config: RepoConfig parameter, which is passed into the methods of the OfflineStore interface, specifically at the config.offline_store field of the config parameter.
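The naming rules in this section (the OfflineStore suffix, the Config suffix, and a type field holding the fully qualified class name) can be sketched without Feast itself. This is a minimal illustration only: the module path feast_custom_offline_store.file, the class name CustomFileOfflineStore, and the helper config_class_path are assumptions made for the example, not part of the Feast API.

```python
def config_class_path(store_class_path: str) -> str:
    """Derive the expected config class path from a fully qualified store class path."""
    module_name, class_name = store_class_path.rsplit(".", 1)
    if not class_name.endswith("OfflineStore"):
        raise ValueError(f"{class_name!r} must end with the OfflineStore suffix")
    # The config class lives in the same module and adds the Config suffix.
    return f"{module_name}.{class_name}Config"


# Hypothetical value of the type field in feature_store.yaml.
store_type = "feast_custom_offline_store.file.CustomFileOfflineStore"
print(config_class_path(store_type))
```

A class name that lacks the OfflineStore suffix is rejected up front, which mirrors the requirement stated in step 1.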

3. Defining a RetrievalJob class

The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob instance, which represents the execution of the actual query against the underlying store.

Custom offline stores may need to implement their own instances of the RetrievalJob interface.

The RetrievalJob interface exposes two methods: to_df and to_arrow. The retrieval job is expected to return the rows read from the offline store as a pandas DataFrame or as an Arrow table, respectively.
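The lazy-execution idea can be illustrated with a small stand-in that uses only the standard library. The names LazyRetrievalJob, to_rows, and read_store are hypothetical; a real implementation would subclass Feast's RetrievalJob and expose to_df and to_arrow instead of to_rows.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class LazyRetrievalJob:
    """Stand-in for a RetrievalJob: holds a query and runs it only on demand."""

    def __init__(self, query_fn: Callable[[], List[Row]]) -> None:
        self._query_fn = query_fn  # nothing is read at construction time

    def to_rows(self) -> List[Row]:
        # Plays the role of to_df / to_arrow: the read happens here.
        return self._query_fn()


calls: List[str] = []


def read_store() -> List[Row]:
    # Hypothetical read against the backing store.
    calls.append("read")
    return [{"driver_id": 1001, "conv_rate": 0.5}]


job = LazyRetrievalJob(read_store)  # what an offline store method would return
assert calls == []                  # no read has happened yet
rows = job.to_rows()                # the query executes only now
print(rows, calls)
```

Returning the job object rather than the data lets the caller decide when, and in which format, the query result is materialized.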

4. Defining a DataSource class for the offline store

Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.

The data source class should implement two methods - from_proto, and to_proto.

For custom offline stores that are not being implemented in the main feature repo, the custom_options field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto method and reading the value back from bytes in the from_proto method.
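As a rough sketch of that round trip, the example below serializes a small options object to bytes and reads it back. The class CustomFileOptions, its fields, and the choice of JSON as the byte encoding are all assumptions made for illustration; any stable encoding would serve, and the real to_proto and from_proto methods would place these bytes in, and read them from, the custom_options field.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CustomFileOptions:
    """Hypothetical options a custom data source might need to persist."""

    path: str
    timestamp_field: str

    def to_bytes(self) -> bytes:
        # What a to_proto implementation could store in custom_options.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "CustomFileOptions":
        # What a from_proto implementation could do to read the value back.
        data = json.loads(raw.decode("utf-8"))
        return cls(path=data["path"], timestamp_field=data["timestamp_field"])


options = CustomFileOptions(
    path="data/driver_stats.parquet", timestamp_field="event_timestamp"
)
restored = CustomFileOptions.from_bytes(options.to_bytes())
assert restored == options  # the round trip preserves the configuration
```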

5. Using the custom offline store

After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the offline_store field. The value specified should be the fully qualified class name of the OfflineStore.

As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.
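Dynamic loading from a fully qualified class name generally follows the pattern below. This is a minimal sketch of the technique using importlib, not Feast's actual loader; import_class is a hypothetical helper, and a standard-library class is used so that the example runs anywhere.

```python
import importlib


def import_class(fully_qualified_name: str) -> type:
    """Import a class given a 'package.module.ClassName' string."""
    module_name, class_name = fully_qualified_name.rsplit(".", 1)
    # Raises ModuleNotFoundError if the module is not installed in the environment.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class instead of a custom store.
cls = import_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

This is why the class only needs to be importable from the Python environment in which Feast runs; no registration step is involved in the sketch.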

To use our custom file offline store, we can use the following feature_store.yaml:

If additional configuration for the offline store is not required, then we can omit the other fields and only specify the type of the offline store class as the value for the offline_store.

Finally, the custom data source class can be used in the feature repo to define a data source, which can then be referenced in a feature view definition.

6. Testing the OfflineStore class

Even if you have created the OfflineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. In the Feast submodule, we can run all the unit tests with:

The universal tests, which are integration tests specifically intended to test offline and online stores, can be run with:

The unit tests should succeed, but the universal tests will likely fail. The tests are parametrized based on the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py. To overwrite these configurations, you can simply create your own file that contains a FULL_REPO_CONFIGS, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE to point to that file. The main challenge there will be to write a DataSourceCreator for the offline store. In this repo, the file that overwrites FULL_REPO_CONFIGS is feast_custom_offline_store/feast_tests.py, so you would run

to test the offline store against the Feast universal tests. You should notice that some of the tests actually fail; this indicates that there is a mistake in the implementation of this offline store!

Adding a new online store

Overview

Feast makes adding support for a new online store (database) easy. Developers can simply implement the OnlineStore interface to add support for a new store (other than the existing stores like Redis, DynamoDB, SQLite, and Datastore).

In this guide, we will show you how to integrate with MySQL as an online store. While we will be implementing a specific store, this guide should be representative for adding support for any new online store.

The full working code for this guide can be found at feast-dev/feast-custom-online-store-demo.

The process of using a custom online store consists of 3 steps:

Defining the OnlineStore class.
Defining the OnlineStoreConfig class.
Referencing the OnlineStore in a feature repo's feature_store.yaml file.

1. Defining an OnlineStore class

OnlineStore class names must end with the OnlineStore suffix!

The OnlineStore class broadly contains two sets of methods:

One set deals with managing the infrastructure that the online store needs for operations.
One set deals with writing data into the store, and reading data from the store.

1.1 Infrastructure Methods

There are two methods that deal with managing infrastructure for online stores: update and teardown.

update is invoked when users run feast apply as a CLI command, or call the FeatureStore.apply() SDK method.

The update method should be used to perform any operations necessary before data can be written to or read from the store. The update method can be used to create MySQL tables in preparation for reads and writes to new feature views.

teardown is invoked when users run feast teardown or FeatureStore.teardown().

The teardown method should be used to perform any clean-up operations. teardown can be used to drop MySQL indices and tables corresponding to the feature views being deleted.

1.2 Read/Write Methods

There are two methods that deal with writing data to and reading data from the online store: online_write_batch and online_read.

online_write_batch is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method).
online_read is invoked when reading values from the online store using the FeatureStore.get_online_features() method.
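The division of labour between the four methods can be sketched with a self-contained stand-in. This is an illustration under loud assumptions: it uses in-memory SQLite from the standard library in place of MySQL, the class SketchOnlineStore is hypothetical, and the method signatures are heavily simplified (the real OnlineStore methods also receive the config: RepoConfig parameter and work with Feast's own entity key and value types).

```python
import sqlite3
from datetime import datetime
from typing import Dict, List, Optional, Tuple


class SketchOnlineStore:
    """Simplified stand-in for an OnlineStore, backed by in-memory SQLite."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")

    def update(self, tables_to_keep: List[str], tables_to_delete: List[str]) -> None:
        # Prepare one table per feature view; drop tables for deleted views.
        # Table names are interpolated directly, which is acceptable only in a sketch.
        for name in tables_to_keep:
            self._conn.execute(
                f'CREATE TABLE IF NOT EXISTS "{name}" '
                "(entity_key TEXT, feature_name TEXT, value TEXT, event_ts TEXT, "
                "PRIMARY KEY (entity_key, feature_name))"
            )
        self.teardown(tables_to_delete)

    def teardown(self, tables: List[str]) -> None:
        # Clean up everything that update created for these feature views.
        for name in tables:
            self._conn.execute(f'DROP TABLE IF EXISTS "{name}"')

    def online_write_batch(
        self, table: str, data: List[Tuple[str, Dict[str, str], datetime]]
    ) -> None:
        # Called during materialization: upsert the latest value per feature.
        for entity_key, values, event_ts in data:
            for feature_name, value in values.items():
                self._conn.execute(
                    f'INSERT OR REPLACE INTO "{table}" VALUES (?, ?, ?, ?)',
                    (entity_key, feature_name, value, event_ts.isoformat()),
                )
        self._conn.commit()

    def online_read(
        self, table: str, entity_keys: List[str]
    ) -> List[Optional[Dict[str, str]]]:
        # Called when serving: one result per requested key, None if missing.
        results: List[Optional[Dict[str, str]]] = []
        for key in entity_keys:
            rows = self._conn.execute(
                f'SELECT feature_name, value FROM "{table}" WHERE entity_key = ?',
                (key,),
            ).fetchall()
            results.append(dict(rows) if rows else None)
        return results


store = SketchOnlineStore()
store.update(tables_to_keep=["driver_hourly_stats"], tables_to_delete=[])
store.online_write_batch(
    "driver_hourly_stats",
    [("driver:1001", {"conv_rate": "0.5"}, datetime(2022, 1, 1))],
)
print(store.online_read("driver_hourly_stats", ["driver:1001", "driver:9999"]))
```

The result list keeps the order of the requested keys and uses None for keys with no stored values, so callers can align results with their requests.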

2. Defining an OnlineStoreConfig class

Additional configuration may be needed to allow the OnlineStore to talk to the backing store. For example, MySQL may need configuration information like the host at which the MySQL instance is running, credentials for connecting to the database, etc.

To facilitate configuration, all OnlineStore implementations are required to also define a corresponding OnlineStoreConfig class in the same file. This OnlineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined .

This config class must contain a type field, which contains the fully qualified class name of its corresponding OnlineStore class.

Additionally, the name of the config class must be the same as the OnlineStore class, with the Config suffix.

An example of the config class for MySQL:

This configuration can be specified in the feature_store.yaml as follows:

This configuration information is available to the methods of the OnlineStore via the config: RepoConfig parameter, which is passed into all the methods of the OnlineStore interface, specifically at the config.online_store field of the config parameter.

3. Using the custom online store

After implementing both these classes, the custom online store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the online_store field. The value specified should be the fully qualified class name of the OnlineStore.

As long as your OnlineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

To use our MySQL online store, we can use the following feature_store.yaml:

If additional configuration for the online store is not required, then we can omit the other fields and only specify the type of the online store class as the value for the online_store.

4. Testing the OnlineStore class

Even if you have created the OnlineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. In the Feast submodule, we can run all the unit tests with:

The universal tests, which are integration tests specifically intended to test offline and online stores, can be run with:

The unit tests should succeed, but the universal tests will likely fail. The tests are parametrized based on the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py. To overwrite these configurations, you can simply create your own file that contains a FULL_REPO_CONFIGS, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE to point to that file. In this repo, the file that overwrites FULL_REPO_CONFIGS is feast_custom_online_store/feast_tests.py, so you would run

to test the MySQL online store against the Feast universal tests. You should notice that some of the tests actually fail; this indicates that there is a mistake in the implementation of this online store!

Running Feast in production

Overview

After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise.

An overview of a typical production configuration is given below:

Important note: We're trying to keep Feast modular. With the exception of the core, most of the Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration. For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features.

Furthermore, there's no single "true" approach. As you will see in this guide, Feast usually provides several options for each problem. It's totally up to you to pick a path that's better suited to your needs.

In this guide we will show you how to:

Deploy your feature store and keep your infrastructure in sync with your feature repository
Keep the data in your online store up to date
Use Feast for model training and serving

1. Automatically deploying changes to your feature definitions

The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits.

Most teams will need to have a feature store deployed to more than one environment. We have created an example repository which contains two Feast projects, one per environment.

The contents of this repository are shown below:

The repository contains three sub-folders:

staging/: This folder contains the staging feature_store.yaml and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.
production/: This folder contains the production feature_store.yaml and Feast objects. Typically, users first test changes in staging, then copy the feature definitions into the production folder and commit the changes.

The feature_store.yaml contains the following:

Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply are tracked in the registry.db. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.

It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.

If your organization consists of many independent data science teams or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment:

In summary, once you have set up a Git based repository with CI that runs feast apply on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.

2. How to load data into your online store and keep it up to date

To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.

2.1. Manual materializations

The simplest way to schedule materialization is to run an incremental materialization using the Feast CLI:

The above command will load all feature values from all feature view sources into the online store up to the time 2022-01-01T00:00:00.

A timestamp is required to set the end date for materialization. If your source is fully up to date then the end date would be the current time. However, if you are querying a source where data is not yet available, then you do not want to set the timestamp to the current time. You would want to use a timestamp that ends at a date for which data is available. The next time materialize-incremental is run, Feast will load data that starts from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not been made available. This is commonly the case when your source is an ETL pipeline that is scheduled on a daily basis.
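The rule for choosing the end timestamp can be written down as a small helper. This is a sketch only: next_interval is a hypothetical function, and in practice Feast tracks the previous end date itself while you supply only the end timestamp on the command line.

```python
from datetime import datetime
from typing import Optional, Tuple


def next_interval(
    previous_end: datetime, now: datetime, data_available_until: datetime
) -> Optional[Tuple[datetime, datetime]]:
    """Pick the interval for the next incremental run, or None if nothing is ready."""
    # Never materialize past the point up to which the source has data.
    end = min(now, data_available_until)
    if end <= previous_end:
        return None  # the source has not produced anything new yet
    return previous_end, end


# A daily ETL has produced data up to midnight, but it is already 15:00.
interval = next_interval(
    previous_end=datetime(2022, 1, 1),
    now=datetime(2022, 1, 2, 15, 0),
    data_available_until=datetime(2022, 1, 2),
)
print(interval)
```

Using the current time as the end date here would mark the afternoon of 2022-01-02 as materialized before its data exists, so the next incremental run would skip it.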

An alternative approach to incremental materialization (where Feast tracks the intervals of data that need to be ingested) is to call Feast directly from a scheduler such as Airflow. In this case, Airflow is the system that tracks the intervals that have been ingested.

In the above example we are materializing the source data from the driver_hourly_stats feature view over a day. This command can be scheduled as the final operation in your Airflow ETL, which runs after you have computed your features and stored them in the source location. Feast will then load your feature data into your online store.

The timestamps above should match the interval of data that has been computed by the data transformation system.
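When the scheduler owns the intervals, the command for each run can be derived from the interval the transformation job has just computed. The sketch below builds such a command for a one-day window; materialize_command is a hypothetical helper, and the one-day window is an assumption taken from the example above.

```python
from datetime import datetime, timedelta
from typing import List


def materialize_command(view: str, day_start: datetime) -> List[str]:
    """Build a `feast materialize` invocation covering one day for one feature view."""
    day_end = day_start + timedelta(days=1)
    return [
        "feast",
        "materialize",
        "--views",
        view,
        day_start.isoformat(),  # start of the computed interval
        day_end.isoformat(),    # end of the computed interval
    ]


command = materialize_command("driver_hourly_stats", datetime(2022, 1, 1))
print(" ".join(command))
```

A scheduler task would run this command after the transformation step for the same day has finished, passing in that run's interval start.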

2.2. Automate periodic materializations

It is up to you which orchestrator or scheduler to use to periodically run feast materialize. Feast keeps the history of materialization in its registry, so the choice could be as simple as a cron util. A cron util should be sufficient when you have just a few materialization jobs (usually one materialization job per feature view) that are triggered infrequently. However, the amount of work can quickly outgrow the resources of a single machine, because the materialization job needs to repackage all rows before writing them to an online store, which leads to high CPU and memory utilization. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.

If you are using Airflow as a scheduler, Feast can be invoked through the after the has been installed into a virtual environment and your feature repo has been synced:

Important note: The Airflow worker must have read and write permissions to the registry file on GS / S3, since it pulls configuration and updates materialization history.

3. How to use Feast for model training

After we've defined our features and data sources in the repository, we can generate training datasets.

The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.

One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.

Then, training data can be retrieved as follows:

The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the list of feature references also be saved alongside the model. This ensures that models and the features they are trained on are paired together when being shipped into production:
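One simple way to keep the two together is to write the feature references into the same directory as the model artifact. The sketch below does this with JSON; the file names, the helper functions, and the feature names conv_rate and acc_rate are assumptions for illustration, and the empty model file stands in for whatever your training framework serializes.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# Hypothetical feature references used to train the model.
feature_refs = ["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"]


def save_feature_refs(model_dir: Path, refs: List[str]) -> Path:
    """Store the feature references next to the model artifact."""
    path = model_dir / "feature_refs.json"
    path.write_text(json.dumps(refs))
    return path


def load_feature_refs(model_dir: Path) -> List[str]:
    """Read back the feature references that were shipped with the model."""
    return json.loads((model_dir / "feature_refs.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "my_model.bin").write_bytes(b"")  # placeholder for the saved model
    save_feature_refs(model_dir, feature_refs)
    # At serving time, the same list is loaded and used for online retrieval.
    assert load_feature_refs(model_dir) == feature_refs
```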

To test your model locally, you can simply create a FeatureStore object, fetch online features, and then make a prediction:

It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.

4. Retrieving online features for prediction

Once you have successfully loaded (or, in Feast terminology, materialized) your data from batch sources into the online store, you can start consuming features for model inference. There are three approaches for that purpose, sorted from the simplest (in an operational sense) to the most performant (benchmarks to be published soon):

4.1. Use the Python SDK within an existing Python service

This approach is the most convenient way to keep your infrastructure minimal and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.

To integrate online retrieval into your service use the following code:

4.2. Consume features via HTTP API from Serverless Feature Server

If you don't want to add the Feast Python SDK as a dependency, or your feature retrieval service is written in a non-Python language, Feast can deploy a simple feature server on serverless infrastructure (eg, AWS Lambda, Google Cloud Run) for you. This service will provide an HTTP API with JSON I/O, which can be easily used with any programming language.

4.3. Java based Feature Server deployed on Kubernetes

For users with very latency-sensitive and high QPS use-cases, Feast offers a high-performance Java feature server. Besides the benefits of running on JVM, this implementation also provides a gRPC API, which guarantees good connection utilization and small request / response body size (compared to JSON). You will need the Feast Java SDK to retrieve features from this service. This SDK wraps all the gRPC logic for you and provides more convenient APIs.

The Java based feature server can be deployed to a Kubernetes cluster via Helm charts in a few simple steps:

Install and
Add the Feast Helm repository and download the latest charts:

Run Helm Install

This chart will deploy two services: feature-server and transformation-service. Both must have read access to the registry file on cloud storage. Both will keep a copy of the registry in their memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
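The trade-off described here, an in-memory copy refreshed on a schedule, can be sketched as a small time-based cache. Everything in the example is an assumption for illustration: CachedRegistry is a hypothetical class, the 60-second refresh interval is arbitrary, and a plain dict stands in for the registry contents. The clock is injected so that the behaviour can be shown without waiting.

```python
import time
from typing import Callable, Optional


class CachedRegistry:
    """Keeps a copy of the registry in memory and reloads it after a TTL."""

    def __init__(
        self,
        load: Callable[[], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._loaded_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now - self._loaded_at >= self._ttl:
            self._value = self._load()  # the only place remote storage is read
            self._loaded_at = now
        return self._value


remote = {"version": 1}  # stands in for the registry file on cloud storage
fake_now = [0.0]
registry = CachedRegistry(
    load=lambda: dict(remote), ttl_seconds=60, clock=lambda: fake_now[0]
)

assert registry.get()["version"] == 1
remote["version"] = 2                   # a new `feast apply` updates the registry
assert registry.get()["version"] == 1   # still serving the cached copy
fake_now[0] = 60.0                      # the refresh interval has elapsed
assert registry.get()["version"] == 2   # the update is now visible
```

Requests between refreshes never touch remote storage, which is where the performance gain comes from, at the cost of the propagation delay shown in the middle assertion.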

Load balancing

The next step would be to install an L7 Load Balancer (eg, ) in front of the Java feature server. For seamless integration with Kubernetes (including services created by Feast Helm chart) we recommend using as Envoy's orchestrator.

5. Ingesting features from a stream source

Recently Feast added functionality for . Please note that this is still in an early phase and new incompatible changes may be introduced.

5.1. Using Python SDK in your Apache Spark / Beam pipeline

The default option to write features from a stream is to add the Python SDK into your existing PySpark / Beam pipeline. The Feast SDK provides a writer implementation that can be called from the foreachBatch stream writer in PySpark like this:

5.2. Push service (still under development)

Alternatively, if you want to ingest features directly from a broker (eg, Kafka or Kinesis), you can use the "push service", which will write to an online store. This service will expose an HTTP API; alternatively, when deployed on serverless platforms such as AWS Lambda or Google Cloud Run, it can be connected directly to Kinesis or PubSub.

If you are using Kafka, could be utilized as a middleware. In this case, the "push service" can be deployed on Kubernetes or as a Serverless function.

6. Monitoring

Feast services can report their metrics to a StatsD-compatible collector. To activate this function, you'll need to provide a StatsD IP address and a port when deploying the helm chart (in future, this will be added to feature_store.yaml).

We use an for StatsD format to be able to send tags along with metrics. Keep that in mind while selecting the collector ( will work for sure).

We chose StatsD since it's a de-facto standard with various implementations (eg, , ) and metrics can be easily exported to Prometheus, InfluxDB, AWS CloudWatch, etc.

Summary

To summarize, here are several architecture options that will most frequently be used in production:

Option #1 (currently preferred)

The Feast SDK is triggered by CI (eg, GitHub Actions). It applies the latest changes from the feature repo to the Feast registry.
Airflow manages materialization jobs to ingest data from the DWH to the online store periodically.
For stream ingestion, the Feast Python SDK is used in the existing Spark / Beam pipeline.

Option #2 (still in development)

Same as Option #1, except:

The push service is deployed as AWS Lambda / Google Cloud Run and is configured as a sink for Kinesis or PubSub to ingest features directly from a stream broker. Lambda / Cloud Run is managed by the Feast SDK (from the CI environment).
Materialization jobs are managed inside Kubernetes via Kubernetes Job (currently not managed by Helm)

Option #3 (still in development)

Same as Option #2, except:

Push service is deployed on Kubernetes cluster and exposes an HTTP API that can be used as a sink for Kafka (via kafka-http connector) or accessed directly.
