
v0.37-branch

Introduction

Feast (Feature Store) is a customizable operational data system that re-uses existing infrastructure to manage and serve machine learning features to real-time models.

Feast allows ML platform teams to:

Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).

Community & getting help

Links & Resources

GitHub Repository: Find the complete Feast codebase on GitHub.
Community Governance Doc: See the governance model of Feast, including who the maintainers are and how decisions are made.
Google Folder: This folder is used as a central repository for all Feast resources. For example:
- Design proposals in the form of Request for Comments (RFC).
- User surveys and meeting minutes.
- Slide decks of conferences our contributors have spoken at.
LFAI Wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.

GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.

Getting started

Concepts

Overview
Data ingestion
Entity
Feature view
Feature retrieval
Point-in-time joins
Registry
[Alpha] Saved dataset

Entity

An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.

The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.

Entities are used by Feast in many contexts, as we explore below:

Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:

[Alpha] Saved dataset

Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. Data Quality Monitoring was the primary motivation for creating the dataset concept.

A dataset's metadata is stored in the Feast registry, and its raw data (features, entities, additional input keys and timestamp) is stored in the offline store.

A dataset can be created from:

Results of historical retrieval

Architecture

Overview
Registry
Offline store
Online store
Batch Materialization Engine
Provider

Registry

The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.

Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports four different backends.

Local: Used as a local backend for storing the registry during development
S3

Offline store

An offline store is an interface for working with historical time-series feature values that are stored in data sources. The OfflineStore interface has several different implementations, such as the BigQueryOfflineStore, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see Offline Stores.

Offline stores are primarily used for two reasons:

Building training datasets from time-series features.
Materializing (loading) features into an online store to serve those features at low-latency in a production setting.

Offline stores are configured through the feature_store.yaml file. When building training datasets or materializing features into an online store, Feast will use the configured offline store with your configured data sources to execute the necessary data operations.

Only a single offline store can be used at a time. Moreover, offline stores are not compatible with all data sources; for example, the BigQuery offline store cannot be used to query a file-based data source.

Please see Push Source for more details on how to push features directly to the offline store in your feature store.

Online store

Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize command.

The storage schema of features within the online store mirrors that of the original data source. One key difference is that for each entity key, only the latest feature values are stored. No historical values are stored.

Here is an example batch data source:

Once the above data source is materialized into Feast (using feast materialize), the feature values will be stored as follows:

Features can also be written directly to the online store via push sources.
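
The "latest values only" behaviour described in this section can be illustrated with a small, self-contained sketch. This is not how Feast implements materialization; the function materialize_latest, the column names (driver_id, event_timestamp, conv_rate) and the sample rows are all hypothetical and exist only to show the idea of keeping one row per entity key.

```python
from datetime import datetime


def materialize_latest(rows, join_key="driver_id", timestamp_field="event_timestamp"):
    """Keep only the most recent row for each entity key (hypothetical helper)."""
    latest = {}
    for row in rows:
        key = row[join_key]
        # Replace the stored row only when this row is strictly newer.
        if key not in latest or row[timestamp_field] > latest[key][timestamp_field]:
            latest[key] = row
    return latest


# Hypothetical batch rows; the values are illustrative only.
batch_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "conv_rate": 0.52},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "conv_rate": 0.61},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 12, 9), "conv_rate": 0.33},
]

online_view = materialize_latest(batch_rows)
```

In this sketch, driver 1001 ends up with a single row (the 10:00 one), and the earlier 08:00 value is no longer available for online lookups.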

Batch Materialization Engine

A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.

A materialization engine abstracts over the specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (the default LocalMaterializationEngine) or to delegate materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).

If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see this guide for more details.

Please see feature_store.yaml for configuring engines.

Provider

A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).

Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
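
As a rough illustration of how provider defaults and overrides interact, the following sketch resolves the stores for a given provider. The PROVIDER_DEFAULTS table and the resolve_stores function are hypothetical and are not part of Feast; they only model the defaults named in this documentation for the local and gcp providers.

```python
# Hypothetical table of provider defaults, limited to what this page states.
PROVIDER_DEFAULTS = {
    "local": {"offline_store": "file", "online_store": "sqlite"},
    "gcp": {"offline_store": "bigquery", "online_store": "datastore"},
}


def resolve_stores(provider, overrides=None):
    """Start from the provider's defaults, then apply any explicit overrides."""
    resolved = dict(PROVIDER_DEFAULTS[provider])
    resolved.update(overrides or {})
    return resolved


# Use the gcp provider, but swap Datastore for Redis as the online store.
config = resolve_stores("gcp", {"online_store": "redis"})
```

Here the offline store stays at the provider default (BigQuery), while the online store is replaced by the override.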

If the built-in providers are not sufficient, you can create your own custom provider. Please see this guide for more details.

Please see feature_store.yaml for configuring providers.

Third party integrations

We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.

See

In order for a plugin integration to be highlighted, it must meet the following requirements:

The plugin must have tests. Ideally it would use the Feast universal tests (see this for an example), but custom tests are fine.

Tutorials

Sample use-case tutorials

These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.

Driver ranking
Fraud detection on GCP
Real-time credit scoring on AWS
Driver stats on Snowflake

Driver ranking

Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.

In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.

This tutorial guides you on how to use Feast with . You will learn how to:

Train a model locally (on your laptop) using data from
Test the model for online inference using

Fraud detection on GCP

Fraud detection is a common use case in machine learning. This tutorial builds an end-to-end, production-ready fraud prediction system that predicts in real time whether a transaction made by a user is fraudulent.

Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our end-to-end example will perform the following workflows:

Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model

Real-time credit scoring on AWS

Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.

In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.

This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.

This end-to-end tutorial will take you through the following steps:

Deploying S3 with Parquet as your primary data source, containing both

Building streaming features

Feast supports registering streaming feature views and Kafka and Kinesis streaming sources. It also provides an interface for stream processing called the Stream Processor. An example Kafka/Spark StreamProcessor is implemented in the contrib folder. For more details, please see the RFC.

Please see here for a tutorial on how to build a versioned streaming pipeline that registers your transformations, features, and data sources in Feast.

How-to Guides

Running Feast with Snowflake/GCP/AWS

Install Feast
Create a feature repository
Deploy a feature store
Build a training dataset
Load data into the online store
Read features from the online store
Scaling Feast
Structuring Feature Repos

Install Feast

Install Feast using pip:

pip install feast

Install Feast with Snowflake dependencies (required when using Snowflake):

pip install 'feast[snowflake]'

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

pip install 'feast[gcp]'

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

pip install 'feast[aws]'

Install Feast with Redis dependencies (required when using Redis, either through AWS ElastiCache or independently):

pip install 'feast[redis]'

Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See for a detailed explanation of feature repositories.

The easiest way to create a new feature repository is to use the feast init command:

The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:

Deploy a feature store

The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.

To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:

Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.

Load data into the online store

Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.

Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.

The materialize command allows users to materialize features over a specific historical time range into the online store.

The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.

It is also possible to materialize for specific feature views by using the -v / --views argument.

The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
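
Because the command is stateless, the scheduler has to remember where the previous run stopped. The sketch below shows one way this bookkeeping could look. The helper names (next_materialize_window, build_materialize_command), the one-day maximum window, and the view name driver_hourly_stats are assumptions made for illustration; the sketch also assumes that timestamps are passed in ISO 8601 form and that -v can be repeated once per feature view. It only builds the argument list and does not run the command.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def next_materialize_window(last_end, now, max_window=timedelta(days=1)):
    """Return the (start, end) range that has not been materialized yet.

    `last_end` is state kept by the scheduler, not by Feast.
    """
    end = min(now, last_end + max_window)
    return last_end, end


def build_materialize_command(start, end, views=None):
    """Build the argument list for a `feast materialize` invocation."""
    command = [
        "feast",
        "materialize",
        start.strftime(TIMESTAMP_FORMAT),
        end.strftime(TIMESTAMP_FORMAT),
    ]
    # Optionally restrict the run to specific feature views.
    for view in views or []:
        command += ["-v", view]
    return command


start, end = next_materialize_window(
    last_end=datetime(2021, 4, 7), now=datetime(2021, 4, 8, 12, 0)
)
command = build_materialize_command(start, end, views=["driver_hourly_stats"])
```

A scheduler task could pass the resulting list to subprocess.run from inside the feature repository and, on success, store end as the new last_end for the next run.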

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Please ensure that you have materialized (loaded) your feature values into the online store before starting.

Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.

Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.
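
The two steps above can be sketched as follows. The feature references and the driver_id entity key are assumptions borrowed from a typical driver-statistics example, and the helper functions read_online_features and open_store are hypothetical wrappers. The Feast import is deferred inside open_store so that the rest of the sketch can be read and exercised without a feature repository.

```python
def open_store(repo_path="."):
    """Create a FeatureStore for the feature repository at `repo_path`.

    Requires Feast to be installed; the import is deferred on purpose.
    """
    from feast import FeatureStore

    return FeatureStore(repo_path=repo_path)


def read_online_features(store, features, entity_rows):
    """Look up the latest feature values for each entity row.

    `store` is expected to be a FeatureStore, or any object exposing a
    compatible get_online_features method.
    """
    response = store.get_online_features(features=features, entity_rows=entity_rows)
    return response.to_dict()


# Feature references use the "<feature_view>:<feature>" form.
# These names are illustrative and must match your own feature views.
features = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]

# One dictionary per entity to look up, keyed by the entity's join key.
entity_rows = [{"driver_id": 1001}]
```

Inside a feature repository whose features have been materialized, read_online_features(open_store(), features, entity_rows) would return a dictionary of lists, one entry per requested feature and join key.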

Scaling Feast

Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap such components with more performant components, at the cost of Feast depending on additional infrastructure.

The default Feast registry is a file-based registry. Any changes to the feature repo, or materializing data into the online store, results in a mutation to the registry.

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this creates a risk of data loss, or a bottleneck on registry writes, since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
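
The data-loss risk can be shown with a deterministic toy model. This is a conceptual sketch only: the storage dictionary stands in for the registry file, and the field name last_materialized and the view names are hypothetical. Two writers each read the whole registry, change a different field, and write the whole registry back.

```python
import copy


def read_registry(storage):
    """Read the whole registry, as a whole-file registry would."""
    return copy.deepcopy(storage["registry"])


def write_registry(storage, registry):
    """Overwrite the whole registry in one shot."""
    storage["registry"] = copy.deepcopy(registry)


storage = {
    "registry": {
        "view_a": {"last_materialized": None},
        "view_b": {"last_materialized": None},
    }
}

# Both writers read the same snapshot before either one writes.
snapshot_1 = read_registry(storage)
snapshot_2 = read_registry(storage)

# Each writer updates a different feature view.
snapshot_1["view_a"]["last_materialized"] = "2021-04-08"
snapshot_2["view_b"]["last_materialized"] = "2021-04-08"

write_registry(storage, snapshot_1)
write_registry(storage, snapshot_2)  # silently discards writer 1's update

final_registry = storage["registry"]
```

After both writes, view_b carries its update but view_a is back to None, even though the two writers never touched the same field. Avoiding this requires either serializing all writers or using a store that supports fine-grained, transactional updates.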

The recommended solution in this case is to use the SQL registry, which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL or Postgres).

The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale to large data sets, since it is executed in a single process.

Customizing Feast

Feast is highly pluggable and configurable:

One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built-in options. See reference documentation for details.
The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them.

Below are some guides on how to add new custom components:

Adding a new offline store
Adding a new online store
Adding a custom batch materialization engine
Adding a custom provider

Reference

Type System

Feast uses an internal type system to provide guarantees on training and serving data. Feast currently supports eight primitive types - INT32, INT64, FLOAT32, FLOAT64, STRING, BYTES, BOOL, and UNIX_TIMESTAMP - and the corresponding array types. Null types are not supported, although the UNIX_TIMESTAMP type is nullable. The type system is controlled by Value.proto in protobuf and by types.py in Python. Type conversion logic can be found in type_map.py.

Data sources

Please see Data Source for a conceptual explanation of data sources.

Overview
File
Snowflake
BigQuery
Redshift
Push
Kafka
Kinesis
Spark (contrib)
PostgreSQL (contrib)
Trino (contrib)
Azure Synapse + Azure SQL (contrib)

File

Description

File data sources are files on disk or on S3. Currently only Parquet files are supported.

FileSource is meant for development purposes only and is not optimized for production use.

Example

The full set of configuration options is available here.

Supported Types

File data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Snowflake

Snowflake data sources are Snowflake tables or views. These can be specified either by a table reference or a SQL query.

Using a table reference:

Using a query:

The full set of configuration options is available here.

Snowflake data sources support all eight primitive types. Array types are also supported but not with type inference. For a comparison against other batch data sources, please see here.

BigQuery

BigQuery data sources are BigQuery tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Using a table reference:

Using a query:

The full set of configuration options is available here.

BigQuery data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Redshift

Redshift data sources are Redshift tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Using a table name:

Using a query:

The full set of configuration options is available here.

Redshift data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.

Spark (contrib)

Spark data sources are tables or files that can be loaded from some Spark store (e.g. Hive or in-memory). They can also be specified by a SQL query.

The Spark data source does not achieve full test coverage. Please do not assume complete stability.

Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):

Using a query:

Using a file reference:

The full set of configuration options is available here.

Spark data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

PostgreSQL (contrib)

PostgreSQL data sources are PostgreSQL tables or views. These can be specified either by a table reference or a SQL query.

The PostgreSQL data source does not achieve full test coverage. Please do not assume complete stability.

Defining a Postgres source:

The full set of configuration options is available here.

PostgreSQL data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Trino (contrib)

Trino data sources are Trino tables or views. These can be specified either by a table reference or a SQL query.

The Trino data source does not achieve full test coverage. Please do not assume complete stability.

Defining a Trino source:

The full set of configuration options is available here.

Trino data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.

Azure Synapse + Azure SQL (contrib)

Description

MsSQL data sources are Microsoft SQL Server table sources. These can be specified either by a table reference or a SQL query.

Disclaimer

The MsSQL data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a MsSQL source:

from feast.infra.offline_stores.contrib.mssql_offline_store.mssqlserver_source import (
    MsSqlServerSource,
)

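# Name of the SQL Server table that holds the hourly driver statistics.
driver_hourly_table = "driver_hourly"

driver_source = MsSqlServerSource(
    table_ref=driver_hourly_table,
    event_timestamp_column="datetime",  # column holding each row's event time
    created_timestamp_column="created",  # column holding each row's creation time
)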

Offline stores

Please see Offline Store for a conceptual explanation of offline stores.

Overview
File
Snowflake
BigQuery
Redshift
Spark (contrib)
PostgreSQL (contrib)
Trino (contrib)
Azure Synapse + Azure SQL (contrib)

Online stores

Please see Online Store for an explanation of online stores.

Overview
SQLite
Snowflake
Redis
Dragonfly
Datastore
DynamoDB
Bigtable
PostgreSQL (contrib)
Cassandra + Astra DB (contrib)
MySQL (contrib)
Rockset (contrib)
Hazelcast (contrib)
ScyllaDB (contrib)

Providers

Please see Provider for an explanation of providers.

Local

Offline Store: Uses the File offline store by default. Also supports BigQuery as the offline store.
Online Store: Uses the SQLite online store by default. Also supports Redis and Datastore as online stores.

AWS Lambda (alpha)

The AWS Lambda batch materialization engine is considered to be in alpha status. It relies on the offline store to output feature values to S3 via to_remote_storage, and then loads them into the online store.

See for configuration options.

See also for a Dockerfile that can be used below with materialization_image.

Codebase Structure

Let's examine the Feast codebase. This analysis is accurate as of Feast 0.23.

$ tree -L 1 -d
.
├── docs
├── examples
├── go
├── infra
├── java
├── protos
├── sdk
└── ui

Python SDK

The Python SDK lives in sdk/python/feast. The majority of Feast logic lives in these Python files:

The core Feast objects (entities, feature views, data sources, etc.) are defined in their respective Python files, such as entity.py, feature_view.py, and data_source.py.
The FeatureStore class is defined in feature_store.py and the associated configuration object (the Python representation of the feature_store.yaml file) is defined in repo_config.py.
The CLI and other core feature store logic are defined in cli.py and repo_operations.py.
The type system that is used to manage conversion between Feast types and external typing systems is managed in type_map.py.
The Python feature server (the server that is started through the feast serve command) is defined in feature_server.py.

There are also several important submodules:

infra/ contains all the infrastructure components, such as the provider, offline store, online store, batch materialization engine, and registry.
dqm/ covers data quality monitoring, such as the dataset profiler.
diff/ covers the logic for determining how to apply infrastructure changes upon feature repo changes (e.g. the output of

Of these submodules, infra/ is the most important. It contains the interfaces for the provider, offline store, online store, batch materialization engine, and registry, as well as all of their individual implementations.

The tests for the Python SDK are contained in sdk/python/tests. For more details, see this overview of the test suite.

Let's walk through how feast apply works by tracking its execution across the codebase.

All CLI commands are in cli.py. Most of these commands are backed by methods in repo_operations.py. The feast apply command triggers apply_total_command, which then calls apply_total in repo_operations.py.
With a FeatureStore

At this point, the feast apply command is complete.

Let's walk through how feast materialize works by tracking its execution across the codebase.

The feast materialize command triggers materialize_command in cli.py, which then calls FeatureStore.materialize from feature_store.py.
This then calls Provider.materialize_single_feature_view, which can be found in infra/provider.py

Let's walk through how get_historical_features works by tracking its execution across the codebase.

We start with FeatureStore.get_historical_features in feature_store.py. This method does some internal preparation, and then delegates the actual execution to the underlying provider by calling Provider.get_historical_features, which can be found in infra/provider.py.
As with feast apply, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.get_historical_features

The java/ directory contains the Java serving component. See for more details on how the repo is structured.

The go/ directory contains the Go feature server. Most of the files here have logic to help with reading features from the online store. Within go/, the internal/feast/ directory contains most of the core logic:

onlineserving/ covers the core serving logic.
model/ contains the implementations of the Feast objects (entity, feature view, etc.).
- For example, entity.go

Feast uses Protocol Buffers to store serialized versions of the core Feast objects. The protobuf definitions are stored in protos/feast.

The registry consists of the serialized representations of the Feast objects.

Typically, changes being made to the Feast objects require changes to their corresponding protobuf representations. The usual best practices for making changes to protobufs should be followed to ensure backwards and forwards compatibility.

The ui/ directory contains the Web UI. See for more details on the structure of the Web UI.

FAQ

Don't see your question?

We encourage you to ask questions on Slack or GitHub. Even better, once you get an answer, add the answer to this FAQ via a pull request!

Getting started

Do you have any examples of how Feast should be used?

The quickstart is the easiest way to learn about Feast. For more detailed tutorials, please check out the tutorials page.

Concepts

Do feature views have to include entities?

No, there are feature views without entities.

How does Feast handle model or feature versioning?

Feast expects that each version of a model corresponds to a different feature service.

Once feature views are used by a feature service, they are intended to be immutable and should not be deleted (until the feature service is removed). In the future, feast plan and feast apply will throw errors if they see this kind of behavior.

What is the difference between data sources and the offline store?

The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see data sources and offline store for more details.

Yes, this is possible. For example, you can use BigQuery as an offline store and Redis as an online store.

Feast does not provide a way to do this right now. This is an area we're actively interested in contributions for. See

Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).

It is a good idea, though, to lock down the registry file so that only the CI/CD pipeline can modify it. That way data scientists and other users cannot accidentally modify the registry and lose other teams' data.

Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support push-based ingestion. Feast also defines a stream processor that allows a deeper integration with stream sources.

There are several kinds of transformations:

On demand transformations (see [Alpha] On demand feature views)
- These transformations are Pandas transformations run on batch data when you call get_historical_features and at online serving time when you call get_online_features.
- Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well

Yes. See .

A feature view can be defined with multiple entities. Since each entity has a unique join_key, using multiple entities will achieve the effect of a composite key.

Please see a detailed comparison of Feast vs. Tecton . For another comparison, please see .

Feast is designed to work at scale and support low latency online serving. See our for details.

Yes. Specifically:

Simple lists / dense embeddings:
- BigQuery supports list types natively
- Redshift does not support list types, so you'll need to serialize these features into strings (e.g. JSON or protocol buffers)

The list of supported offline and online stores can be found and , respectively. The indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about customizing Feast .

Yes. Using a GCP or AWS provider in feature_store.yaml primarily sets default offline / online stores and configures where the remote registry file can live (Using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.

The data source and the offline store are closely tied, but separate concepts. The offline store controls how Feast talks to a data store for historical feature retrieval, and the data source points to a specific table (or query) within a data store. Offline stores are infrastructure-level connectors to data stores like Snowflake.

Additional differences:

Data sources may be specific to a project (e.g. feed ranking), but offline stores are agnostic and used across projects.
A Feast project may define several data sources that power different feature views, but a Feast project has a single offline store.
Feast users typically need to define data sources when using Feast, but only need to use/configure existing offline stores without creating new ones.

Please follow the instructions .

Yes. For example, the Postgres connector can be used as both an offline and online store (as well as the registry).

Yes. There are two ways to use S3 in Feast:

Using Redshift as a data source via Spectrum (), and then continuing with the guide. See a we did on this at our apply() meetup.
Using the s3_endpoint_override in a FileSource data source. This endpoint is more suitable for quick proof of concepts that won't necessarily scale for production use cases.

Please see the .

For more details on contributing to the Feast community, see and this .

Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this for more details.

Please see this . If you have any questions or suggestions, feel free to leave a comment on the document!

Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the .

Feature view

Feature views

Note: Feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps.

A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see [Alpha] On demand feature views).

Feature views consist of:

a data source
zero or more entities
- If the features are not related to a specific object, the feature view might not have entities; see below.
a name to uniquely identify this feature view in the project.
(optional, but recommended) a schema specifying one or more features (without this, Feast will infer the schema by reading from the data source)
(optional, but recommended) metadata (for example, description, or other free-form metadata via tags)
(optional) a TTL, which limits how far back Feast will look when generating historical datasets
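
To make the effect of the TTL concrete, the sketch below performs a simplified point-in-time lookup: for a given entity key and timestamp, it returns the newest feature row that is not in the future and not older than the TTL. This is a conceptual illustration rather than Feast's implementation. The function point_in_time_lookup, the column names, and the sample rows are hypothetical, and treating a row exactly one TTL old as still valid is an assumption.

```python
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_key, as_of, ttl=None,
                         join_key="driver_id", timestamp_field="event_timestamp"):
    """Return the newest row for `entity_key` at or before `as_of`, within `ttl`."""
    best = None
    for row in feature_rows:
        ts = row[timestamp_field]
        # Skip other entities and rows from the future (avoids data leakage).
        if row[join_key] != entity_key or ts > as_of:
            continue
        # Skip rows that are older than the TTL allows.
        if ttl is not None and as_of - ts > ttl:
            continue
        if best is None or ts > best[timestamp_field]:
            best = row
    return best


# Hypothetical feature rows for a single driver.
feature_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "conv_rate": 0.52},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "conv_rate": 0.61},
]

at_nine = point_in_time_lookup(feature_rows, 1001, datetime(2021, 4, 12, 9))
at_noon = point_in_time_lookup(feature_rows, 1001, datetime(2021, 4, 12, 12))
at_noon_short_ttl = point_in_time_lookup(
    feature_rows, 1001, datetime(2021, 4, 12, 12), ttl=timedelta(hours=1)
)
```

In this sketch the 09:00 lookup sees only the 08:00 row, the 12:00 lookup sees the 10:00 row, and the 12:00 lookup with a one-hour TTL finds nothing because the newest row is two hours old.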

Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.

Feature views are used during

The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from .
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.

If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only timestamps are needed for this feature view).

If the schema parameter is not specified in the creation of the feature view, Feast will infer the features during feast apply by creating a Field for each column in the underlying data source except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
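
The inference rule described above can be sketched in plain Python. This is a simplified illustration, not Feast's code: `infer_fields` and the string type names are hypothetical stand-ins for Feast's Field objects and types.

```python
def infer_fields(columns, join_keys, timestamp_columns):
    """Return (name, dtype) pairs for every column that is neither an
    entity join key nor a timestamp column of the data source."""
    excluded = set(join_keys) | set(timestamp_columns)
    return [(name, dtype) for name, dtype in columns.items() if name not in excluded]


# Columns of a hypothetical driver statistics source.
columns = {
    "driver_id": "int64",
    "event_timestamp": "timestamp",
    "created": "timestamp",
    "conv_rate": "float32",
    "avg_daily_trips": "int64",
}

fields = infer_fields(
    columns,
    join_keys=["driver_id"],
    timestamp_columns=["event_timestamp", "created"],
)
```

Here only `conv_rate` and `avg_daily_trips` become features, each keeping the name and type of its source column.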

"Entity aliases" can be specified to join entity_dataframe columns that do not match the column names in the source table of a FeatureView.

This could be used if a user has no control over these column names or if multiple entities are subclasses of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.

Rather than registering each new copy, we suggest that you dynamically specify the new FeatureView name using .with_name and the join_key_map override using .with_join_key_map.
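
The effect of a join key map can be sketched without the Feast SDK. The snippet below is a conceptual illustration only: `resolve_join_keys` is a hypothetical helper, and the direction of the mapping (the feature view's join key mapped to the column name in the entity dataframe) is an assumption of this sketch.

```python
def resolve_join_keys(entity_row, join_key_map):
    """Translate aliased entity dataframe columns into the join keys that the
    feature view's source table actually uses."""
    return {view_key: entity_row[alias] for view_key, alias in join_key_map.items()}


# One row of an entity dataframe with two aliases of a "location" entity.
trip = {"origin_id": 1, "destination_id": 7}

origin_lookup = resolve_join_keys(trip, {"location_id": "origin_id"})
destination_lookup = resolve_join_keys(trip, {"location_id": "destination_id"})
```

Both lookups target the same `location_id` key, so a single location feature view can serve the origin and the destination of a trip under two different names.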

A field or feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made in an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month. Supported types for fields in Feast can be found in sdk/python/feast/types.py.

Fields are defined as part of feature views. Since Feast does not transform data, a field is essentially a schema that only contains a name and a type:

Together with , they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using .

Feature names must be unique within a feature view.

Each field can have additional metadata associated with it, specified as key-value tags.

On demand feature views allow data scientists to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic, which is executed in both the historical retrieval and online retrieval paths.

Currently, these transformations are executed locally. This is fine for online serving, but does not scale well to offline retrieval.

This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could

Call get_historical_features to generate a training dataframe
Iterate in notebook on feature engineering in Pandas
Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository
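
The reason this workflow is safe is that one piece of transformation logic runs in both retrieval paths. A minimal plain-Python sketch of that idea follows; the function and feature names are hypothetical, and real on demand feature views operate on Feast objects rather than dictionaries.

```python
def add_conv_rate_plus_val(row):
    """Transformation shared by both retrieval paths."""
    return {**row, "conv_rate_plus_val": row["conv_rate"] + row["val_to_add"]}


def historical_retrieval(rows):
    # Offline path: apply the transformation to every row of a training dataset.
    return [add_conv_rate_plus_val(row) for row in rows]


def online_retrieval(stored_features, request_data):
    # Online path: combine stored features with request-time data, then transform.
    return add_conv_rate_plus_val({**stored_features, **request_data})


training_rows = historical_retrieval([{"conv_rate": 0.5, "val_to_add": 1}])
online_row = online_retrieval({"conv_rate": 0.5}, {"val_to_add": 1})
```

Because both paths call the same function, the value computed for training matches the value computed at serving time.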

A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.

Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:

See for an example of how to use stream feature views to register your own streaming data pipelines in Feast.

Running Feast in production (e.g. on Kubernetes)

Overview

After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise (e.g. on Kubernetes).

A typical production architecture looks like:

Important note: Feast is highly customizable and modular.

Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.

For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.

In this guide we will show you how to:

Deploy your feature store and keep your infrastructure in sync with your feature repository
Keep the data in your online store up to date (from batch and stream sources)
Use Feast for model training and serving

The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running feast apply commits feature definitions to a registry, which users can then read elsewhere.

Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available .

Note: A SQL-based registry primarily works with a Python feature server. The Java feature server does not understand this registry type yet.

We typically recommend setting up CI/CD to automatically run feast plan and feast apply when pull requests are opened or merged.

A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set).

Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.

Different options are presented in the .

To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.

Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.
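
Conceptually, in-process materialization amounts to the following plain-Python sketch. It is an illustration under assumptions, not the engine's code: a list of dictionaries stands in for the offline store, a dictionary keyed by join key stands in for the online store, and `materialize` is a hypothetical name.

```python
from datetime import datetime


def materialize(offline_rows, online_store, start, end, join_key="driver_id"):
    """Load every row in [start, end] into memory and keep the latest row per key."""
    in_range = [r for r in offline_rows if start <= r["event_timestamp"] <= end]
    # Writing in timestamp order means newer rows overwrite older ones.
    for row in sorted(in_range, key=lambda r: r["event_timestamp"]):
        online_store[row[join_key]] = row
    return online_store


offline_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 1), "conv_rate": 0.25},
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 2), "conv_rate": 0.75},
    {"driver_id": 1002, "event_timestamp": datetime(2024, 1, 1), "conv_rate": 0.5},
]

online_store = materialize(offline_rows, {}, datetime(2024, 1, 1), datetime(2024, 1, 2))
```

The `in_range` list holds all selected rows at once, which is why memory use grows with the amount of data being materialized.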

This approach may not scale to large amounts of data, which users of Feast may be dealing with in production. In this case, we recommend using one of the more , such as the , or the . Users may also need to to work on their existing infrastructure.

The Bytewax materialization engine can run materialization on an existing Kubernetes cluster. An example configuration of this in a feature_store.yaml is as follows:

See also for code snippets

It is up to you to orchestrate and schedule runs of materialization.

Feast keeps the history of materialization in its registry so that the choice could be as simple as a . Cron util should be sufficient when you have just a few materialization jobs (it's usually one materialization job per feature view) triggered infrequently.

However, the amount of work can quickly outgrow the resources of a single machine. That happens because the materialization job needs to repackage all rows before writing them to an online store. That leads to high utilization of CPU and memory. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.

If you are using Airflow as a scheduler, Feast can be invoked through a after the has been installed into a virtual environment and your feature repo has been synced:

You can see more in an example at .

See more details at , which shows how to ingest streaming features or 3rd party feature data via a push API.

This supports pushing feature values into Feast to the online store, the offline store, or both.

Feast does not orchestrate batch transformation DAGs. For this, you can rely on tools like Airflow + dbt. See for an example and some tips.

For more details, see

After we've defined our features and data sources in the repository, we can generate training datasets. We highly recommend you use a FeatureService to version the features that go into a specific model version.

The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.
- One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.

The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the feature service name and the model versions have some established convention.

For example, in MLflow:
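
One possible convention can be sketched in plain Python without calling MLflow or Feast. Both helpers are hypothetical, and the `<model>_v<version>` pattern is only one option; the `models:/<name>/<version>` form is the URI scheme MLflow uses for registered models.

```python
def feature_service_name(model_name, model_version):
    """Feature service that a given model version was trained against."""
    return f"{model_name}_v{model_version}"


def registered_model_uri(model_name, model_version):
    """URI of the same model version in the MLflow model registry."""
    return f"models:/{model_name}/{model_version}"


model_name, model_version = "my-model", 1
service = feature_service_name(model_name, model_version)
uri = registered_model_uri(model_name, model_version)
```

With such a convention, serving code can derive the feature service to query from the model version it has loaded, so the two cannot drift apart silently.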

Once you have successfully loaded data from batch / streaming sources into the online store, you can start consuming features for model inference.

This approach is the most convenient way to keep your infrastructure minimal and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc.), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.

To integrate online retrieval into your service use the following code:

To deploy a Feast feature server on Kubernetes, you can use the included (which also has detailed instructions and an example tutorial).

Basic steps

Install and
Add the Feast Helm repository and download the latest charts:

Run Helm Install

This will deploy a single service. The service must have read access to the registry file on cloud storage and to the online store (e.g. via ). It will keep a copy of the registry in memory and periodically refresh it, so expect some delay in update propagation in exchange for better performance.
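
The trade-off between propagation delay and performance comes from caching. The sketch below shows the general pattern in plain Python; it is not the feature server's implementation, and `CachedRegistry`, its parameters, and the fake clock are all hypothetical.

```python
import time


class CachedRegistry:
    """Keeps one in-memory copy of the registry and reloads it after ttl_seconds."""

    def __init__(self, load_fn, ttl_seconds, clock=time.monotonic):
        self._load_fn = load_fn
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        self._loaded_at = None  # None means "never loaded"
        self._registry = None

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl_seconds
        if expired:
            self._registry = self._load_fn()
            self._loaded_at = now
        return self._registry


# A fake clock and fake registry versions keep the behaviour easy to follow.
fake_now = [0.0]
versions = iter(["registry-v1", "registry-v2"])
registry = CachedRegistry(lambda: next(versions), ttl_seconds=60, clock=lambda: fake_now[0])

first = registry.get()    # loads registry-v1
fake_now[0] = 30.0
second = registry.get()   # still served from memory
fake_now[0] = 61.0
third = registry.get()    # cache expired, registry reloaded
```

Between refreshes, a change applied to the registry is invisible to the service, which is the delay mentioned above.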

You might want to dynamically set parts of your configuration from your environment, for instance to deploy Feast to production and development with the same configuration but a different server, or to inject secrets without exposing them in your Git repo. To do this, you can use the ${ENV_VAR} syntax in your feature_store.yaml file. For instance:

It is possible to set a default value if the environment variable is not set, with ${ENV_VAR:"default"}. For instance:
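
To make the two forms concrete, here is a small plain-Python sketch of how such placeholders could be expanded. It is not Feast's parser: `expand_env`, the regular expression, the variable name `REDIS_CONNECTION_STRING`, and the choice to raise an error for an unset variable without a default are all assumptions of this example.

```python
import os
import re

# Matches ${NAME} and ${NAME:"default"}.
_PLACEHOLDER = re.compile(r'\$\{(\w+)(?::"([^"]*)")?\}')


def expand_env(text, env=os.environ):
    """Replace each placeholder with the variable's value or its default."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set")
    return _PLACEHOLDER.sub(replace, text)


yaml_text = (
    "online_store:\n"
    "  type: redis\n"
    '  connection_string: ${REDIS_CONNECTION_STRING:"localhost:6379"}\n'
)

dev_config = expand_env(yaml_text, env={})
prod_config = expand_env(yaml_text, env={"REDIS_CONNECTION_STRING": "redis.prod:6379"})
```

The same file then yields a local connection string in development and the injected one in production, without the secret or hostname being committed.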

In summary, the overall architecture in production may look like:

The Feast SDK is triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast database-backed registry.
Data ingestion
- Batch data: Airflow manages batch transformation jobs + materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine

Adding or reusing tests

Overview

This guide will go over:

how Feast tests are set up
how to extend the test suite to test new functionality
how to use the existing test suite to test a new custom offline / online store

Test suite overview

Unit tests are contained in sdk/python/tests/unit. Integration tests are contained in sdk/python/tests/integration. Let's inspect the structure of sdk/python/tests/integration:

feature_repos has setup files for most tests in the test suite.
conftest.py (in the parent directory) contains the most common fixtures, which are designed as an abstraction on top of specific offline/online stores, so tests do not need to be rewritten for different stores. Individual test files also contain more specific fixtures.
The tests are organized by which Feast component(s) they test.

The universal feature repo refers to a set of fixtures (e.g. environment and universal_data_sources) that can be parametrized to cover various combinations of offline stores, online stores, and providers. This allows tests to run against all these various combinations without requiring excess code. The universal feature repo is constructed by fixtures in conftest.py with help from the various files in feature_repos.
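
The parametrization idea can be sketched in plain Python. The real suite relies on pytest fixtures in conftest.py; the store lists and helper names below are hypothetical and only show how one test body can cover every combination.

```python
from itertools import product

OFFLINE_STORES = ["file", "bigquery"]
ONLINE_STORES = ["sqlite", "redis"]


def all_environments():
    """Every offline/online store combination, one environment per pair."""
    return [
        {"offline_store": offline, "online_store": online}
        for offline, online in product(OFFLINE_STORES, ONLINE_STORES)
    ]


def run_everywhere(test_fn):
    """Run a single test function once per environment and collect the results."""
    return {
        (env["offline_store"], env["online_store"]): test_fn(env)
        for env in all_environments()
    }


# A trivial stand-in for a test body that receives the environment.
results = run_everywhere(lambda env: env["offline_store"] != env["online_store"])
```

Adding a store to either list extends coverage to all new combinations without touching the test body.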

Tests in Feast are split into integration and unit tests. If a test requires external resources (e.g. cloud resources on GCP or AWS), it is an integration test. If a test can be run purely locally (where locally includes Docker resources), it is a unit test.

Integration tests test non-local Feast behavior. For example, tests that require reading data from BigQuery or materializing data to DynamoDB are integration tests. Integration tests also tend to involve more complex Feast functionality.
Unit tests test local Feast behavior. For example, tests that only require registering feature views are unit tests. Unit tests tend to only involve simple Feast functionality.

E2E tests
- E2E tests test end-to-end functionality of Feast over the various codepaths (initialize a feature store, apply, and materialize).
- The main codepaths include:

Registry Diff Tests
- These are tests for the infrastructure and registry diff functionality that Feast uses to determine whether changes to the registry or infrastructure are needed.
Local CLI Tests and Local Feast Tests

Docstring tests are primarily smoke tests to make sure imports and setup functions can be executed without errors.

Let's look at a sample test using the universal repo:

The key fixtures are the environment and universal_data_sources fixtures, which are defined in the feature_repos directories and the conftest.py file. This by default pulls in a standard dataset with driver and customer entities (that we have pre-defined), certain feature views, and feature values.
- The environment fixture sets up a feature store, parametrized by the provider and the online/offline store. It allows the test to query against that feature store without needing to worry about the underlying implementation or any setup that may be involved in creating instances of these datastores.

Use the same function signatures as an existing test (e.g. use environment and universal_data_sources as an argument) to include the relevant test fixtures.
If possible, expand an individual test instead of writing a new test, due to the cost of starting up offline / online stores.
Use the universal_offline_stores

Install Feast in editable mode with pip install -e . (run from the root of your Feast checkout).
The core tests for offline / online store behavior are parametrized by the FULL_REPO_CONFIGS variable defined in feature_repos/repo_configuration.py. To overwrite this variable without modifying the Feast repo, create your own file that contains a FULL_REPO_CONFIGS (which will require adding a new IntegrationTestRepoConfig or two) and set the environment variable FULL_REPO_CONFIGS_MODULE to point to that file.
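
An override of this kind typically works by importing the module named in the environment variable. The sketch below illustrates that pattern with the standard library only; `load_full_repo_configs`, the default value, and the module name `my_store_test_configs` are hypothetical, and the test suite's actual loading code may differ.

```python
import importlib
import os
import sys
import types

DEFAULT_FULL_REPO_CONFIGS = ["default-config"]  # placeholder for the built-in list


def load_full_repo_configs(env=os.environ):
    """Use FULL_REPO_CONFIGS from the module named by the env var, if it is set."""
    module_name = env.get("FULL_REPO_CONFIGS_MODULE")
    if not module_name:
        return DEFAULT_FULL_REPO_CONFIGS
    module = importlib.import_module(module_name)
    return getattr(module, "FULL_REPO_CONFIGS")


# Stand-in for a user-written file; registering it in sys.modules makes it importable.
custom = types.ModuleType("my_store_test_configs")
custom.FULL_REPO_CONFIGS = ["my-offline-store-config"]
sys.modules["my_store_test_configs"] = custom

overridden = load_full_repo_configs({"FULL_REPO_CONFIGS_MODULE": "my_store_test_configs"})
default = load_full_repo_configs({})
```

In practice the custom module would live in a file on the Python path rather than being created in memory.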

Many problems arise when implementing your data store's type conversion to interface with Feast datatypes.

You will need to correctly update inference.py so that Feast can infer your datasource schemas
You also need to update type_map.py so that Feast knows how to convert your datastore's types to Feast-recognized types in feast/types.py.
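
A type conversion of this kind is usually a lookup table plus explicit handling of unknown types. The sketch below is a hedged illustration: the function name and the store-side type names are hypothetical, and plain strings stand in for Feast's value types.

```python
def my_store_type_to_feast_value_type(store_type):
    """Map a (hypothetical) datastore column type to a Feast value type name."""
    type_map = {
        "BIGINT": "INT64",
        "DOUBLE": "DOUBLE",
        "VARCHAR": "STRING",
        "BOOLEAN": "BOOL",
        "TIMESTAMP": "UNIX_TIMESTAMP",
    }
    try:
        return type_map[store_type.upper()]
    except KeyError:
        # Failing loudly makes unsupported types visible during schema inference.
        raise ValueError(f"unsupported datastore type: {store_type}") from None


inferred = {col: my_store_type_to_feast_value_type(t)
            for col, t in {"driver_id": "bigint", "conv_rate": "double"}.items()}
```

Raising on unknown types, rather than guessing, tends to surface conversion gaps early when the universal tests run against a new store.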

The most important functionality in Feast is historical and online retrieval. Most of the e2e and universal integration tests exercise this functionality in some way. Making sure this functionality works also indirectly asserts that reading from and writing to your datastore works as intended.

Extend data_source_creator.py for your offline store.
In repo_configuration.py add a new IntegrationTestRepoConfig or two (depending on how many online stores you want to test).

This folder is for plugins that are officially maintained with community owners. Place the APIs in feast/infra/offline_stores/contrib/.
Extend data_source_creator.py for your offline store and implement the required APIs.
In contrib_repo_configuration.py

In repo_configuration.py add a new config that maps to a serialized version of configuration you need in feature_store.yaml to set up the online store.
In repo_configuration.py, add new IntegrationTestRepoConfig for online stores you want to test.

Check test_universal_types.py for an example of how to do this.

Install Redis on your computer. On macOS, you should be able to run brew install redis.
- Running redis-server --help and redis-cli --help should show corresponding help menus.

You should be able to run the integration tests and have the Redis cluster tests pass.
If you would like to run your own Redis cluster, you can run the above commands with your own specified ports and connect to the newly configured cluster.
To stop the cluster, run ./infra/scripts/redis-cluster.sh stop and then ./infra/scripts/redis-cluster.sh clean.