
v0.26-branch

Introduction

What is Feast?

Feast (Feature Store) is a customizable operational data system that re-uses existing infrastructure to manage and serve machine learning features to realtime models.

Feast allows ML platform teams to:

Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.

Note: Feast today primarily addresses timestamped structured data.

Who is Feast for?

Feast helps ML platform teams with DevOps experience productionize real-time models. Feast can also help these teams build towards a feature platform that improves collaboration between engineers and data scientists.

Feast is likely not the right tool if you

are in an organization that’s just getting started with ML and is not yet sure what the business impact of ML is
rely primarily on unstructured data
need very low latency feature retrieval (e.g. p99 feature retrieval << 10ms)

What Feast is not?

Feast is not

an ETL / ELT system: Feast is not (and does not plan to become) a general-purpose data transformation or pipelining system. Users often rely on separate tools to manage upstream data transformations.
a data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and on integrations with external tools to make features consistently available.

Feast does not fully solve

reproducible model training / model backtesting / experiment management: Feast captures feature and model metadata, but does not version-control datasets / labels or manage train / test splits. Other dedicated tools are better suited for this.
batch + streaming feature engineering: Feast primarily processes already transformed feature values (though it offers experimental lightweight transformations). Users usually integrate Feast with upstream systems (e.g. existing ETL/ELT pipelines). Teams with these needs may require a more fully featured feature platform.

Example use cases

Many companies have used Feast to power real-world ML use cases such as:

Personalizing online recommendations by leveraging pre-computed historical user or item features.
Online fraud detection, using features that compare against (pre-computed) historical transaction patterns
Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch

How can I get started?

The best way to learn Feast is to use it. Join the community, head over to the Quickstart, and try it out!

Explore the following resources to get started with Feast:

Quickstart is the fastest way to get started with Feast.
Concepts describes all important Feast API concepts.
Architecture describes Feast's overall architecture.

Community & getting help

Links & Resources

GitHub repository: Find the complete Feast codebase on GitHub.

Roadmap

The list below contains the functionality that contributors are planning to develop for Feast.

We welcome contributions to all items in the roadmap!
Have questions about the roadmap? Go to the Slack channel to ask on #feast-development.

Getting started

Concepts

Overview, Data ingestion, Entity, Feature view, Feature retrieval, Point-in-time joins, Registry, [Alpha] Saved dataset

Overview

Feast project structure

The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.

Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).
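The namespacing idea can be illustrated with a small sketch. The helper name `namespaced_table` and the `<project>_<feature view>` naming scheme are assumptions made for illustration; the actual table names depend on the store implementation in use.

```python
def namespaced_table(project: str, feature_view: str) -> str:
    """Build a per-project table name (hypothetical naming scheme)."""
    return f"{project}_{feature_view}"


# The same feature view name in two projects maps to two different tables,
# so the projects never share storage.
dev_table = namespaced_table("ranking_dev", "driver_hourly_stats")
prod_table = namespaced_table("ranking_prod", "driver_hourly_stats")
print(dev_table)
print(prod_table)
```

Because every table carries its project prefix, a dev and a prod project can live on the same infrastructure without colliding.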

Data ingestion

For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.

For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline and online. We explore this more in the next concept page.

Feature registration and retrieval

Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).

Feast supports several patterns of feature retrieval.

Use case

Example

API

[Alpha] Saved dataset

Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. was the primary motivation for creating dataset concept.

Dataset's metadata is stored in the Feast registry and raw data (features, entities, additional input keys and timestamp) is stored in the .

Dataset can be created from:

Results of historical retrieval

Architecture

Overview

Functionality

Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.

Registry

The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.

Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports four different backends.

Local: Used as a local backend for storing the registry during development
S3: Used as a centralized backend for storing the registry on AWS
GCS: Used as a centralized backend for storing the registry on GCP
[Alpha] Azure: Used as a centralized backend for storing the registry on Azure Blob storage.

The feature registry is updated during different operations when using Feast. More specifically, objects within the registry (entities, feature views, feature services) are updated when running apply from the Feast CLI, but metadata about objects can also be updated during operations like materialization.

Users interact with a feature registry through the Feast SDK. Listing all feature views:

Or retrieving a specific feature view:
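Both lookups can be sketched with the Python SDK as follows. This is a minimal sketch that assumes Feast is installed and that `repo_path` points at a feature repository where `feast apply` has already been run; the wrapper function names are hypothetical, while `FeatureStore`, `list_feature_views`, and `get_feature_view` are SDK calls.

```python
def list_feature_view_names(repo_path: str) -> list:
    """Return the names of all feature views registered in a feature repo."""
    # Imported lazily so this module still loads where Feast is not installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    return [feature_view.name for feature_view in store.list_feature_views()]


def get_feature_view(repo_path: str, name: str):
    """Fetch a single feature view definition from the registry by name."""
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    return store.get_feature_view(name)
```

For example, `list_feature_view_names("my_feature_repo/")` would return the registered names, where `my_feature_repo/` is a placeholder path.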

The feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.

Offline store

An offline store is an interface for working with historical time-series feature values that are stored in data sources. The OfflineStore interface has several different implementations, such as the BigQueryOfflineStore, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see Offline Stores.

Offline stores are primarily used for two reasons:

Building training datasets from time-series features.
Materializing (loading) features into an online store to serve those features at low-latency in a production setting.

Offline stores are configured through the feature_store.yaml file. When building training datasets or materializing features into an online store, Feast will use the configured offline store with your configured data sources to execute the necessary data operations.

Only a single offline store can be used at a time. Moreover, offline stores are not compatible with all data sources; for example, the BigQuery offline store cannot be used to query a file-based data source.

Features can also be pushed directly to the offline store in your feature store.

Online store

Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize command.

The storage schema of features within the online store mirrors that of the original data source. One key difference is that for each entity key, only the latest feature values are stored. No historical values are stored.

Here is an example batch data source:

Once the above data source is materialized into Feast (using feast materialize), the feature values will be stored as follows:
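As a rough illustration of the "latest value per entity key" rule, the sketch below uses made-up rows (not the original example data) and a plain dictionary standing in for the online store. The column names and the helper `latest_per_entity` are assumptions.

```python
from datetime import datetime

# Hypothetical batch source rows: (driver_id, event_timestamp, trips_today)
batch_rows = [
    (1001, datetime(2021, 4, 12, 8), 5),
    (1001, datetime(2021, 4, 12, 10), 7),
    (1002, datetime(2021, 4, 12, 9), 2),
    (1002, datetime(2021, 4, 12, 7), 1),
]


def latest_per_entity(rows):
    """Keep only the most recent row for each entity key."""
    latest = {}
    for driver_id, event_ts, trips in rows:
        if driver_id not in latest or event_ts > latest[driver_id][0]:
            latest[driver_id] = (event_ts, trips)
    return latest


online_rows = latest_per_entity(batch_rows)
for driver_id, (event_ts, trips) in sorted(online_rows.items()):
    print(driver_id, event_ts.isoformat(), trips)
```

Four historical rows collapse into two online rows, one per driver, each holding the value with the newest timestamp.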

Features can also be written directly to the online store.

Batch Materialization Engine

A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.

A materialization engine abstracts over the specific technologies or frameworks used to materialize data. It allows users to use a purely local, serialized approach (the default LocalMaterializationEngine) or to delegate materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).
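The abstraction can be pictured with a simplified sketch. The classes below are hypothetical and are not Feast's actual engine interface; `ChunkedEngine` merely stands in for an engine that hands independent chunks of work to separate workers.

```python
from abc import ABC, abstractmethod


class MaterializationEngine(ABC):
    """Hypothetical interface: copy feature rows into an online store."""

    @abstractmethod
    def materialize(self, offline_rows, online_store):
        """Write rows of (entity_key, timestamp, value); return rows processed."""


class LocalEngine(MaterializationEngine):
    """Processes every row serially in the current process."""

    def materialize(self, offline_rows, online_store):
        for entity_key, ts, value in offline_rows:
            current = online_store.get(entity_key)
            if current is None or ts >= current[0]:
                online_store[entity_key] = (ts, value)
        return len(offline_rows)


class ChunkedEngine(MaterializationEngine):
    """Splits the work into chunks, as a remote engine might per worker."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def materialize(self, offline_rows, online_store):
        worker = LocalEngine()  # stand-in for a separate worker
        processed = 0
        for start in range(0, len(offline_rows), self.chunk_size):
            chunk = offline_rows[start:start + self.chunk_size]
            processed += worker.materialize(chunk, online_store)
        return processed


rows = [("d1", 1, 0.5), ("d2", 1, 0.9), ("d1", 2, 0.7)]
local_store, chunked_store = {}, {}
LocalEngine().materialize(rows, local_store)
ChunkedEngine(chunk_size=2).materialize(rows, chunked_store)
print(local_store == chunked_store)
```

The caller only depends on `materialize`, so the engine can be swapped without changing the code that triggers materialization.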

If the built-in engines are not sufficient, you can create your own custom materialization engine.

See the Reference section for details on configuring engines.

Provider

A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).

Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
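The "defaults plus overrides" behavior can be sketched as a simple lookup. The dictionary and the `resolve_stores` helper are illustrative assumptions; in practice these choices are made in feature_store.yaml rather than in Python, and the short store names used here are placeholders.

```python
# Default store pairing per built-in provider, as described in this document.
PROVIDER_DEFAULTS = {
    "local": {"offline_store": "file", "online_store": "sqlite"},
    "gcp": {"offline_store": "bigquery", "online_store": "datastore"},
    "aws": {"offline_store": "redshift", "online_store": "dynamodb"},
}


def resolve_stores(provider, overrides=None):
    """Start from the provider defaults, then apply any explicit overrides."""
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown provider: {provider}")
    resolved = dict(PROVIDER_DEFAULTS[provider])
    resolved.update(overrides or {})
    return resolved


# Use the gcp provider but swap Datastore for Redis as the online store.
gcp_with_redis = resolve_stores("gcp", {"online_store": "redis"})
print(gcp_with_redis)
```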

If the built-in providers are not sufficient, you can create your own custom provider.

See the Providers reference for details on configuring providers.

Third party integrations

We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.

Don't see your offline store or online store of choice here? Check out our guides to make a custom one!

Integrations

See

Standards

In order for a plugin integration to be highlighted, it must meet the following requirements:

The plugin must have tests. Ideally it would use the Feast universal tests, but custom tests are fine.
The plugin must have some basic documentation on how it should be used.
The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).

In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:

The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.
There is documentation and a tutorial on how to use the integration.
The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.

Tutorials

Sample use-case tutorials

These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.

Driver ranking

Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.

In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.

This tutorial guides you on how to use Feast with . You will learn how to:

Fraud detection on GCP

Fraud detection is a common use case in machine learning. This tutorial builds an end-to-end, production-ready fraud prediction system that predicts in real time whether a transaction made by a user is fraudulent.

Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Fraud Detection Example

Our end-to-end example will perform the following workflows:

Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model
Making online predictions from feature data

Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):

Real-time credit scoring on AWS

Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.

In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.

This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.

This end-to-end tutorial will take you through the following steps:

Deploying S3 with Parquet as your primary data source
Deploying Redshift as the interface Feast uses to build training datasets
Registering your features with Feast and configuring DynamoDB for online serving

Building streaming features

Feast supports registering streaming feature views and Kafka and Kinesis streaming sources. It also provides an interface for stream processing called the Stream Processor. An example Kafka/Spark StreamProcessor is implemented in the contrib folder.

A separate tutorial shows how to build a versioned streaming pipeline that registers your transformations, features, and data sources in Feast.

How-to Guides

Running Feast with Snowflake/GCP/AWS

Install Feast, Create a feature repository, Deploy a feature store, Build a training dataset, Load data into the online store, Read features from the online store, Scaling Feast, Structuring Feature Repos

Install Feast

Install Feast using pip:

Install Feast with Snowflake dependencies (required when using Snowflake):

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):

Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML), and it is highly recommended that teams track it centrally using Git.

The easiest way to create a new feature repository is to use the feast init command:

The init

Build a training dataset

Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
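The point-in-time join can be illustrated with a pure-Python sketch. This is not Feast's implementation (which runs inside the offline store's compute engine); the function name, tuple layout, and sample values are assumptions chosen to show the rule that each row only sees feature values observed at or before its own timestamp.

```python
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each (entity_id, event_ts) the latest feature value
    observed at or before event_ts, or None if there is none."""
    joined = []
    for entity_id, event_ts in entity_rows:
        best = None  # (feature_ts, value) of the best candidate so far
        for feat_entity, feat_ts, value in feature_rows:
            if feat_entity != entity_id or feat_ts > event_ts:
                continue  # wrong entity, or a value from the future
            if best is None or feat_ts > best[0]:
                best = (feat_ts, value)
        joined.append((entity_id, event_ts, best[1] if best else None))
    return joined


# Hypothetical feature values for one driver at three points in time.
feature_rows = [
    (1001, datetime(2021, 4, 12, 8), 0.30),
    (1001, datetime(2021, 4, 12, 10), 0.45),
    (1001, datetime(2021, 4, 12, 12), 0.60),
]
# The dataframe to join onto: entity keys plus event timestamps.
entity_rows = [
    (1001, datetime(2021, 4, 12, 11)),
    (1001, datetime(2021, 4, 12, 7)),
]
training_rows = point_in_time_join(entity_rows, feature_rows)
for row in training_rows:
    print(row)
```

The 11:00 row receives the 10:00 value rather than the later 12:00 value, and the 07:00 row receives no value at all, which is what prevents future feature values from leaking into training data.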

Retrieving historical features

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Online stores only maintain the current state of features, i.e., the latest feature values. No historical data is stored or served.

Scaling Feast

Overview

Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap such components with more performant components, at the cost of Feast depending on additional infrastructure.

Scaling Feast Registry

The default Feast registry is file-based. Any change to the feature repo, or materializing data into the online store, results in a mutation to the registry.

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
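The data-loss risk can be reproduced in a few lines. In this sketch a JSON file stands in for the registry file (the real registry is a Protobuf file), and the two "writers" are simulated sequentially so the outcome is deterministic; the keys and values are made up.

```python
import json
import os
import tempfile


def read_registry(path):
    with open(path) as f:
        return json.load(f)


def write_registry(path, registry):
    # The whole file is rewritten, even if only one field changed.
    with open(path, "w") as f:
        json.dump(registry, f)


path = os.path.join(tempfile.mkdtemp(), "registry.json")
write_registry(path, {"feature_views": {}})

# Two writers load the same snapshot before either one writes.
snapshot_a = read_registry(path)
snapshot_b = read_registry(path)

snapshot_a["feature_views"]["driver_stats"] = {"last_materialized": "2021-04-12"}
write_registry(path, snapshot_a)

snapshot_b["feature_views"]["customer_stats"] = {"last_materialized": "2021-04-12"}
write_registry(path, snapshot_b)  # silently overwrites writer A's change

final = read_registry(path)
print(sorted(final["feature_views"]))
```

Only writer B's entry survives. Avoiding this requires serializing all writers, which is the bottleneck described above.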

The recommended solution in this case is to use the SQL registry, which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL or Postgres).
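To see why a database helps, the sketch below stores one row per feature view in an in-memory SQLite database and updates rows individually inside transactions. The table layout is a hypothetical simplification and is not the schema Feast uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_views (name TEXT PRIMARY KEY, last_materialized TEXT)"
)


def record_materialization(conn, name, timestamp):
    """Update a single registry row; other rows are left untouched."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO feature_views (name, last_materialized) "
            "VALUES (?, ?)",
            (name, timestamp),
        )


record_materialization(conn, "driver_stats", "2021-04-12")
record_materialization(conn, "customer_stats", "2021-04-12")
record_materialization(conn, "driver_stats", "2021-04-13")

rows = conn.execute(
    "SELECT name, last_materialized FROM feature_views ORDER BY name"
).fetchall()
print(rows)
```

Each write touches only its own row, so updates to different feature views do not overwrite one another.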

Scaling Materialization

The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale for large data sets, since it is executed in a single process.

Feast supports pluggable batch materialization engines that allow the materialization process to be scaled up. Aside from the default local engine, other engines are available, such as one that delegates to AWS Lambda.

Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.

Reference

Data sources

Please see Data Source for a conceptual explanation of data sources.

Overview, File, Snowflake, BigQuery, Redshift, Push, Kafka, Kinesis, Spark (contrib), PostgreSQL (contrib), Trino (contrib), Azure Synapse + Azure SQL (contrib)

File

Description

File data sources are files on disk or on S3. Currently only Parquet files are supported.

FileSource is meant for development purposes only and is not optimized for production use.

Example

The full set of configuration options is available in the Feast API reference.

Supported Types

File data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, see the data sources overview.

Snowflake

Description

Snowflake data sources are Snowflake tables or views. These can be specified either by a table reference or a SQL query.

BigQuery

Description

BigQuery data sources are BigQuery tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Examples

Using a table reference:

Using a query:

The full set of configuration options is available in the Feast API reference.

Supported Types

BigQuery data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, see the data sources overview.

Redshift

Description

Redshift data sources are Redshift tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Examples

Using a table name:

Using a query:

The full set of configuration options is available in the Feast API reference.

Supported Types

Redshift data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, see the data sources overview.

Spark (contrib)

Description

Spark data sources are tables or files that can be loaded from some Spark store (e.g. Hive or in-memory). They can also be specified by a SQL query.

Disclaimer

The Spark data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):

Using a query:

Using a file reference:

The full set of configuration options is available in the Feast API reference.

Supported Types

Spark data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, see the data sources overview.

PostgreSQL (contrib)

Description

PostgreSQL data sources are PostgreSQL tables or views. These can be specified either by a table reference or a SQL query.

Trino (contrib)

Description

Trino data sources are Trino tables or views. These can be specified either by a table reference or a SQL query.

Disclaimer

The Trino data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a Trino source:

The full set of configuration options is available in the Feast API reference.

Supported Types

Trino data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, see the data sources overview.

Azure Synapse + Azure SQL (contrib)

Description

MsSQL data sources are Microsoft SQL table sources. These can be specified either by a table reference or a SQL query.

Disclaimer

The MsSQL data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a MsSQL source:

Offline stores

Please see Offline store for a conceptual explanation of offline stores.

Providers

Please see Provider for an explanation of providers.

Local, Google Cloud Platform, Amazon Web Services, Azure

Local

Description

Offline Store: Uses the File offline store by default. Also supports BigQuery as the offline store.
Online Store: Uses the Sqlite online store by default. Also supports Redis and Datastore as online stores.

Example

Spark (contrib)

Description

The Spark batch materialization engine is considered alpha. It relies on the offline store to output feature values to S3 via to_remote_storage, and then loads them into the online store.

See the Feast API reference for configuration options.

Quickstart

In this tutorial we will

Deploy a local feature store with a Parquet file offline store and SQLite online store.
Build a training dataset using our time series features from our Parquet files.
Ingest batch features ("materialization") and streaming features (via a Push API) into the online store.
Read the latest features from the offline store for batch scoring.
Read the latest features from the online store for real-time inference.
Explore the (experimental) Feast UI.

Overview

In this tutorial, we'll use Feast to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:

Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
- Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Online feature availability:

Step 1: Install Feast

Install the Feast SDK and CLI using pip:

In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see the how-to guide Running Feast with Snowflake/GCP/AWS.

Step 2: Create a feature repository

Bootstrap a new feature repository using feast init from the command line.

Let's take a look at the resulting demo repo itself. It breaks down into

data/ contains raw demo parquet data
example_repo.py contains demo feature definitions
feature_store.yaml

The feature_store.yaml file configures the key overall architecture of the feature store.

The provider value sets default offline and online stores.

The offline store provides the compute layer to process historical data (for generating training data & feature values for serving).
The online store is a low latency store of the latest feature values (for powering real-time inference).

Valid values for provider in feature_store.yaml are:

local: use a SQL registry or local file registry. By default, use a file / Dask based offline store + SQLite online store
gcp: use a SQL registry or GCS file registry. By default, use BigQuery (offline store) + Google Cloud Datastore (online store)
aws: use a SQL registry or S3 file registry. By default, use Redshift (offline store) + DynamoDB (online store)

Note that there are many other offline / online stores Feast works with, including Spark, Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported data sources.

A custom setup can also be made by following the customization guides.

Inspecting the raw data

The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.

Step 3: Run sample workflow

There's an included test_workflow.py file which runs through a full sample workflow:

Register feature definitions through feast apply
Generate a training dataset (using get_historical_features)
Generate features for batch scoring (using get_historical_features)

We'll walk through some snippets of code below and explain them.

Step 3a: Register feature definitions and deploy your feature store

The apply command scans python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example_repo.py and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by configuring online_store in feature_store.yaml.

Step 3b: Generating training data or powering batch scoring models

To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.

Feast needs a list of entities (e.g. driver ids) and timestamps. Feast will intelligently join relevant tables to create the relevant feature vectors. There are two ways to generate this list:

The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation.
The user can also query that table with a SQL query which pulls entities. See the documentation on feature retrieval for details.

Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.

Generating training data
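A minimal sketch of this step with the Python SDK is shown below. It assumes Feast and pandas are installed and that `feast apply` has been run in the repository at `repo_path`; the wrapper name `build_training_df` is hypothetical, while `get_historical_features` and `to_df` are SDK calls.

```python
def build_training_df(repo_path, entity_df, feature_refs):
    """Join the requested features onto entity_df with point-in-time correctness.

    entity_df: a pandas DataFrame with the entity key column(s), an
        `event_timestamp` column, and optionally label columns.
    feature_refs: strings of the form "<feature view>:<feature name>".
    """
    # Imported lazily so this module still loads where Feast is not installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    retrieval_job = store.get_historical_features(
        entity_df=entity_df,
        features=feature_refs,
    )
    return retrieval_job.to_df()
```

A call might look like `build_training_df(".", entity_df, ["driver_hourly_stats:conv_rate"])`, where `conv_rate` is an assumed feature name in the `driver_hourly_stats` feature view.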

Run offline inference (batch scoring)

To power a batch model, we primarily need to generate features with the get_historical_features call, but using the current timestamp.

Step 3c: Ingest batch features into your online store

We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).
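The difference between a full and an incremental run comes down to which time window is processed. The sketch below tracks that window with a hypothetical helper class; the `default_start` fallback for the very first run is an assumption, not a statement of how Feast chooses its starting point.

```python
from datetime import datetime


class MaterializationTracker:
    """Remembers where the last materialization run ended (hypothetical)."""

    def __init__(self):
        self.last_end = None

    def materialize(self, start, end):
        """Process an explicit window and remember its end."""
        self.last_end = end
        return (start, end)

    def materialize_incremental(self, end, default_start):
        """Process only the window since the previous run, if there was one."""
        start = self.last_end if self.last_end is not None else default_start
        return self.materialize(start, end)


tracker = MaterializationTracker()
first = tracker.materialize_incremental(
    end=datetime(2021, 4, 12), default_start=datetime(2021, 1, 1)
)
second = tracker.materialize_incremental(
    end=datetime(2021, 4, 13), default_start=datetime(2021, 1, 1)
)
print(first)
print(second)
```

The second run starts where the first one ended, so only new feature values are loaded.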

Step 3d: Fetching feature vectors for inference

At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.
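A minimal sketch of the online lookup is shown below. It assumes Feast is installed and that features have already been materialized into the online store; the wrapper name `fetch_online_features` is hypothetical, while `get_online_features` and `to_dict` are SDK calls.

```python
def fetch_online_features(repo_path, feature_refs, entity_rows):
    """Read the latest feature values for the given entities.

    feature_refs: strings of the form "<feature view>:<feature name>".
    entity_rows: one dict per entity, e.g. [{"driver_id": 1001}].
    """
    # Imported lazily so this module still loads where Feast is not installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    response = store.get_online_features(
        features=feature_refs,
        entity_rows=entity_rows,
    )
    # Maps each feature (and entity key) name to a list of values, one per row.
    return response.to_dict()
```

The returned dictionary can be turned into a model input, for instance by building a dataframe from it. The `driver_id` key in the docstring example is an assumed join key.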

Step 3e: Using a feature service to fetch online features instead.

You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below.

The driver_activity_v1 feature service pulls all features from the driver_hourly_stats feature view:

Step 4: Browse your features with the Web UI (experimental)

View all registered features, data sources, entities, and feature services with the Web UI.

One of the ways to view this is with the feast ui command.

Step 5: Re-examine `test_workflow.py`

Take a look at test_workflow.py again. It showcases many sample flows on how to interact with Feast. You'll see these show up in the upcoming concepts + architecture + tutorial pages as well.

Next steps

Join the Feast community to get new updates on Feast / feature stores.
Read the Concepts page to understand the Feast data model.
Read the Architecture page.

Codebase Structure

Let's examine the Feast codebase. This analysis is accurate as of Feast 0.23.

Python SDK

The Python SDK lives in sdk/python/feast. The majority of Feast logic lives in these Python files:

The core Feast objects (entities, feature views, data sources, etc.) are defined in their respective Python files, such as entity.py, feature_view.py, and data_source.py.
The FeatureStore class is defined in feature_store.py and the associated configuration object (the Python representation of the feature_store.yaml file) is defined in repo_config.py.
The CLI and other core feature store logic are defined in cli.py and repo_operations.py.
The type system that is used to manage conversion between Feast types and external typing systems is managed in type_map.py.
The Python feature server (the server that is started through the feast serve command) is defined in feature_server.py.

There are also several important submodules:

infra/ contains all the infrastructure components, such as the provider, offline store, online store, batch materialization engine, and registry.
dqm/ covers data quality monitoring, such as the dataset profiler.
diff/

Of these submodules, infra/ is the most important. It contains the interfaces for the provider, offline store, online store, batch materialization engine, and registry, as well as all of their individual implementations.

The tests for the Python SDK are contained in sdk/python/tests. For more details, see the documentation of the test suite.

Example flow: `feast apply`

Let's walk through how feast apply works by tracking its execution across the codebase.

All CLI commands are in cli.py. Most of these commands are backed by methods in repo_operations.py. The feast apply command triggers apply_total_command, which then calls apply_total in repo_operations.py.

At this point, the feast apply command is complete.

Example flow: `feast materialize`

Let's walk through how feast materialize works by tracking its execution across the codebase.

The feast materialize command triggers materialize_command in cli.py, which then calls FeatureStore.materialize from feature_store.py.
This then calls Provider.materialize_single_feature_view

Example flow: `get_historical_features`

Let's walk through how get_historical_features works by tracking its execution across the codebase.

We start with FeatureStore.get_historical_features in feature_store.py. This method does some internal preparation, and then delegates the actual execution to the underlying provider by calling Provider.get_historical_features, which can be found in infra/provider.py.
As with feast apply, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.get_historical_features

Java SDK

The java/ directory contains the Java serving component. See for more details on how the repo is structured.

Go feature server

The go/ directory contains the Go feature server. Most of the files here have logic to help with reading features from the online store. Within go/, the internal/feast/ directory contains most of the core logic:

onlineserving/ covers the core serving logic.
model/ contains the implementations of the Feast objects (entity, feature view, etc.).

Protobufs

Feast uses Protocol Buffers (protobufs) to store serialized versions of the core Feast objects. The protobuf definitions are stored in protos/feast.

The registry consists of the serialized representations of the Feast objects.

Typically, changes to the Feast objects require changes to their corresponding protobuf representations. The usual best practices for changing protobufs should be followed to ensure backwards and forwards compatibility.

Web UI

The ui/ directory contains the Web UI. See for more details on the structure of the Web UI.

Feature view

Feature views

Note: Feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps.

A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations.

Feature views consist of:

a data source
zero or more entities
- If the features are not related to a specific object, the feature view might not have entities; see below.

Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.

Feature views are used during

The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from stream sources.
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.

Feature views without entities

If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only timestamps are needed for this feature view).

Feature inferencing

If the schema parameter is not specified in the creation of the feature view, Feast will infer the features during feast apply by creating a Field for each column in the underlying data source except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
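The inference rule above can be sketched in plain Python. This is a minimal illustration of the described behaviour, not Feast's implementation: the helper `infer_fields`, the column names, and the type strings are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def infer_fields(
    source_columns: Dict[str, str],
    entity_columns: Set[str],
    timestamp_columns: Set[str],
) -> List[Tuple[str, str]]:
    """Return (name, type) pairs for every column that is neither an
    entity join key nor a timestamp column, keeping source order."""
    excluded = entity_columns | timestamp_columns
    return [
        (name, dtype)
        for name, dtype in source_columns.items()
        if name not in excluded
    ]


# Hypothetical columns of the data source behind a feature view.
columns = {
    "driver_id": "int64",
    "event_timestamp": "timestamp",
    "created": "timestamp",
    "conv_rate": "float32",
    "avg_daily_trips": "int64",
}

inferred = infer_fields(
    columns,
    entity_columns={"driver_id"},
    timestamp_columns={"event_timestamp", "created"},
)
print(inferred)
```

Only `conv_rate` and `avg_daily_trips` survive, each keeping the name and type of its source column.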

Entity aliasing

"Entity aliases" can be specified to join entity_dataframe columns that do not match the column names in the source table of a FeatureView.

This could be used if a user has no control over these column names or if multiple entities are subclasses of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.

Rather than registering each new copy, we suggest dynamically specifying the new FeatureView name with .with_name and the join_key_map override with .with_join_key_map.
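Conceptually, a join key map lets one feature view be joined against several differently named columns of the entity dataframe. The sketch below illustrates that renaming in plain Python. It assumes the map goes from the feature view's join key to the alias column in the entity dataframe. The helper `resolve_alias` and the `location_id`, `origin_id`, and `destination_id` names are hypothetical, and this is not Feast's implementation.

```python
from typing import Dict, List


def resolve_alias(
    entity_rows: List[Dict[str, object]],
    join_key_map: Dict[str, str],
) -> List[Dict[str, object]]:
    """Rename aliased columns in each entity row back to the join key
    used by the feature view's source table."""
    alias_to_join_key = {alias: key for key, alias in join_key_map.items()}
    return [
        {alias_to_join_key.get(col, col): value for col, value in row.items()}
        for row in entity_rows
    ]


# One entity row that references the "location" entity twice.
rows = [{"origin_id": 1, "destination_id": 2}]

# Each alias yields its own view of the same underlying location features.
origin_rows = resolve_alias(rows, {"location_id": "origin_id"})
destination_rows = resolve_alias(rows, {"location_id": "destination_id"})
```

Each call produces rows that a location feature view keyed on `location_id` could be joined against, once for the origin and once for the destination.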

Field

A field or feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they make in an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month. Supported types for fields in Feast can be found in sdk/python/feast/types.py.

Fields are defined as part of feature views. Since Feast does not transform data, a field is essentially a schema that only contains a name and a type:

Together with data sources, they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using feature references.

Feature names must be unique within a feature view.

Each field can have additional metadata associated with it, specified as key-value tags.

[Alpha] On demand feature views

On demand feature views allow data scientists to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic, which is executed in both the historical retrieval and online retrieval paths.

Currently, these transformations are executed locally. This is fine for online serving, but does not scale well to offline retrieval.

Why use on demand feature views?

This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could

Call get_historical_features to generate a training dataframe
Iterate in notebook on feature engineering in Pandas
Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository

[Alpha] Stream feature views

A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.

Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:

See for an example of how to use stream feature views to register your own streaming data pipelines in Feast.

Feature retrieval

Overview

Generally, Feast supports several patterns of feature retrieval:

Training data generation (via feature_store.get_historical_features(...))
Offline feature retrieval for batch scoring (via feature_store.get_historical_features(...))
Online feature retrieval for real-time model predictions
- via the SDK: feature_store.get_online_features(...)
- via deployed feature server endpoints: requests.post('http://localhost:6566/get-online-features', data=json.dumps(online_request))

Each of these retrieval mechanisms accepts:

some way of specifying entities (to fetch features for)
some way to specify the features to fetch (either via feature services, which group features needed for a model version, or feature references)

Before beginning, you need to instantiate a local FeatureStore object that knows how to parse the registry.

For code examples of how the below work, inspect the generated repository from feast init -t [YOUR TEMPLATE] (gcp, snowflake, and aws are the most fully fleshed).

Concepts

Before diving into how to retrieve features, we need to understand some high level concepts in Feast.

Feature Services

A feature service is an object that represents a logical group of features from one or more feature views. Feature services allow features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.

Feature services are used during

The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now())
Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.

Applying a feature service does not result in an actual service being deployed.

Feature services enable referencing all or some features from a feature view.

Retrieving from the online store with a feature service

Retrieving from the offline store with a feature service

Feature References

This mechanism of retrieving features is only recommended while you're experimenting. Once you want to launch experiments or serve models, feature services are recommended.

Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>

Feature references are used for the retrieval of features from Feast:

It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.
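The `<feature_view>:<feature>` structure described above can be split with a few lines of Python. This is only an illustrative sketch: `FeatureRef` and `parse_feature_ref` are hypothetical helpers, not part of the Feast SDK, and the feature view and feature names are made up.

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    feature_view: str
    feature: str


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split a '<feature_view>:<feature>' string into its two parts."""
    feature_view, sep, feature = ref.partition(":")
    if not sep or not feature_view or not feature or ":" in feature:
        raise ValueError(f"expected '<feature_view>:<feature>', got {ref!r}")
    return FeatureRef(feature_view, feature)


# A single request may reference features from several feature views.
refs = ["driver_hourly_stats:conv_rate", "customer_profile:lifetime_value"]
parsed = [parse_feature_ref(r) for r in refs]
views = sorted({p.feature_view for p in parsed})
```

Here `views` lists the two feature views that a single retrieval request would need to join.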

Note: if you're using feature views without entities, then those features can be added here without additional entity values in the entity_rows parameter.

Event timestamp

The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.

Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.
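To make the role of event timestamps in a point-in-time join concrete, here is a minimal sketch in plain Python. It is not Feast's join logic: it ignores TTLs and created timestamps, and the helper `point_in_time_value`, the `driver_id` key, and the `conv_rate` feature are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional

FeatureRow = Dict[str, object]


def point_in_time_value(
    feature_rows: List[FeatureRow],
    driver_id: int,
    as_of: datetime,
) -> Optional[float]:
    """Return the latest conv_rate observed for driver_id at or before as_of."""
    candidates = [
        row
        for row in feature_rows
        if row["driver_id"] == driver_id and row["event_timestamp"] <= as_of
    ]
    if not candidates:
        return None  # nothing had been observed yet at that time
    latest = max(candidates, key=lambda row: row["event_timestamp"])
    return latest["conv_rate"]


feature_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "conv_rate": 0.5},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "conv_rate": 0.7},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 12), "conv_rate": 0.9},
]

# An entity row stamped 11:00 must not see the value observed at 12:00.
training_value = point_in_time_value(feature_rows, 1001, datetime(2021, 4, 12, 11))
too_early = point_in_time_value(feature_rows, 1001, datetime(2021, 4, 12, 7))
```

The 11:00 entity row picks up the 10:00 observation, which is how future feature values are kept out of training data.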

Dataset

A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.

Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.

Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.

Retrieving historical features (for training data or batch scoring)

Feast abstracts away point-in-time join complexities with the get_historical_features API.

We go through the major steps, and also show example code. Note that the quickstart templates generally have end-to-end working examples for all these cases.

Full example: generate training data

Full example: retrieve offline features for batch scoring

The main difference here compared to training data generation is how to handle timestamps in the entity dataframe. You want to pass in the current time to get the latest feature values for all your entities.
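As a sketch of the timestamp handling described above, the rows of a batch-scoring entity dataframe can all share the current time. The helper `batch_scoring_rows` and the `driver_id` and `event_timestamp` column names are assumptions made for illustration. In practice the rows would be turned into a Pandas dataframe (for example with `pandas.DataFrame(rows)`) before being handed to get_historical_features.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def batch_scoring_rows(
    entity_ids: List[int],
    now: Optional[datetime] = None,
) -> List[Dict[str, object]]:
    """Build one entity row per id, all stamped with the same scoring time."""
    scoring_time = now or datetime.now(timezone.utc)
    return [
        {"driver_id": entity_id, "event_timestamp": scoring_time}
        for entity_id in entity_ids
    ]


rows = batch_scoring_rows([1001, 1002, 1003])
```

For training data, each row would instead carry the historical timestamp of the event being modeled.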

Step 1: Specifying Features

Feast accepts either:

Feature services, which group features needed for a model version
A list of feature references

Example: querying a feature service (recommended)

Example: querying a list of feature references

Step 2: Specifying Entities

Feast accepts either a Pandas dataframe as the entity dataframe (including entity keys and timestamps) or a SQL query to generate the entities.

Both approaches must specify the full entity key needed as well as the timestamps. Feast then joins features onto this dataframe.

Example: entity dataframe for generating training data

Example: entity SQL query for generating training data

You can also pass a SQL string to generate the above dataframe. This is useful for getting all entities in a timeframe from some data source.

Retrieving online features (for model inference)

Feast will ensure the latest feature values for registered features are available. At retrieval time, you need to supply a list of entities and the corresponding features to be retrieved. Similar to get_historical_features, we recommend using feature services as a mechanism for grouping features in a model version.

Note: unlike get_historical_features, the entity_rows do not need timestamps since you only want one feature value per entity key.

There are two options for retrieving online features: through the SDK, or through a feature server.

Full example: retrieve online features for real-time model inference (Python SDK)

Full example: retrieve online features for real-time model inference (Feature Server)

This approach requires you to deploy a feature server.
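The request body sent to the feature server can be assembled with the standard library. The sketch below only builds the JSON string and sends nothing. The `features` and `entities` body shape is an assumption based on the Python feature server's documented request format and should be checked against the deployed version. `build_online_request` and the feature names are hypothetical.

```python
import json
from typing import Dict, List


def build_online_request(
    features: List[str],
    entities: Dict[str, List[int]],
) -> str:
    """Serialize feature references and entity keys into a JSON body."""
    lengths = {len(values) for values in entities.values()}
    if len(lengths) > 1:
        raise ValueError("every entity column must have the same number of rows")
    return json.dumps({"features": features, "entities": entities})


online_request = build_online_request(
    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"],
    entities={"driver_id": [1001, 1002]},
)
```

The resulting string could then be posted to the get-online-features endpoint, as in the `requests.post(...)` call shown in the overview.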

Running Feast in production (e.g. on Kubernetes)

Overview

After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise (e.g. on Kubernetes).

A typical production architecture looks like:

Important note: Feast is highly customizable and modular.

Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.

For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.

Additionally, please check the how-to guide for some specific recommendations on

In this guide we will show you how to:

Deploy your feature store and keep your infrastructure in sync with your feature repository
Keep the data in your online store up to date (from batch and stream sources)
Use Feast for model training and serving

1. Automatically deploying changes to your feature definitions

1.1 Setting up a feature repository

The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running feast apply commits feature definitions to a registry, which users can then read elsewhere.

1.2 Setting up a database-backed registry

Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available .

Note: A SQL-based registry primarily works with a Python feature server. The Java feature server does not understand this registry type yet.

1.3 Setting up CI/CD to automatically update the registry

We typically recommend setting up CI/CD to automatically run feast plan and feast apply when pull requests are opened or merged.

1.4 Setting up multiple environments

A common need when running Feast in production is testing changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (potentially with a smaller data set).

Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.

Different options are presented in the .

2. How to load data into your online store and keep it up to date

To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.

2.1 Scalable Materialization

Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.

This approach may not scale to the large amounts of data that Feast users may be dealing with in production. In this case, we recommend using one of the more scalable materialization engines, such as the Bytewax materialization engine. Users may also need to write a custom materialization engine to work on their existing infrastructure.

The Bytewax materialization engine can run materialization on an existing Kubernetes cluster. An example configuration of this in a feature_store.yaml is as follows:

2.2 Scheduled materialization with Airflow

See also for code snippets

It is up to you to orchestrate and schedule runs of materialization.

Feast keeps the history of materialization in its registry, so the scheduler can be as simple as a cron job. Cron should be sufficient when you have just a few materialization jobs (usually one per feature view) that are triggered infrequently.

However, the amount of work can quickly outgrow the resources of a single machine. That happens because the materialization job needs to repackage all rows before writing them to an online store. That leads to high utilization of CPU and memory. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.

If you are using Airflow as a scheduler, Feast can be invoked through a after the has been installed into a virtual environment and your feature repo has been synced:

You can see more in an example at .

Important note: The Airflow worker must have read and write permissions to the registry file on GCS / S3, since it pulls configuration and updates materialization history.

2.3 Stream feature ingestion

See more details at , which shows how to ingest streaming features or 3rd party feature data via a push API.

This supports pushing feature values into Feast to the online store, the offline store, or both.

2.4 Scheduled batch transformations with Airflow + dbt

Feast does not orchestrate batch transformation DAGs. For this, you can rely on tools like Airflow + dbt. See for an example and some tips.

3. How to use Feast for model training

3.1. Generating training data

For more details, see

After we've defined our features and data sources in the repository, we can generate training datasets. We highly recommend you use a FeatureService to version the features that go into a specific model version.

The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.
- One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.

3.2 Versioning features that power ML models

The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that feature service names and model versions follow an established naming convention.

For example, in MLflow:

It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.

4. Retrieving online features for prediction

Once you have successfully loaded data from batch / streaming sources into the online store, you can start consuming features for model inference.

4.1. Use the Python SDK within an existing Python service

This approach is the most convenient way to keep your infrastructure minimal and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc.), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.

To integrate online retrieval into your service use the following code:

4.2. Deploy Feast feature servers on Kubernetes

To deploy a Feast feature server on Kubernetes, you can use the included Helm chart (which also has detailed instructions and an example tutorial).

Basic steps

Install and
Add the Feast Helm repository and download the latest charts:

Run Helm Install

This will deploy a single service. The service must have read access to the registry file on cloud storage and to the online store (e.g. via ). It keeps a copy of the registry in memory and refreshes it periodically, so expect some delay in update propagation in exchange for better performance.

5. Using environment variables in your yaml configuration

You might want to dynamically set parts of your configuration from your environment, for instance to deploy Feast to production and development with the same configuration but a different server, or to inject secrets without exposing them in your Git repo. To do this, you can use the ${ENV_VAR} syntax in your feature_store.yaml file. For instance:

It is possible to set a default value if the environment variable is not set, with ${ENV_VAR:"default"}. For instance:
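The two placeholder forms, `${ENV_VAR}` and `${ENV_VAR:"default"}`, can be illustrated with a small substitution routine. This is a sketch of the described syntax only, not the code Feast uses to load feature_store.yaml. The function `substitute_env`, the `connection_string` key, and the `REDIS_HOST` and `REDIS_PORT` variable names are hypothetical.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:"default"}.
_PLACEHOLDER = re.compile(r'\$\{(\w+)(?::"([^"]*)")?\}')


def substitute_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace placeholders with environment values, falling back to defaults."""
    values = os.environ if env is None else env

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in values:
            return values[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(replace, text)


yaml_text = 'connection_string: ${REDIS_HOST:"localhost"}:${REDIS_PORT}'

# REDIS_HOST is unset, so its default applies; REDIS_PORT comes from the env.
rendered = substitute_env(yaml_text, {"REDIS_PORT": "6379"})
```

Passing an explicit mapping instead of relying on `os.environ` keeps the sketch easy to test. A variable with neither a value nor a default raises an error in this sketch, which is a design choice of the example rather than documented Feast behaviour.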

Summary

In summary, the overall architecture in production may look like:

The Feast SDK is triggered by CI (e.g. GitHub Actions) and applies the latest changes from the feature repo to the Feast database-backed registry.
Data ingestion
- Batch data: Airflow manages batch transformation jobs + materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine

Adding a new offline store

Overview

Feast makes adding support for a new offline store easy. Developers can simply implement the OfflineStore interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and BigQuery).

In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative of adding support for any new offline store.

The full working code for this guide can be found at .

The process for using a custom offline store consists of 8 steps:

Defining an OfflineStore class.
Defining an OfflineStoreConfig class.
Defining a RetrievalJob

1. Defining an OfflineStore class

OfflineStore class names must end with the OfflineStore suffix!

Contrib offline stores

New offline stores go in sdk/python/feast/infra/offline_stores/contrib/.

What is a contrib plugin?

Not guaranteed to implement all interface methods
Not guaranteed to be stable.
Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers.

How do I make a contrib plugin an "official" plugin?

To move an offline store plugin out of contrib, you need:

GitHub Actions (i.e. make test-python-integration) is set up to run all tests against the offline store, and the tests pass.
At least two contributors own the plugin (ideally tracked in our OWNERS / CODEOWNERS file).

Define the offline store class

The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.

To fully implement the interface for the offline store, you will need to implement these methods:

pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.
get_historical_features
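The core idea behind pulling the latest rows for materialization can be sketched as "keep the newest row per join key within a time window". The example below is a plain-Python illustration under that assumption, not the signature or behaviour of the real OfflineStore method. The function `pull_latest`, the inclusive window bounds, and the column names are hypothetical.

```python
from datetime import datetime
from typing import Dict, List

Row = Dict[str, object]


def pull_latest(
    rows: List[Row],
    join_key: str,
    timestamp_field: str,
    start_date: datetime,
    end_date: datetime,
) -> List[Row]:
    """Keep, for each join key, the newest row whose timestamp is in the window."""
    latest: Dict[object, Row] = {}
    for row in rows:
        ts = row[timestamp_field]
        if not (start_date <= ts <= end_date):
            continue  # outside the materialization window
        key = row[join_key]
        if key not in latest or ts > latest[key][timestamp_field]:
            latest[key] = row
    return list(latest.values())


source_rows = [
    {"driver_id": 1, "event_timestamp": datetime(2022, 1, 1), "conv_rate": 0.1},
    {"driver_id": 1, "event_timestamp": datetime(2022, 1, 3), "conv_rate": 0.3},
    {"driver_id": 2, "event_timestamp": datetime(2022, 1, 2), "conv_rate": 0.2},
    {"driver_id": 2, "event_timestamp": datetime(2022, 1, 10), "conv_rate": 0.9},
]

latest_rows = pull_latest(
    source_rows,
    join_key="driver_id",
    timestamp_field="event_timestamp",
    start_date=datetime(2022, 1, 1),
    end_date=datetime(2022, 1, 5),
)
latest_by_driver = {row["driver_id"]: row["conv_rate"] for row in latest_rows}
```

A real implementation would usually push this deduplication down into the offline store's query engine rather than loop over rows in Python.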

1.1 Type Mapping

Most offline stores will have to perform some custom mapping of offline store datatypes to Feast value types.

The functions to implement here are source_datatype_to_feast_value_type and get_column_names_and_types in your DataSource class.
source_datatype_to_feast_value_type is used to convert your DataSource's datatypes to Feast value types.

Add any helper functions for type conversion to sdk/python/feast/type_map.py.

Be sure to implement the type mapping correctly so that Feast can process your feature columns without incorrect casts, which can cause loss of information or incorrect data.

2. Defining an OfflineStoreConfig class

Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.

To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined .

The FeastConfigBaseModel is a Pydantic class, which parses YAML configuration into Python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.

Additionally, the name of the config class must be the same as the OfflineStore class, with the Config suffix.
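The naming rules above (a fully qualified type, an OfflineStore suffix, and a matching Config class name) can be checked mechanically. The sketch below shows one way to derive the expected names from a type string. `split_type_path` and the module path `feast_custom_offline_store.file` are hypothetical, and this is not how Feast itself resolves the class.

```python
from typing import Tuple


def split_type_path(type_path: str) -> Tuple[str, str, str]:
    """Return (module name, store class name, expected config class name)."""
    module_name, _, class_name = type_path.rpartition(".")
    if not module_name:
        raise ValueError("type must be a fully qualified class name")
    if not class_name.endswith("OfflineStore"):
        raise ValueError("class name must end with the OfflineStore suffix")
    return module_name, class_name, class_name + "Config"


module_name, store_class, config_class = split_type_path(
    "feast_custom_offline_store.file.CustomFileOfflineStore"
)
```

With the module and class names in hand, a loader could import the class with `importlib.import_module(module_name)` followed by `getattr`.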

An example of the config class for the custom file offline store:

This configuration can be specified in the feature_store.yaml as follows:

This configuration information is available to the methods of the OfflineStore via the config: RepoConfig parameter, which is passed into the methods of the OfflineStore interface, specifically at the config.offline_store field of the config parameter. The fields in the feature_store.yaml should map directly to your OfflineStoreConfig class that is detailed above in Section 2.

3. Defining a RetrievalJob class

The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob instance, which represents the execution of the actual query against the underlying store.

Custom offline stores may need to implement their own instances of the RetrievalJob interface.

The RetrievalJob interface exposes two methods: to_df and to_arrow. The retrieval job is expected to return the rows read from the offline store as a pandas DataFrame or as an Arrow table, respectively.

Users who want their offline store to support scalable batch materialization for online use cases will also need to implement to_remote_storage to distribute the reading and writing of offline store records to blob storage (such as S3). This may be used by a custom materialization engine to parallelize the materialization of data by processing it in chunks. If this is not implemented, Feast will default to local materialization (pulling all records into memory to materialize).
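The lazy-execution contract described in this section can be illustrated with a stand-in class that defers the read until a result is requested. This is not Feast's RetrievalJob interface: `LazyJob`, `to_rows`, and `read_store` are hypothetical, and a real job would return a DataFrame or Arrow table instead of a list of dicts.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class LazyJob:
    """Wraps a query and runs it only when results are requested."""

    def __init__(self, query_fn: Callable[[], List[Row]]) -> None:
        self._query_fn = query_fn
        self.executed = False

    def to_rows(self) -> List[Row]:
        # The read against the underlying store happens here, not in __init__.
        self.executed = True
        return self._query_fn()


calls: List[str] = []


def read_store() -> List[Row]:
    calls.append("read")  # records that the store was actually queried
    return [{"driver_id": 1001, "conv_rate": 0.7}]


job = LazyJob(read_store)  # constructing the job performs no read
executed_before = job.executed
rows = job.to_rows()  # the query runs now
```

Deferring the read lets the caller decide how to consume the result and lets the offline store avoid work for jobs that are never executed.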

4. Defining a DataSource class for the offline store

Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.

The data source class should implement two methods - from_proto, and to_proto.

For custom offline stores that are not being implemented in the main feature repo, the custom_options field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto method and reading the value back from bytes in the from_proto method.
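The bytes round trip described above can be as simple as JSON encoding. The sketch below shows only the serialization step. JSON is one possible encoding chosen for illustration, and `options_to_bytes`, `options_from_bytes`, and the `path` option are hypothetical names rather than Feast APIs.

```python
import json
from typing import Dict


def options_to_bytes(options: Dict[str, str]) -> bytes:
    """Encode data source options for storage in a bytes field (to_proto side)."""
    return json.dumps(options, sort_keys=True).encode("utf-8")


def options_from_bytes(raw: bytes) -> Dict[str, str]:
    """Decode options previously written by options_to_bytes (from_proto side)."""
    return json.loads(raw.decode("utf-8"))


original = {"path": "data/driver_stats.parquet"}
payload = options_to_bytes(original)
restored = options_from_bytes(payload)
```

Whatever encoding is used, the two directions must stay in sync, since the registry stores only the raw bytes.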

5. Using the custom offline store

After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the offline_store field. The value specified should be the fully qualified class name of the OfflineStore.

As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

To use our custom file offline store, we can use the following feature_store.yaml:

If additional configuration for the offline store is not required, then we can omit the other fields and only specify the type of the offline store class as the value for the offline_store.

Finally, the custom data source class can be used in the feature repo to define a data source, which can then be referenced in a feature view definition.

6. Testing the OfflineStore class

Integrating with the integration test suite and unit test suite.

Even if you have created the OfflineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo.

In order to test against the test suite, you need to create a custom DataSourceCreator that implements our testing infrastructure methods: create_data_source and, optionally, create_saved_dataset_destination.
- create_data_source should create a datasource based on the dataframe passed in. It may be implemented by uploading the contents of the dataframe into the offline store and returning a datasource object pointing to that location. See

7. Dependencies

Add any dependencies for your offline store to our sdk/python/setup.py under a new <OFFLINE_STORE>_REQUIRED list with the packages, and add it to the setup script so that users who need your offline store can install the necessary Python packages. These packages should be defined as extras so that they are not installed by default. You will need to regenerate our requirements files. To do this, create separate pyenv environments for Python 3.8, 3.9, and 3.10. In each environment, run the following commands:

8. Add Documentation

Remember to add documentation for your offline store.

Add a new markdown file to docs/reference/offline-stores/ and docs/reference/data-sources/. Use these files to document your offline store functionality similar to how the other offline stores are documented.
You should also add a reference in docs/reference/data-sources/README.md and docs/SUMMARY.md to these markdown files.

NOTE: Be sure to document the following things about your offline store:

How to create the data source and what configuration is needed in the feature_store.yaml file in order to create the data source.
Make sure to flag that the datasource is in alpha development.
Add some documentation on what the data model is for the specific offline store for more clarity.

Links & Resources

How can I get help?

Community Calls

General community call (biweekly)

Frequency (every 2 weeks)

Links

Developers call (biweekly)

Frequency (every 2 weeks)

Links

What is Feast?

Who is Feast for?

What Feast is not?

Feast is not

Feast does not fully solve

Example use cases

How can I get started?

Integrations

Standards

Overview

Scaling Feast Registry

Scaling Materialization

Description

Example

Supported Types

Description

Examples

Supported Types