The list below contains the functionality that contributors are planning to develop for Feast. Items that are in development (or planned for development) are indicated in parentheses.
We welcome contributions to all items in the roadmap!
Want to influence our roadmap and prioritization? Submit your feedback to this form.
Want to speak to a Feast contributor? We are more than happy to jump on a call. Please schedule a time using Calendly.
Data Sources
Offline Stores
Online Stores
Streaming
Feature Engineering
Deployments
Feature Serving
Data Quality Management (See RFC)
Feature Discovery and Governance
Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production. Feast is able to serve feature data to models from a low-latency online store (for real-time prediction) or from an offline store (for scale-out batch scoring or model training).
Models need consistent access to data: Machine Learning (ML) systems built on traditional data infrastructure are often coupled to databases, object stores, streams, and files. A result of this coupling, however, is that any change in data infrastructure may break dependent ML systems. Another challenge is that dual implementations of data retrieval for training and serving can lead to inconsistencies in data, which in turn can lead to training-serving skew.
Feast decouples your models from your data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval. Feast also provides a consistent means of referencing feature data for retrieval, and therefore ensures that models remain portable when moving from training to serving.
Deploying new features into production is difficult: Many ML teams consist of members with different objectives. Data scientists, for example, aim to deploy features into production as soon as possible, while engineers want to ensure that production systems remain stable. These differing objectives can create organizational friction that slows time-to-market for new features.
Feast addresses this friction by providing both a centralized registry to which data scientists can publish features and a battle-hardened serving layer. Together, these enable non-engineering teams to ship features into production with minimal oversight.
Models need point-in-time correct data: ML models in production require a view of data consistent with the one on which they are trained, otherwise the accuracy of these models could be compromised. Despite this need, many data science projects suffer from inconsistencies introduced by future feature values being leaked to models during training.
Feast solves the challenge of data leakage by providing point-in-time correct feature retrieval when exporting feature datasets for model training.
Features aren't reused across projects: Different teams within an organization are often unable to reuse features across projects. The siloed nature of development and the monolithic design of end-to-end ML systems contribute to duplication of feature creation and usage across teams and projects.
Feast addresses this problem by introducing feature reuse through a centralized registry. This registry enables multiple teams working on different projects not only to contribute features, but also to reuse these same features. With Feast, data scientists can start new ML projects by selecting previously engineered features from a centralized registry, and are no longer required to develop new features for each project.
Feature engineering: We aim for Feast to support light-weight feature engineering as part of our API.
Feature discovery: We also aim for Feast to include a first-class user interface for exploring and discovering entities and features.
Feature validation: We additionally aim for Feast to improve support for statistics generation of feature data and subsequent validation of these statistics. Current support is limited.
ETL or ELT system: Feast is not (and does not plan to become) a general purpose data transformation or pipelining system. Feast plans to include a light-weight feature engineering toolkit, but we encourage teams to integrate Feast with upstream ETL/ELT systems that are specialized in transformation.
Data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
Data catalog: Feast is not a general purpose data catalog for your organization. Feast is purely focused on cataloging features for use in ML pipelines or systems, and only to the extent of facilitating the reuse of features.
The best way to learn Feast is to use it. Head over to our Quickstart and try it out!
Explore the following resources to get started with Feast:
Quickstart is the fastest way to get started with Feast
Concepts describes all important Feast API concepts
Architecture describes Feast's overall architecture.
Tutorials contains full examples of using Feast in machine learning applications.
Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.
Reference contains detailed API and design documents.
Contributing contains resources for anyone who wants to contribute to Feast.
A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.
Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.
Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.
Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>
Feature references are used for the retrieval of features from Feast:
It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.
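As a sketch (the feature view, feature, and entity names here are hypothetical), feature references are passed to the retrieval APIs like this:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Each reference has the form "<feature_view>:<feature>"
online_features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```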
The timestamp at which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.
Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.
Feature services are used during:
The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features from the online store. The features retrieved from the online store may also belong to multiple feature views.
Applying a feature service does not result in an actual service being deployed.
Feature services can be retrieved from the feature store, and referenced when retrieving features from the online store.
Feature services can also be used when retrieving historical features from the offline store.
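A minimal sketch of defining a feature service (the feature view variable and all names are hypothetical):

```python
from feast import FeatureService

# Group the features a specific model consumes under one named service.
driver_model_fs = FeatureService(
    name="driver_ranking_model_v1",
    features=[driver_hourly_stats_view],  # a FeatureView defined elsewhere
)
```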
Feature views are used during:
The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.
Feast does not generate feature values. It acts as the ingestion and serving system. The data sources described within feature views should reference feature values in their already computed form.
If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only event timestamps are needed for this feature view).
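For illustration, a feature view definition might look like the following sketch. The exact classes and arguments vary across Feast versions (newer versions use Field instead of Feature); all names here are hypothetical:

```python
from datetime import timedelta
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# The entity this feature view is keyed on.
driver = Entity(name="driver_id", value_type=ValueType.INT64)

# Where the already-computed feature values live.
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
)

driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
        Feature(name="acc_rate", dtype=ValueType.FLOAT),
    ],
    batch_source=driver_stats_source,
)
```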
If the features parameter is not specified in the feature view creation, Feast will infer the features during feast apply by creating a feature for each column in the underlying data source, except the columns corresponding to the entities of the feature view or to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
"Entity aliases" can be specified to join entity_dataframe
columns that do not match the column names in the source table of a FeatureView.
This could be used if a user has no control over these column names or if there are multiple entities are a subclass of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.
It is suggested that you dynamically specify the new FeatureView name using .with_name and the join_key_map override using .with_join_key_map, instead of needing to register each new copy.
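A sketch of this pattern, assuming a location_stats_view feature view keyed on a location entity (all names hypothetical):

```python
# Reuse one feature view for both sides of a trip by aliasing its join key.
origin_stats = location_stats_view.with_name("origin_stats").with_join_key_map(
    {"location_id": "origin_id"}
)
destination_stats = location_stats_view.with_name("destination_stats").with_join_key_map(
    {"location_id": "destination_id"}
)
```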
A feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made on an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month.
Features are defined as part of feature views. Since Feast does not transform data, a feature is essentially a schema that only contains a name and a type:
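A sketch of such a definition (in newer Feast versions the Feature class is replaced by Field):

```python
from feast import Feature, ValueType

# A feature is just a name plus a type within a feature view's schema.
trips_today = Feature(name="trips_today", dtype=ValueType.INT64)
```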
On demand feature views allow users to use existing features and request time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths:
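The decorator's exact signature has changed across Feast versions; the following is an illustrative sketch only, with hypothetical names:

```python
import pandas as pd
from feast import Feature, ValueType
from feast.on_demand_feature_view import on_demand_feature_view

@on_demand_feature_view(
    inputs={"driver_hourly_stats": driver_hourly_stats_view},
    features=[Feature(name="conv_rate_adjusted", dtype=ValueType.DOUBLE)],
)
def transformed_conv_rate(inputs: pd.DataFrame) -> pd.DataFrame:
    # This transformation runs in both the historical and online retrieval paths.
    df = pd.DataFrame()
    df["conv_rate_adjusted"] = inputs["conv_rate"] * 1.1
    return df
```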
Datasets can be created from:
Results of historical retrieval
[planned] Logging features during writing to online store (from batch source or stream)
To create a saved dataset from historical features for later retrieval or analysis, a user needs to call the get_historical_features method first and then pass the returned retrieval job to the create_saved_dataset method. create_saved_dataset will trigger the provided retrieval job (by calling .persist() on it) to store the data using the specified storage. The storage type must be the same as the globally configured offline store (e.g., it is impossible to persist data to Redshift with a BigQuery source). create_saved_dataset will also create a SavedDataset object with all related metadata and write it to the registry. A saved dataset can later be retrieved using the get_saved_dataset method:
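(A sketch of this flow; the storage class import path may differ across Feast versions, and the entity dataframe, features, and names are hypothetical.)

```python
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")

# entity_df and features are assumed to be defined as in a normal
# historical retrieval call.
job = store.get_historical_features(entity_df=entity_df, features=features)

# Persist the retrieval job's output and register it in the registry.
store.create_saved_dataset(
    from_=job,
    name="my_training_dataset",
    storage=SavedDatasetFileStorage(path="my_training_dataset.parquet"),
)

# Later: retrieve the saved dataset by name.
dataset = store.get_saved_dataset("my_training_dataset")
df = dataset.to_df()
```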
Note: if you're using feature views without entities, then those features can be added here without additional entity values in the entity_rows parameter.
A feature service is an object that represents a logical group of features from one or more feature views. Feature services allow features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model, allowing for tracking of the features used by models.
A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Feature views consist of zero or more entities, one or more features, and a data source. Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. If the features are not related to a specific object, the feature view might not have entities; see feature views without entities below.
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from stream sources.
Together with data sources, they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using feature references.
Feature names must be unique within a feature view.
Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. Data quality monitoring was the primary motivation for creating the dataset concept.
A dataset's metadata is stored in the Feast registry, and the raw data (features, entities, additional input keys, and timestamps) is stored in the offline store.
[planned] Logging requests (including inputs for on demand transformations) and responses during feature serving
Check out our tutorials to see how this concept can be applied in a real-world use case.
A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).
Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
If the built-in providers are not sufficient, you can create your own custom provider. Please see this guide for more details.
Please see feature_store.yaml for configuring providers.
The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.
Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports three different backends:
Local: Used as a local backend for storing the registry during development
S3: Used as a centralized backend for storing the registry on AWS
GCS: Used as a centralized backend for storing the registry on GCP
The feature registry is updated during different operations when using Feast. More specifically, objects within the registry (entities, feature views, feature services) are updated when running apply from the Feast CLI, but metadata about objects can also be updated during operations like materialization.
Users interact with a feature registry through the Feast SDK, for example by listing all feature views or retrieving a specific feature view:
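(A sketch; the feature view name is hypothetical.)

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Listing all feature views
for fv in store.list_feature_views():
    print(fv.name)

# Retrieving a specific feature view
fv = store.get_feature_view("driver_hourly_stats")
```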
The feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.
In this tutorial we will:
Deploy a local feature store with a Parquet file offline store and SQLite online store.
Build a training dataset using our time series features from our Parquet files.
Materialize feature values from the offline store into the online store.
Read the latest features from the online store for inference.
You can run this tutorial in Google Colab or run it on your localhost, following the guided steps below.
In this tutorial, we use feature stores to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:
Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Feast alerts users to offline / online skew with data quality monitoring
Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.
Feature reusability and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).
Feast enables feature transformation so users can re-use transformation logic across online / offline usecases and across models.
Install the Feast SDK and CLI using pip:
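A sketch of the install command:

```
pip install feast
```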
In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see Running Feast with Snowflake/GCP/AWS
Bootstrap a new feature repository using feast init from the command line:
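(The repository name below is a placeholder.)

```
feast init feature_repo
cd feature_repo
```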
Let's take a look at the resulting demo repo itself. It breaks down into:
data/ contains raw demo parquet data
example.py contains demo feature definitions
feature_store.yaml contains a demo setup configuring where data sources are
The key line defining the overall architecture of the feature store is the provider. This defines where the raw data exists (for generating training data & feature values for serving), and where to materialize feature values to in the online store (for serving).
Valid values for provider in feature_store.yaml are:
local: use file source with SQLite/Redis
gcp: use BigQuery/Snowflake with Google Cloud Datastore/Redis
aws: use Redshift/Snowflake with DynamoDB/Redis
Note that there are many other sources Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported datasources.
A custom setup can also be made by following adding a custom provider.
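For reference, a local-provider feature_store.yaml might look like this sketch (project name and paths are placeholders):

```yaml
project: feature_repo
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
```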
The apply command scans Python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example.py (shown again below for convenience) and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by using the local provider in feature_store.yaml.
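To register the demo definitions and set up the online store tables, the invocation is simply:

```
feast apply
```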
To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values).
The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation. In many cases, Feast will also intelligently join relevant tables to create the relevant feature vectors.
Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
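A sketch of generating training data (entity values, timestamps, and feature names are hypothetical):

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

# Entity dataframe: the entity keys and event timestamps to fetch features for,
# typically derived from your label data.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
    }
)

store = FeatureStore(repo_path=".")
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()
```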
We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).
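A sketch of the materialization call, using the current time as the end date:

```
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```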
At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.
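A sketch of online retrieval (entity values are hypothetical):

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1004}, {"driver_id": 1005}],
).to_dict()
```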
Read the Concepts page to understand the Feast data model.
Read the Architecture page.
Check out our Tutorials section for more examples on how to use Feast.
Follow our Running Feast with Snowflake/GCP/AWS guide for a more in-depth tutorial on using Feast.
Join other Feast users and contributors in Slack and become part of the community!
The Feast online store is used for low-latency online feature value lookups. Feature values are loaded into the online store from data sources in feature views using the materialize command.
The storage schema of features within the online store mirrors that of the data source used to populate the online store. One key difference between the online store and data sources is that only the latest feature values are stored per entity key. No historical values are stored.
Example batch data source
Once the above data source is materialized into Feast (using feast materialize), the feature values will be stored as follows:
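As an illustrative sketch (all values are hypothetical): if the batch source contains several rows per driver, only the newest row per entity key survives in the online store.

Batch data source:

| driver_id | conv_rate | event_timestamp |
| --- | --- | --- |
| 1001 | 0.48 | 2021-04-11 10:00:00 |
| 1001 | 0.52 | 2021-04-12 10:00:00 |

Online store after materialization (latest value only):

| driver_id | conv_rate | event_timestamp |
| --- | --- | --- |
| 1001 | 0.52 | 2021-04-12 10:00:00 |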
Don't see your question?
Yes, this is possible. For example, you can use BigQuery as an offline store and Redis as an online store.
Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).
A feature view can be defined with multiple entities. Since each entity has a unique join_key, using multiple entities will achieve the effect of a composite key.
Yes. Specifically:
Simple lists / dense embeddings:
BigQuery supports list types natively
Redshift does not support list types, so you'll need to serialize these features into strings (e.g. json or protocol buffers)
Sparse embeddings (e.g. one hot encodings)
Yes. Using a GCP or AWS provider in feature_store.yaml primarily sets default offline / online stores and configures where the remote registry file can live (using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.
Yes. There are two ways to use S3 in Feast:
Using the s3_endpoint_override parameter in a FileSource data source. This endpoint is more suitable for quick proofs of concept that won't necessarily scale for production use cases.
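A sketch of pointing a FileSource at an S3-compatible endpoint (bucket, path, and endpoint are placeholders; in newer Feast versions event_timestamp_column is named timestamp_field):

```python
from feast import FileSource

driver_stats = FileSource(
    path="s3://my-bucket/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
    s3_endpoint_override="http://localhost:9000",  # e.g., a MinIO endpoint
)
```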
These Feast tutorials showcase how to use Feast to simplify end-to-end model training and serving.
Fraud prediction is a common use case in machine learning. This tutorial is an end-to-end, production-ready fraud prediction system that predicts in real-time whether a transaction made by a user is fraudulent.
Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.
Our end-to-end example will perform the following workflows:
Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model
Making online predictions from feature data
Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):
Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.
In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.
Try it and let us know what you think!
Install Feast with Snowflake dependencies (required when using Snowflake):
Install Feast with GCP dependencies (required when using BigQuery or Firestore):
Install Feast with AWS dependencies (required when using Redshift or DynamoDB):
Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):
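A sketch of the corresponding pip extras:

```
pip install 'feast[snowflake]'
pip install 'feast[gcp]'
pip install 'feast[aws]'
pip install 'feast[redis]'
```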
Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.
When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.
In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.
This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.
This end-to-end tutorial will take you through the following steps:
Deploying Redshift as the interface Feast uses to build training datasets
Registering your features with Feast and configuring DynamoDB for online serving
Building a training dataset with Feast to train your credit scoring model
Loading feature values from S3 into DynamoDB
Making online predictions with your credit scoring model using features from DynamoDB
To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository.
Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like gcp or aws, it may take from a couple of seconds to a minute to run to completion.
If you need to clean up the infrastructure created by feast apply, use the teardown command.
Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!
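The invocation is simply:

```
feast teardown
```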
Features can also be written to the online store directly from a stream; see the guide on stream ingestion.
We encourage you to ask questions on Slack or GitHub. Even better, once you get an answer, add the answer to this FAQ via a pull request!
The quickstart is the easiest way to learn about Feast. For more detailed tutorials, please check out the tutorials page.
Feature tables from Feast 0.9 have been renamed to feature views in Feast 0.10+. For more details, please see the discussion.
No, there are feature views without entities.
The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see the data source and offline store references for more details.
Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support writing features to the online store from streams via the Python SDK.
Please see a detailed comparison of Feast vs. Tecton, as well as other third-party comparisons.
Feast is designed to work at scale and support low latency online serving. Benchmarks will be released soon, and active work is underway to support very latency-sensitive use cases.
Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types.
One way to do this efficiently is to have a protobuf or string representation of the sparse embedding.
The list of supported offline and online stores can be found in the reference documentation. The roadmap indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see the guide on custom providers for more details.
Please follow the instructions in the relevant guide.
Yes. For example, the Postgres connector can be used as both an offline and online store.
Using Redshift as a data source via Spectrum, and then continuing with the Running Feast with Snowflake/GCP/AWS guide. See a presentation we did on this at our apply() meetup.
Feast does not support Spark natively. However, you can create a custom provider that will support Spark, which can help with more scalable materialization and ingestion.
Please see the roadmap.
Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see the Feast 0.10 announcement for more details.
Please see the migration guide. If you have any questions or suggestions, feel free to leave a comment on the document!
For more details on contributing to the Feast community, see our community page and the contribution process guide.
Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the roadmap.
This tutorial guides you on how to use Feast with Google Cloud Platform (GCP). You will learn how to:
Train a model locally (on your laptop) using data from BigQuery
Test the model for online inference using a local emulator of Datastore (for fast iteration)
Test the model for online inference using Datastore (for production use)
Install Feast using pip:
Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features
The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.
Here we'll be using the example repository we created in the previous guide. You can re-create it by running feast init in a new directory.
At this point, no data has been materialized to your online store. feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See the guide on loading data into the online store for more details.
A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See Feature Repository for a detailed explanation of feature repositories.
The easiest way to create a new feature repository is to use the feast init command.
The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development, as the sketch below shows:
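(The repository name is a placeholder.)

```
$ feast init my_feature_repo
$ ls my_feature_repo
data/  example.py  feature_store.yaml
```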
Enter the directory. You can now use this feature repository for development. You can try the following:
Run feast apply to apply these definitions to Feast.
Edit the example feature definitions in example.py and run feast apply again to change feature definitions.
Initialize a git repository in the same directory and check the feature repository into version control.
After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise.
An overview of a typical production configuration is given below:
Important note: We're trying to keep Feast modular. With the exception of the core, most of the Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration. For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features.
Furthermore, there's no single "true" approach. As you will see in this guide, Feast usually provides several options for each problem. It's totally up to you to pick a path that's better suited to your needs.
In this guide we will show you how to:
Deploy your feature store and keep your infrastructure in sync with your feature repository
Keep the data in your online store up to date
Use Feast for model training and serving
Ingest features from a stream source
Monitor your production deployment
The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits.
Most teams will need to have a feature store deployed to more than one environment. We have created an example repository (Feast Repository Example) which contains two Feast projects, one per environment.
The contents of this repository are shown below:
The repository contains three sub-folders:
staging/: This folder contains the staging feature_store.yaml and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.
production/: This folder contains the production feature_store.yaml and Feast objects. Typically users would first test changes in staging before copying the feature definitions into the production folder and committing the changes.
.github: This folder is an example of a CI system that applies the changes in either the staging or production repositories using feast apply. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.
The feature_store.yaml contains the following:
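(A sketch; the project name and bucket are placeholders.)

```yaml
project: production
registry: gs://my-feast-bucket/registry.db
provider: gcp
```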
Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply are tracked in the registry.db. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.
It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.
If your organization consists of many independent data science teams or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment:
In summary, once you have set up a Git based repository with CI that runs feast apply on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.
To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.
The simplest way to schedule materialization is to run an incremental materialization using the Feast CLI:
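A sketch of the CLI invocation, using the end date discussed below:

```
feast materialize-incremental 2022-01-01T00:00:00
```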
The above command will load all feature values from all feature view sources into the online store up to the time 2022-01-01T00:00:00.
A timestamp is required to set the end date for materialization. If your source is fully up to date then the end date would be the current time. However, if you are querying a source where data is not yet available, then you do not want to set the timestamp to the current time. You would want to use a timestamp that ends at a date for which data is available. The next time materialize-incremental is run, Feast will load data that starts from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not been made available. This is commonly the case when your source is an ETL pipeline that is scheduled on a daily basis.
An alternative approach to incremental materialization (where Feast tracks the intervals of data that need to be ingested), is to call Feast directly from your scheduler like Airflow. In this case, Airflow is the system that tracks the intervals that have been ingested.
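A sketch of an explicit materialization over a one-day interval for a single feature view (the view name and dates are illustrative):

```
feast materialize -v driver_hourly_stats 2022-01-01T00:00:00 2022-01-02T00:00:00
```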
In the above example we are materializing the source data from the driver_hourly_stats feature view over a day. This command can be scheduled as the final operation in your Airflow ETL, which runs after you have computed your features and stored them in the source location. Feast will then load your feature data into your online store.
The timestamps above should match the interval of data that has been computed by the data transformation system.
It is up to you which orchestration/scheduler to use to periodically run feast materialize. Feast keeps the history of materialization in its registry, so the choice could be as simple as a unix cron util. Cron should be sufficient when you have just a few materialization jobs (usually one materialization job per feature view) triggered infrequently. However, the amount of work can quickly outgrow the resources of a single machine, because the materialization job needs to repackage all rows before writing them to the online store, which leads to high utilization of CPU and memory. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.
If you are using Airflow as a scheduler, Feast can be invoked through the BashOperator after the Python SDK has been installed into a virtual environment and your feature repo has been synced:
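A hedged sketch of such a task (paths are placeholders, and the operator must live inside a DAG definition):

```python
from airflow.operators.bash import BashOperator

# Materialize the day's data after upstream transformations have completed.
materialize = BashOperator(
    task_id="feast_materialize",
    bash_command=(
        "cd /path/to/feature_repo && "
        "feast materialize-incremental {{ ds }}T00:00:00"
    ),
)
```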
Important note: Airflow worker must have read and write permissions to the registry file on GS / S3 since it pulls configuration and updates materialization history.
After we've defined our features and data sources in the repository, we can generate training datasets.
The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.
One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.
Then, training data can be retrieved as follows:
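(A sketch; the repo path and feature references are placeholders, and entity_df is assumed to be the label dataframe with entity keys and event timestamps.)

```python
from feast import FeatureStore

# feature_store.yaml in repo_path points at the shared registry.
store = FeatureStore(repo_path="production/")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
    ],
).to_df()
```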
The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the list of feature references also be saved alongside the model. This ensures that models and the features they are trained on are paired together when being shipped into production:
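A sketch of pairing the model artifact with its feature references (joblib and the file names are arbitrary choices here; model is your trained estimator):

```python
import joblib

feature_refs = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]

# Ship the model and the exact features it was trained on as one artifact.
joblib.dump({"model": model, "feature_refs": feature_refs}, "driver_model.bin")
```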
To test your model locally, you can simply create a FeatureStore object, fetch online features, and then make a prediction:
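(A sketch, continuing the artifact layout from above; entity values are hypothetical.)

```python
import joblib
from feast import FeatureStore

bundle = joblib.load("driver_model.bin")
store = FeatureStore(repo_path=".")

features = store.get_online_features(
    features=bundle["feature_refs"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

vector = [features["conv_rate"][0], features["acc_rate"][0]]
prediction = bundle["model"].predict([vector])
```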
It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.
Once you have successfully loaded (or in Feast terminology materialized) your data from batch sources into the online store, you can start consuming features for model inference. There are three approaches for that purpose sorted from the most simple one (in an operational sense) to the most performant (benchmarks to be published soon):
This approach is the most convenient to keep your infrastructure as minimalistic as possible and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.
To integrate online retrieval into your service use the following code:
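(A sketch of a retrieval helper inside a Python service; all names are hypothetical.)

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # or a path baked into the service image

def get_driver_features(driver_id: int) -> dict:
    # Reads the latest values directly from the online store.
    return store.get_online_features(
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
        ],
        entity_rows=[{"driver_id": driver_id}],
    ).to_dict()
```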
If you don't want to add the Feast Python SDK as a dependency, or your feature retrieval service is written in a non-Python language, Feast can deploy a simple feature server on serverless infrastructure (eg, AWS Lambda, Google Cloud Run) for you. This service will provide an HTTP API with JSON I/O, which can be easily used with any programming language.
For users with very latency-sensitive and high QPS use-cases, Feast offers a high-performance Java feature server. Besides the benefits of running on JVM, this implementation also provides a gRPC API, which guarantees good connection utilization and small request / response body size (compared to JSON). You will need the Feast Java SDK to retrieve features from this service. This SDK wraps all the gRPC logic for you and provides more convenient APIs.
The Java based feature server can be deployed to Kubernetes cluster via Helm charts in a few simple steps:
Add the Feast Helm repository and download the latest charts:
Run Helm Install
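A hedged sketch of the Helm commands; the chart repository URL, chart name, and values may differ across Feast versions:

```
helm repo add feast-charts https://feast-helm-charts.storage.googleapis.com
helm repo update
helm install feast-release feast-charts/feast \
  --set feature_store_yaml_base64=$(base64 < feature_store.yaml)
```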
This chart will deploy two services: feature-server and transformation-service. Both must have read access to the registry file on cloud storage. Both will keep a copy of the registry in their memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
The next step would be to install an L7 Load Balancer (e.g., Envoy) in front of the Java feature server. For seamless integration with Kubernetes (including services created by the Feast Helm chart) we recommend using Istio as Envoy's orchestrator.
Recently Feast added functionality for stream ingestion. Please note that this is still in an early phase and new incompatible changes may be introduced.
The default option to write features from a stream is to add the Python SDK into your existing PySpark / Beam pipeline. The Feast SDK provides a writer implementation that can be called from the foreachBatch stream writer in PySpark like this:
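(A hedged sketch; it assumes a FeatureStore.write_to_online_store method and an existing PySpark streaming dataframe, and all names are hypothetical.)

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

def write_to_feast(batch_df, batch_id):
    # Push each micro-batch into the online store for the given feature view.
    store.write_to_online_store("driver_hourly_stats", batch_df.toPandas())

query = (
    streaming_df.writeStream  # an existing PySpark streaming DataFrame
    .foreachBatch(write_to_feast)
    .start()
)
```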
Alternatively, if you want to ingest features directly from a broker (e.g., Kafka or Kinesis), you can use the "push service", which will write to the online store. This service will expose an HTTP API or, when deployed on serverless platforms like AWS Lambda or Google Cloud Run, can be directly connected to Kinesis or PubSub.
If you are using Kafka, an HTTP sink could be utilized as middleware. In this case, the "push service" can be deployed on Kubernetes or as a serverless function.
Feast services can report their metrics to a StatsD-compatible collector. To activate this function, you'll need to provide a StatsD IP address and a port when deploying the helm chart (in the future, this will be added to feature_store.yaml).
We use an InfluxDB-style extension of the StatsD format to be able to send tags along with metrics. Keep that in mind while selecting the collector (Telegraf, for example, will work).
We chose StatsD since it's a de-facto standard with various implementations, and metrics can be easily exported to Prometheus, InfluxDB, AWS CloudWatch, etc.
To summarize, we want to show several architecture options that are most frequently used in production:
The Feast SDK is triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast registry
Airflow manages materialization jobs to ingest data from DWH to the online store periodically
For the stream ingestion Feast Python SDK is used in the existing Spark / Beam pipeline
Online features are served via either a Python feature server or a high performance Java feature server
Both the Java feature server and the transformation server are deployed on a Kubernetes cluster (via Helm charts)
Feast Python SDK is called locally to generate a training dataset
Same as Option #1, except:
Push service is deployed as AWS Lambda / Google Cloud Run and is configured as a sink for Kinesis or PubSub to ingest features directly from a stream broker. Lambda / Cloud Run is managed by the Feast SDK (from the CI environment)
Materialization jobs are managed inside Kubernetes via Kubernetes Job (currently not managed by Helm)
Same as Option #2, except:
Push service is deployed on Kubernetes cluster and exposes an HTTP API that can be used as a sink for Kafka (via kafka-http connector) or accessed directly.
The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.
Online stores only maintain the current state of features, i.e., the latest feature values. No historical data is stored or served.
Please ensure that you have materialized (loaded) your feature values into the online store before starting.
Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.
Next, we will create a feature store object and call get_online_features(), which reads the relevant feature values directly from the online store:
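(A sketch of the full flow; feature references and entity values are hypothetical.)

```python
from feast import FeatureStore

# Feature references saved from the training step
features = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]

store = FeatureStore(repo_path=".")

online_response = store.get_online_features(
    features=features,
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
)
print(online_response.to_dict())
```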
Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.
Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.
The materialize command allows users to materialize features over a specific historical time range into the online store.
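A sketch using the time range referenced later on this page:

```
feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00
```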
The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.
It is also possible to materialize for specific feature views by using the -v / --views argument.
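For example (the feature view name is a placeholder):

```
feast materialize --views driver_hourly_stats 2021-04-07T00:00:00 2021-04-08T00:00:00
```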
The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
For simplicity, Feast also provides a materialize command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.
The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00):
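```
feast materialize-incremental 2021-04-08T00:00:00
```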
The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.
Unlike materialize, materialize-incremental automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental is executed, it will set the start time to the oldest timestamp of each data source, and the end time to the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.
Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e., different feature views may have different periods that have been materialized into the online store.
BigQuery data sources allow for the retrieval of historical feature values from BigQuery for building training datasets as well as materializing features into an online store.
Either a table reference or a SQL query can be provided.
No performance guarantees can be provided over SQL query-based sources. Please use table references where possible.
Using a table reference
Using a query
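Hedged sketches of both forms (project, dataset, and table names are placeholders; in older versions the argument is table_ref, in newer ones table):

```python
from feast import BigQuerySource

# Using a table reference
driver_stats = BigQuerySource(
    table_ref="my_project.my_dataset.driver_hourly_stats",
    event_timestamp_column="event_timestamp",
)

# Using a query
driver_stats_query = BigQuerySource(
    query="SELECT * FROM my_project.my_dataset.driver_hourly_stats",
    event_timestamp_column="event_timestamp",
)
```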
Configuration options are available here.
The File offline store provides support for reading FileSources.
Only Parquet files are currently supported.
All data is downloaded and joined using Python and may not scale to production workloads.
Configuration options are available here.
The Spark offline store is an offline store currently in alpha development that provides support for reading SparkSources.
This Spark offline store still does not achieve full test coverage and continues to fail some integration tests when integrating with the Feast universal test suite. Please do NOT assume complete stability of the API.
Spark tables and views are allowed as sources that are loaded in from some Spark store (e.g. in Hive or in memory).
Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be converted to a Spark dataframe and processed as a temporary view.
A SparkRetrievalJob is returned when calling get_historical_features(). This allows you to call:
to_df to retrieve the dataframe as a pandas dataframe
to_arrow to retrieve the dataframe as a pyarrow Table
to_spark_df to retrieve the dataframe as a Spark dataframe
NOTE: The Spark data source API is currently in alpha development and is not completely stable. The API may change or update in the future.
The Spark data source API allows for the retrieval of historical feature values from file/database sources for building training datasets as well as materializing features into an online store.
Either a table name, a SQL query, or a file path can be provided.
Using a table reference from the SparkSession (for example, either in memory or a Hive Metastore)
Using a query
Using a file reference
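Hedged sketches of the three forms (the import path and argument names vary across Feast versions; all names are placeholders):

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

# Using a table registered in the SparkSession
table_source = SparkSource(
    table="driver_hourly_stats",
    event_timestamp_column="event_timestamp",
)

# Using a query
query_source = SparkSource(
    query="SELECT * FROM driver_hourly_stats",
    event_timestamp_column="event_timestamp",
)

# Using a file reference
file_source = SparkSource(
    path="s3a://my-bucket/driver_hourly_stats",
    file_format="parquet",
    event_timestamp_column="event_timestamp",
)
```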
.feastignore is a file that is placed at the root of the feature repository. This file contains paths that should be ignored when running feast apply. An example .feastignore is shown below:
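(This example uses the patterns documented in the table below.)

```
# Ignore virtual environment
venv

# Ignore a specific Python file
scripts/foo.py

# Ignore all Python files directly under the scripts directory
scripts/*.py

# Ignore all "foo.py" files anywhere under the scripts directory
scripts/**/foo.py
```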
The .feastignore file is optional. If the file cannot be found, every Python file in the feature repo directory will be parsed by feast apply.
| Pattern | Example matches | Explanation |
| --- | --- | --- |
| venv | venv/foo.py, venv/a/foo.py | You can specify a path to a specific directory. Everything in that directory will be ignored. |
| scripts/foo.py | scripts/foo.py | You can specify a path to a specific file. Only that file will be ignored. |
| scripts/*.py | scripts/foo.py, scripts/bar.py | You can specify an asterisk (*) anywhere in the expression. An asterisk matches zero or more characters, except "/". |
| scripts/**/foo.py | scripts/foo.py, scripts/a/foo.py, scripts/a/b/foo.py | You can specify a double asterisk (**) anywhere in the expression. A double asterisk matches zero or more directories. |
The following top-level configuration options exist in the feature_store.yaml file.
provider — Configures the environment in which Feast will deploy and operate.
registry — Configures the location of the feature registry.
online_store — Configures the online store.
offline_store — Configures the offline store.
project — Defines a namespace for the entire feature store. Can be used to isolate multiple deployments in a single installation of Feast. Should only contain letters, numbers, and underscores.
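A sketch showing all five options together (values are placeholders):

```yaml
project: my_project
registry: data/registry.db
provider: local
online_store:
    type: sqlite
    path: data/online_store.db
offline_store:
    type: file
```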
Currently, this component only supports online serving and does not have an offline component, including APIs to create Feast feature repositories or apply configuration to the registry to facilitate online materialization. It also does not expose its own dedicated CLI to perform Feast actions. Furthermore, this component is only meant to expose an online serving API that can be called through the Python SDK to facilitate faster online feature retrieval.
As long as you are running macOS or Linux on x86 with Python 3.7-3.10, the Go component comes pre-compiled when you install Feast.
However, some additional dependencies are required for Go <-> Python interoperability. To install these dependencies, run the following command in your console:
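(A hedged sketch; the exact pip extra may differ by Feast version.)

```
pip install "feast[go]"
```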
For developers, if you want to build from source, run make compile-go-lib to build and compile the Go server.
To enable the Go online feature retrieval component, set go_feature_retrieval: True in your feature_store.yaml. This will direct all online feature retrieval to Go instead of Python. This flag will be enabled by default in the future.
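For example:

```yaml
go_feature_retrieval: True
```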
We also plan on adding support for the Java feature server (e.g. the capability to call into the Go component and execute Java UDFs).
feature_store.yaml is used to configure a feature store. The file must be located at the root of a feature repository. An example feature_store.yaml is shown below:
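(A sketch; the project name and connection details are placeholders.)

```yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: redis
    connection_string: localhost:6379
```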
Please see the API reference for the full list of configuration options.
The Go Feature Retrieval component is a Go implementation of the core feature serving logic, embedded in the Python SDK. It supports retrieval of feature references, feature services, and on demand feature views, and can be used either through the Python SDK or the Python feature server.
The Go Feature Retrieval component currently only supports Redis and Sqlite as online stores; support for other online stores will be added soon. Initial benchmarks indicate that it is significantly faster than the Python feature server for online feature retrieval. We plan to release a more comprehensive set of benchmarks; for more details, see the RFC.
Online feature logging for Data Quality Monitoring in the Go feature retrieval component is currently in development. More information can be found in the RFC.
The Feast project logs anonymous usage statistics and errors in order to inform our planning. Several client methods are tracked, beginning in Feast 0.9. Users are assigned a UUID which is sent along with the name of the method, the Feast version, the OS (using sys.platform), and the current time.
The source code is available here.
Set the environment variable FEAST_USAGE to False.
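For example:

```
export FEAST_USAGE=False
```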
For Feast maintainers, these are the concrete steps for making a new release.
For a new major or minor release, create and check out the release branch for the new stream, e.g., v0.6-branch. For a patch version, check out the stream's release branch.
Update the CHANGELOG.md (see the Creating a change log guide) and commit the changes. Make sure to review each PR in the changelog to flag any breaking changes and deprecation.
Update versions for the release/release candidate with a commit:
In the root pom.xml, remove -SNAPSHOT from the <revision> property, update versions, and commit.
Tag the commit with the release version, using the v and sdk/go/v prefixes:
For a release candidate, create tags vX.Y.Z-rc.N and sdk/go/vX.Y.Z-rc.N
For a stable release X.Y.Z, create tags vX.Y.Z and sdk/go/vX.Y.Z
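A sketch of the tagging commands for a stable release (X.Y.Z is a placeholder):

```
git tag vX.Y.Z
git tag sdk/go/vX.Y.Z
git push origin vX.Y.Z sdk/go/vX.Y.Z
```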
Check that versions are updated with make lint-versions.
If changes required are flagged by the version lint, make the changes, amend the commit and move the tag to the new commit.
Push the commits and tags. Make sure the CI passes.
If the CI does not pass, or if there are new patches for the release fix, repeat steps 2 & 3 with release candidates until a stable release is achieved.
Bump to the next patch version in the release branch, append -SNAPSHOT in pom.xml, and push.
Create a PR against master to:
Bump to the next major/minor version and append -SNAPSHOT.
Add the change log by applying the change log commit created in step 2.
Check that versions are updated with env TARGET_MERGE_BRANCH=master make lint-versions.
Create a GitHub release which includes a summary of important changes as well as any artifacts associated with the release. Make sure to include the same change log as added in CHANGELOG.md. Use Feast vX.Y.Z as the title.
When a tag that matches a Semantic Version string is pushed, CI will automatically build and push the relevant artifacts to their repositories or package managers (docker images, Python wheels, etc). JVM artifacts are promoted from Sonatype OSSRH to Maven Central, but it sometimes takes some time for them to be available. The sdk/go/v tag is required to version the Go SDK go module so that users can go get a specific tagged release of the Go SDK.
We use an open source change log generator to generate change logs. The process still requires a little bit of manual effort.
Create a GitHub token as per these instructions. The token is used as an input argument (-t) to the change log generator.
The change log generator configuration below will look for unreleased changes on a specific branch. The branch will be master for a major/minor release, or a release branch (e.g., v0.4-branch) for a patch release. You will need to set the branch using the --release-branch argument.
You should also set the --future-release argument. This is the version you are releasing. The version can still be changed at a later date.
Update the arguments below and run the command to generate the change log to the console.
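A hedged sketch of invoking the generator (assuming the dockerized github-changelog-generator; the token, branch, and version are placeholders):

```
docker run -it --rm ferrarimarco/github-changelog-generator \
  --user feast-dev \
  --project feast \
  -t <GITHUB_TOKEN> \
  --release-branch <RELEASE_BRANCH> \
  --future-release <VERSION>
```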
Review each change log item.
Make sure that sentences are grammatically correct and well formatted (although we will try to enforce this at the PR review stage).
Make sure that each item is categorised correctly. You will see the following categories: Breaking changes, Implemented enhancements, Fixed bugs, and Merged pull requests. Any unlabelled PRs will be found in Merged pull requests. It's important to make sure that any breaking changes, enhancements, or bug fixes are pulled up out of merged pull requests into the correct category. Housekeeping, tech debt clearing, infra changes, or refactoring do not count as enhancements. Only enhancements a user benefits from should be listed in that category.
Make sure that the "Full Change log" link is actually comparing the correct tags (normally your released version against the previously version).
Make sure that release notes and breaking changes are present.
It's important to flag breaking changes and deprecations to the API for each release so that we can maintain API compatibility.
Developers should have flagged PRs with breaking changes using the compat/breaking label. However, it's important to double-check each PR's release notes and contents for changes that will break API compatibility, and to manually add the compat/breaking label to PRs with undeclared breaking changes. The change log will have to be regenerated if any new labels have to be added.
We use RFCs and GitHub issues to communicate development ideas. The simplest way to contribute to Feast is to leave comments in our RFCs in the Feast Google Drive or our GitHub issues. You will need to join our Google Group in order to get access.
We follow a process of lazy consensus. If you believe you know what the project needs then just start development. If you are unsure about which direction to take with development then please communicate your ideas through a GitHub issue or through our Slack Channel before starting development.
Please submit a PR to the master branch of the Feast repository once you are ready to submit your contribution. Code submissions to Feast (including submissions from project maintainers) require review and approval from maintainers or code owners.
PRs that are submitted by the general public need to be identified as ok-to-test. Once enabled, Prow will run a range of tests to verify the submission, after which community members will help to review the pull request.
See also Community for other ways to get involved with the community (e.g., joining community calls).
Versioning policies and status of Feast components
Feast uses semantic versioning.
Contributors are encouraged to understand our branch workflow, described below, when choosing where to branch from while making a change (and thus the merge base for a pull request).
Major and minor releases are cut from the master branch.
Each major and minor release has a long-lived maintenance branch, e.g., v0.3-branch. This is called a "release branch".
From the release branch, pre-release release candidates are tagged, e.g., v0.3.0-rc.1.
From the release candidates, the stable patch version releases are tagged, e.g., v0.3.0.
A release branch should be substantially feature complete with respect to the intended release. Code that is committed to master may be merged or cherry-picked onto a release branch, but code that is directly committed to a release branch should be solely applicable to that release (and should not be committed back to master).
In general, unless you're committing code that only applies to a particular release stream (for example, temporary hot-fixes, back-ported security fixes, or image hashes), you should base changes on master and then merge or cherry-pick them to the release branch, as sketched below.
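A sketch of back-porting a fix that has already landed on master; the branch name and commit SHA are illustrative:

```bash
# Apply a fix from master onto a release branch
git checkout v0.3-branch
git cherry-pick <commit-sha-from-master>
git push origin v0.3-branch
```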
The following table shows the status (stable, beta, or alpha) of Feast components.
Application status indicators for Feast:
Stable means that the component has reached a sufficient level of stability and adoption that the Feast community has deemed the component stable. Please see the stability criteria below.
Beta means that the component is working towards a version 1.0 release. Beta does not mean a component is unstable; it simply means the component has not met the full criteria of stability.
Alpha means that the component is in the early phases of development and/or integration into Feast.
(Component status summary: most Feast components are currently in Beta, several newer components are in Alpha, and a few are at risk of, or scheduled for, deprecation. Beta APIs are considered stable and will not have breaking changes within 3 minor versions.)
Criteria for reaching stable status:
Contributors from at least two organizations
Complete end-to-end test suite
Scalability and load testing if applicable
Automated release process (docker images, PyPI packages, etc)
API reference documentation
No deprecative changes
Must include logging and monitoring
Criteria for reaching beta status:
Contributors from at least two organizations
End-to-end test suite
API reference documentation
Deprecative changes must span multiple minor versions and allow for an upgrade path.
Feast components have various levels of support based on the component status.
Stable: The Feast community offers best-effort support for stable applications. Stable components will be offered long-term support.
Beta: The Feast community offers best-effort support for beta applications. Beta applications will be supported for at least 2 more minor releases.
Alpha: The level of support differs per application in alpha status, depending on the size of the community for that application and the current level of active development of the application.
Feast has an active and helpful community of users and contributors.
The Feast community offers support on a best-effort basis for stable and beta applications. Best-effort support means that there’s no formal agreement or commitment to solve a problem but the community appreciates the importance of addressing the problem as soon as possible. The community commits to helping you diagnose and address the problem if all the following are true:
The cause falls within the technical framework that Feast controls. For example, the Feast community may not be able to help if the problem is caused by a specific network configuration within your organization.
Community members can reproduce the problem.
The reporter of the problem can help with further diagnosis and troubleshooting.
Please see the Community page for channels through which support can be requested.
Feast users can choose to retrieve features from a feature server, as opposed to through the Python SDK.
Feast 0.10 brought about major changes to the way Feast is architected and how the software is intended to be deployed, extended, and operated.
Please see Upgrading from Feast 0.9 for a guide on how to upgrade to the latest Feast version.
Feast contributors identified various design challenges in Feast 0.9 that made deploying, operating, extending, and maintaining it challenging. These challenges applied both to users and contributors. Our goal is to make ML practitioners immediately productive in operationalizing data for machine learning. To that end, Feast 0.10+ made the following improvements on Feast 0.9:
| Challenges in Feast 0.9 | Improvements in Feast 0.10+ |
| --- | --- |
| Hard to install because it was a heavy-weight system with many components requiring a lot of configuration | Easy to install via pip install; opinionated default configurations; no Helm charts necessary |
| Engineering support needed to deploy/operate reliably | Feast moves from a stack of services to a CLI/SDK; no need for Kubernetes or Spark; no long-running processes or orchestrators; leverages globally available managed services where possible |
| Hard to develop/debug with tightly coupled components, async operations, and hard-to-debug components like Spark | Easy to develop and debug; modular components; clear extension points; fewer background operations; faster feedback; local mode |
| Inability to benefit from cloud-native technologies because of a focus on reusable technologies like Kubernetes and Spark | Leverages best-in-class cloud technologies so users can enjoy scalable and powerful tech stacks without managing open source stacks themselves |
Where Feast 0.9 was a large stack of components that needed to be deployed to Kubernetes, Feast 0.10 is simply a lightweight SDK and CLI. It doesn’t need any long-running processes to operate. This SDK/CLI can deploy and configure your feature store to your infrastructure, and execute workflows like building training datasets or reading features from an online feature store.
Feast 0.10 introduces local mode: Local mode allows users to try out Feast in a completely local environment (without using any cloud technologies). This provides users with a responsive means of trying out the software before deploying it into a production environment.
Feast comes with opinionated defaults: As much as possible we are attempting to make Feast a batteries-included feature store that removes the need for users to configure infinite configuration options (as with Feast 0.9). Feast 0.10 comes with sane default configuration options to deploy Feast on your infrastructure.
Feast Core was replaced by a file-based (S3, GCS) registry: Feast Core is a metadata server that maintains and exposes an API of feature definitions. With Feast 0.10, we’ve moved this entire service into a single flat file that can be stored on either the local disk or in a central object store like S3 or GCS. The benefit of this change is that users don’t need to maintain a database and a registry service, yet they can still access all the metadata they had before.
Materialization is a CLI operation: Instead of having ingestion jobs be managed by a job service, users can now schedule a batch ingestion job themselves by calling “materialize”. This change was introduced because most teams already have schedulers like Airflow in their organization. By starting ingestion jobs from Airflow, teams are now able to easily track state outside of Feast and to debug failures synchronously. Similarly, streaming ingestion jobs can be launched through the “apply” command.
Doubling down on data warehouses: Most modern data teams are doubling down on data warehouses like BigQuery, Snowflake, and Redshift. Feast doubles down on these big data technologies as the primary interfaces through which it launches batch operations (like training dataset generation). This reduces the development burden on Feast contributors (since they only need to reason about SQL), provides users with a more responsive experience, avoids moving data from the warehouse (to compute joins using Spark), and provides a more serverless and scalable experience to users.
Temporary loss of streaming support: Unfortunately, Feast 0.10, 0.11, and 0.12 do not support streaming feature ingestion out of the box. It is entirely possible to launch streaming ingestion jobs using these Feast versions, but it requires the use of a Feast extension point to launch these ingestion jobs. It is still a core design goal for Feast to support streaming ingestion, so this change is in the development backlog for the Feast project.
**Addition of extension points:** Feast 0.10+ introduces various extension points. Teams can override all feature store behavior by writing (or extending) a provider. It is also possible for teams to add their own data storage connectors for both an offline and online store using a plugin interface that Feast provides.
| | Feast 0.9 | Feast 0.10+ |
| --- | --- | --- |
| Architecture | Service-oriented architecture; containers and services deployed to Kubernetes | SDK/CLI-centric software; Feast is able to deploy or configure infrastructure for use as a feature store |
| Installation | Terraform and Helm | pip to install SDK/CLI; provider used to deploy Feast components to GCP, AWS, or other environments during apply |
| Required infrastructure | Kubernetes, Postgres, Spark, Docker, object store | None |
| Batch compute | Yes (Spark based) | Python native (client-side) for batch data loading; data warehouse for batch compute |
| Streaming support | Yes (Spark based) | Planned; streaming jobs will be launched using apply |
| Offline store | None (can source data from any source Spark supports) | BigQuery, Snowflake (planned), Redshift, or custom implementations |
| Online store | Redis | DynamoDB, Firestore, Redis, and more planned |
| Job manager | Yes | No |
| Registry | gRPC service with Postgres backend | File-based registry with accompanying SDK for exploration |
| Local mode | No | Yes |
Please see the Feast 0.9 Upgrade Guide.
The AWS Lambda feature server is an HTTP endpoint that serves features with JSON I/O, deployed as a Docker image through AWS Lambda and AWS API Gateway. This enables users to get features from Feast using any programming language that can make HTTP requests. A remote feature server on GCP Cloud Run is currently being developed.
Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warning in future releases.
To enable this feature, run feast alpha enable aws_lambda_feature_server.
The AWS Lambda feature server is only available to projects using the AwsProvider with registries on S3. It is disabled by default. To enable it, feature_store.yaml must be modified; specifically, the enable flag must be on and an execution_role_name must be specified. For example, after running feast init -t aws, changing the registry to be on S3, and enabling the feature server, the contents of feature_store.yaml should look similar to the following:
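A sketch of what the resulting configuration could look like; the bucket, region, account ID, and role names are illustrative placeholders, and the feature_server field names reflect our best understanding and may differ across Feast versions:

```yaml
project: dev
registry: s3://<your-bucket>/registry.pb
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: <your-cluster>
  database: <your-database>
  user: admin
  s3_staging_location: s3://<your-bucket>/staging
  iam_role: arn:aws:iam::<account_id>:role/redshift_s3_access
feature_server:
  enabled: True
  execution_role_name: arn:aws:iam::<account_id>:role/lambda_execution_role
```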
If enabled, the feature server will be deployed during feast apply. After it is deployed, the feast endpoint CLI command will indicate the server's endpoint.
Feast requires the following permissions in order to deploy and tear down the AWS Lambda feature server:
| Permissions | Resources |
| --- | --- |
| lambda:CreateFunction, lambda:GetFunction, lambda:DeleteFunction, lambda:AddPermission, lambda:UpdateFunctionConfiguration | arn:aws:lambda:<region>:<account_id>:function:feast-* |
| ecr:CreateRepository, ecr:DescribeRepositories, ecr:DeleteRepository, ecr:PutImage, ecr:DescribeImages, ecr:BatchDeleteImage, ecr:CompleteLayerUpload, ecr:UploadLayerPart, ecr:InitiateLayerUpload, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:GetRepositoryPolicy, ecr:SetRepositoryPolicy, ecr:GetAuthorizationToken | * |
| iam:PassRole | arn:aws:iam::<account_id>:role/ |
| apigateway:* | arn:aws:apigateway:*::/apis/*/routes/*/routeresponses, arn:aws:apigateway:*::/apis/*/routes/*/routeresponses/*, arn:aws:apigateway:*::/apis/*/routes/*, arn:aws:apigateway:*::/apis/*/routes, arn:aws:apigateway:*::/apis/*/integrations, arn:aws:apigateway:*::/apis/*/stages/*/routesettings/*, arn:aws:apigateway:*::/apis/*, arn:aws:apigateway:*::/apis |
The following inline policy can be used to grant Feast the necessary permissions:
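A sketch of such a policy, assembled directly from the permissions table above; the region and account ID are placeholders, and the iam:PassRole resource should be scoped down to your actual execution role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FeastLambda",
      "Effect": "Allow",
      "Action": [
        "lambda:CreateFunction",
        "lambda:GetFunction",
        "lambda:DeleteFunction",
        "lambda:AddPermission",
        "lambda:UpdateFunctionConfiguration"
      ],
      "Resource": "arn:aws:lambda:<region>:<account_id>:function:feast-*"
    },
    {
      "Sid": "FeastEcr",
      "Effect": "Allow",
      "Action": [
        "ecr:CreateRepository",
        "ecr:DescribeRepositories",
        "ecr:DeleteRepository",
        "ecr:PutImage",
        "ecr:DescribeImages",
        "ecr:BatchDeleteImage",
        "ecr:CompleteLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:InitiateLayerUpload",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:SetRepositoryPolicy",
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "FeastPassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account_id>:role/*"
    },
    {
      "Sid": "FeastApiGateway",
      "Effect": "Allow",
      "Action": "apigateway:*",
      "Resource": [
        "arn:aws:apigateway:*::/apis/*/routes/*/routeresponses",
        "arn:aws:apigateway:*::/apis/*/routes/*/routeresponses/*",
        "arn:aws:apigateway:*::/apis/*/routes/*",
        "arn:aws:apigateway:*::/apis/*/routes",
        "arn:aws:apigateway:*::/apis/*/integrations",
        "arn:aws:apigateway:*::/apis/*/stages/*/routesettings/*",
        "arn:aws:apigateway:*::/apis/*",
        "arn:aws:apigateway:*::/apis"
      ]
    }
  ]
}
```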
After feature_store.yaml has been modified as described in the previous section, it can be deployed as follows:
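With the configuration in place, deployment is a single CLI invocation (as noted above, the feature server is deployed as part of apply):

```bash
feast apply
```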
After the feature server starts, we can execute cURL commands against it:
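An illustrative request; the endpoint URL, feature reference, and entity values are placeholders, so substitute the endpoint reported by feast endpoint and your own feature names:

```bash
curl -X POST \
  "https://<your-endpoint>/get-online-features" \
  -d '{
    "features": ["driver_hourly_stats:trips_today"],
    "entities": {"driver_id": [1001]}
  }'
```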
The data source refers to raw underlying data (e.g. a table in BigQuery).
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store.
Below is an example data source with a single entity (driver) and two features (trips_today and rating).
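An illustrative rendering of such a table (the values are made up for the example):

| event_timestamp | driver | trips_today | rating |
| --- | --- | --- | --- |
| 2021-04-12 08:00:00 | 1001 | 5 | 4.8 |
| 2021-04-12 08:00:00 | 1002 | 2 | 4.5 |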
Feast users use Feast to manage two important sets of configuration:
Configuration about how to run Feast on your infrastructure
Feature definitions
With Feast, the above configuration can be written declaratively and stored as code in a central location. This central location is called a feature repository. The feature repository is the declarative source of truth for what the desired state of a feature store should be.
The Feast CLI uses the feature repository to configure, deploy, and manage your feature store.
A feature repository consists of:
A collection of Python files containing feature declarations.
A feature_store.yaml file containing infrastructural configuration.
A .feastignore file containing paths in the feature repository to ignore.
Typically, users store their feature repositories in a Git repository, especially when working in teams. However, using Git is not a requirement.
The structure of a feature repository is as follows:
The root of the repository should contain a feature_store.yaml file and may contain a .feastignore file.
The repository should contain Python files that contain feature definitions.
The repository can contain other files as well, including documentation and potentially data files.
An example structure of a feature repository is shown below:
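A minimal sketch (the file names are illustrative):

```
feature_repo/
├── feature_store.yaml
├── .feastignore
└── driver_features.py
```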
A couple of things to note about the feature repository:
Feast reads all Python files recursively when feast apply is run, including subdirectories, even if they don't contain feature definitions.
It's recommended to add a .feastignore file and list the paths of any imperative scripts that you need to store inside the feature repository.
The configuration for a feature store is stored in a file named feature_store.yaml, which must be located at the root of a feature repository. An example feature_store.yaml file is shown below:
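A minimal sketch of a local configuration; the project name and paths are illustrative:

```yaml
project: driver_features
registry: data/registry.db
provider: local
online_store:
  type: sqlite
  path: data/online_store.db
```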
The feature_store.yaml file configures how the feature store should run. See feature_store.yaml for more details.
This file contains paths that should be ignored when running feast apply. An example .feastignore file is shown below:
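A sketch with one path or glob pattern per line; the names are illustrative:

```
venv
scripts/load_data.py
scripts/*.py
```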
See .feastignore for more details.
A feature repository can also contain one or more Python files that contain feature definitions. An example feature definition file is shown below:
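A minimal sketch of such a file, using the Feast 0.10-era Python API; the file paths, names, and TTL are illustrative:

```python
# driver_features.py: an illustrative feature definition file.
from google.protobuf.duration_pb2 import Duration
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# The raw underlying data: a parquet file of driver rows.
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)

# The entity on which the features are keyed.
driver = Entity(name="driver_id", value_type=ValueType.INT64, description="Driver ID")

# A feature view groups features that share an entity and a data source.
driver_stats_view = FeatureView(
    name="driver_stats",
    entities=["driver_id"],
    ttl=Duration(seconds=86400),
    features=[
        Feature(name="trips_today", dtype=ValueType.INT64),
        Feature(name="rating", dtype=ValueType.FLOAT),
    ],
    batch_source=driver_stats_source,
)
```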
To declare new feature definitions, just add code to the feature repository, either in existing files or in a new file. For more information on how to define features, see Feature Views.
See Create a feature repository to get started with an example feature repository.
See feature_store.yaml, .feastignore, or Feature Views for more information on the configuration files that live in a feature repository.
Feature values in Feast are modeled as time-series records. Below is an example of a driver feature view with two feature columns (trips_today and earnings_today):
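(Illustrative values.)

| event_timestamp | driver_id | trips_today | earnings_today |
| --- | --- | --- | --- |
| 2021-04-12 08:00:00 | 1001 | 5 | 67.25 |
| 2021-04-12 09:00:00 | 1001 | 6 | 81.40 |
| 2021-04-12 08:00:00 | 1002 | 2 | 25.10 |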
The above table can be registered with Feast through the following feature view:
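A sketch using the Feast 0.10-era Python API; the source path and TTL are illustrative:

```python
from google.protobuf.duration_pb2 import Duration
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# The raw underlying table shown above, stored as a parquet file.
driver_stats_source = FileSource(
    path="data/driver_hourly_stats.parquet",
    event_timestamp_column="event_timestamp",
)

driver = Entity(name="driver_id", value_type=ValueType.INT64)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=Duration(seconds=86400),  # how far back a join may scan from each timestamp
    features=[
        Feature(name="trips_today", dtype=ValueType.INT64),
        Feature(name="earnings_today", dtype=ValueType.FLOAT),
    ],
    batch_source=driver_stats_source,
)
```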
Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.
Imagine a user would like to join the above driver_hourly_stats feature view onto an entity dataframe that has event_timestamp, driver_id, and trip_success columns, while preserving the trip_success column.
The timestamps within the entity dataframe above are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:
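A sketch of the retrieval call; the repo path, timestamps, and feature references are illustrative:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# The entity dataframe: the rows and timestamps to reproduce.
entity_df = pd.DataFrame(
    {
        "event_timestamp": pd.to_datetime(
            ["2021-04-12 08:30:00", "2021-04-12 10:00:00"]
        ),
        "driver_id": [1001, 1002],
        "trip_success": [1, 0],
    }
)

# Point-in-time correct join of features onto the entity dataframe.
# Note: in older Feast releases this argument was named feature_refs.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:trips_today",
        "driver_hourly_stats:earnings_today",
    ],
).to_df()
```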
For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time.
Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).
The resulting joined training dataframe contains both the original entity rows and the joined feature values.
Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.
The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).
Projects are currently being supported for backward compatibility reasons. Projects may change in the future as we simplify the Feast API.
This guide is targeted at developers looking to contribute to Feast:
Our preference is the use of git rebase instead of git merge: git pull -r.
Commits have to be signed before they are allowed to be merged into the Feast codebase:
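A sketch of a signed commit (the message is illustrative); the -s flag appends the Signed-off-by trailer that the sign-off check looks for:

```bash
git commit -s -m "Fix feature view TTL handling"
```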
Fill in the description based on the default template configured when you first open the PR
What this PR does/why we need it
Which issue(s) this PR fixes
Does this PR introduce a user-facing change
Include a kind label when opening the PR.
Add WIP: to the PR name if more work needs to be done prior to review.
Avoid force-pushing as it makes reviewing difficult.
Managing CI-test failures
GitHub runner tests
Click the checks tab to analyse failed tests.
Prow tests: visit the Prow dashboard to analyse failed tests.
Feast data storage contracts are documented in the following locations:
Feast Offline Storage Format: used by BigQuery, Snowflake (Future), Redshift (Future).
Feast Online Storage Format: used by Redis, Google Datastore.
The Feast Protobuf API defines the common API used by Feast's components. Specifications are written in proto3 and live in the Main Feast Repository, and changes to the API should be proposed via a GitHub issue for discussion first. The language-specific bindings have to be regenerated when changes are made to the Feast Protobuf API; see the per-language commands later in this guide.
Slack: Need to speak to a human? Come ask a question in our Slack channel.
We have a user and contributor community call every two weeks (Asia & US friendly), Tuesday 10:00 am to 10:30 am PST. Please join the Feast user groups listed below in order to see calendar invites to the community calls.
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
Entities should be reused across feature views.
A related concept is an entity key. These are one or more entity values that uniquely describe a feature view record. In the case of an entity (like a driver) that only has a single entity field, the entity is an entity key. However, it is also possible for an entity key to consist of multiple entity values. For example, a feature view with the composite entity of (customer, country) might have an entity key of (1001, 5).
Entity keys act as primary keys. They are used during the lookup of features from the online store, and they are also used to match feature rows across feature views during point-in-time joins.
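A sketch of an entity definition; the names are illustrative:

```python
from feast import Entity, ValueType

driver = Entity(
    name="driver",         # referenced from feature view definitions
    join_key="driver_id",  # physical primary key used for storage and joins
    value_type=ValueType.INT64,
)
```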
Speak to us: Have a question, feature request, idea, or just looking to speak to a real person? Set up a meeting with a Feast maintainer over Calendly!
Slack: Feel free to ask questions or say hello!
Mailing lists: We have both a user and a developer mailing list.
Feast users should join the feast-discuss group.
Feast developers should join the feast-dev group.
People interested in the Feast community newsletter should join feast-announce.
Community calendar: Includes community calls and design meetings.
Feast Google Drive: This folder is used as a central repository for all Feast resources, for example: design proposals in the form of Requests for Comments (RFCs); user surveys and meeting minutes; slide decks of conferences our contributors have spoken at.
GitHub: Find the complete Feast codebase on GitHub.
Feast LFAI Wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.
GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.
StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to StackOverflow questions tagged with Feast.
Entities are typically defined as part of feature views. The entity name is used to reference the entity from a feature view definition, and the join key is used to identify the physical primary key on which feature values should be stored and retrieved. These keys are used during the lookup of feature values from the online store and the join process in point-in-time joins. It is possible to define composite entities (more than one entity object) in a feature view, and it is also possible for feature views to have zero entities. See the Feature Views reference for more details.
Feast is composed of components distributed across multiple repositories; see also the CONTRIBUTING.md in the corresponding GitHub repository:
Main Feast repository: hosts all required code to run Feast. This includes the Feast Python SDK and Protobuf definitions. For legacy reasons, this repository still contains Terraform config and a Go Client for Feast.
Python SDK / CLI
Protobuf APIs
Documentation
Go Client
Terraform
Feast Java repository: Java-specific Feast components. Includes the Feast Core registry, Feast Serving for serving online feature values, and the Feast Java Client for retrieving feature values.
Core
Serving
Java Client
Feast Spark repository: the Feast Spark SDK and the Feast Job Service, for launching ingestion jobs and for building training datasets with Spark.
Spark SDK
Job Service
Feast Helm charts repository: a Helm chart for deploying Feast on Kubernetes & Spark.
Helm Chart
To regenerate the language-specific bindings after a change to the Feast Protobuf API:
Python: run make compile-protos-python to generate bindings.
Golang: run make compile-protos-go to generate bindings.
Java: no action required; bindings are generated automatically during compilation.
Feast uses offline stores as storage and compute systems. Offline stores store historic time-series feature values. Feast does not generate these features, but instead uses the offline store as the interface for querying existing features in your organization.
Offline stores are used primarily for two reasons:
Building training datasets from time-series features.
Materializing (loading) features from the offline store into an online store in order to serve those features at low latency for prediction.
Offline stores are configured through the feature_store.yaml. When building training datasets or materializing features into an online store, Feast will use the configured offline store along with the data sources you have defined as part of feature views to execute the necessary data operations.
It is not possible to query all data sources from all offline stores, and only a single offline store can be used at a time. For example, it is not possible to query a BigQuery table from a File offline store, nor is it possible for a BigQuery offline store to query files from your local file system.
Please see the Offline Stores reference for more details on configuring offline stores.
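As a sketch, the offline store is selected in feature_store.yaml like this; the project name, provider, and dataset are illustrative:

```yaml
project: driver_features
registry: data/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_driver_features
```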
Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
Feast Apply: The user (or CI) publishes version-controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.
Feast Materialize: The user (or scheduler) executes feast materialize, which loads features from the offline store into the online store.
Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset and trains a model.
Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
Prediction: A backend system makes a request for a prediction from the model serving service.
Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
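As a sketch, the two Feast-driven steps above correspond to CLI invocations like the following (the materialization time range is illustrative):

```bash
# Register feature definitions and update infrastructure
feast apply

# Load feature values for the given time range into the online store
feast materialize 2021-04-01T00:00:00 2021-04-12T00:00:00
```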
A complete Feast deployment contains the following components:
Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
Feast Python SDK/CLI: The primary user facing SDK. Used to:
Manage version controlled feature definitions.
Materialize (load) feature values into the online store.
Build and retrieve training datasets from the offline store.
Retrieve online features.
Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. Feast does not manage the offline store directly, but runs queries against it.
Java and Go Clients are also available for online feature retrieval.
Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is populated by materialization jobs and from stream ingestion.
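As a sketch, online retrieval through the Python SDK looks like the following; the feature reference and entity values are illustrative, and in older Feast releases the features argument was named feature_refs:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Read the latest feature values for one driver from the online store.
feature_vector = store.get_online_features(
    features=["driver_hourly_stats:trips_today"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```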