The list below contains the functionality that contributors are planning to develop for Feast.
Items below that are in development (or planned for development) will be indicated in parentheses.
We welcome contributions to all items in the roadmap!
Have questions about the roadmap? Ask in the #feast-development Slack channel.
Data Sources
Offline Stores
Online Stores
Feature Engineering
Streaming
Deployments
Feature Serving
Data Quality Management (See RFC)
Feature Discovery and Governance
A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Feature views consist of zero or more entities, one or more features, and a data source. Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. If the features are not related to a specific object, the feature view might not have entities; see feature views without entities below.
Feature views are used during
The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from stream sources.
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.
Feast does not generate feature values. It acts as the ingestion and serving system. The data sources described within feature views should reference feature values in their already computed form.
If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only event timestamps are needed for this feature view).
If the features parameter is not specified in the feature view creation, Feast will infer the features during feast apply by creating a feature for each column in the underlying data source except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
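The inference rule described above can be sketched in plain Python. This is only an illustration of the rule, not Feast's implementation: the function name infer_features and the column names and types are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def infer_features(
    columns: Dict[str, str],
    entity_columns: Iterable[str],
    timestamp_columns: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (name, dtype) for every column that is neither an entity nor a timestamp column."""
    excluded = set(entity_columns) | set(timestamp_columns)
    return [(name, dtype) for name, dtype in columns.items() if name not in excluded]


# Hypothetical source schema: column name -> column data type.
source_columns = {
    "driver_id": "int64",
    "event_timestamp": "timestamp",
    "created": "timestamp",
    "conv_rate": "float32",
    "acc_rate": "float32",
}

inferred = infer_features(
    source_columns,
    entity_columns=["driver_id"],
    timestamp_columns=["event_timestamp", "created"],
)
print(inferred)
```

Each inferred feature keeps the name and data type of the column it came from, which mirrors the behaviour described in the paragraph above.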
"Entity aliases" can be specified to join entity_dataframe
columns that do not match the column names in the source table of a FeatureView.
This could be used if a user has no control over these column names or if multiple entities are subclasses of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity.
We suggest dynamically specifying the new FeatureView name using .with_name and the join_key_map override using .with_join_key_map, instead of registering each new copy.
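To make the aliasing idea concrete, here is a minimal sketch in plain Python that does not call Feast. It assumes the join key map goes from the source join key (location_id) to the aliased column in the entity dataframe (origin_id, destination_id). The table contents, the helper resolve_join_keys, and the output names are all hypothetical.

```python
from typing import Any, Dict


def resolve_join_keys(entity_row: Dict[str, Any], join_key_map: Dict[str, str]) -> Dict[str, Any]:
    """Translate aliased entity columns back to the join keys used by the source table."""
    return {source_key: entity_row[alias] for source_key, alias in join_key_map.items()}


# Hypothetical "location" feature table, keyed by location_id.
location_stats = {1: {"temperature": 21}, 2: {"temperature": 30}}

# One entity row that refers to the same entity type twice, under two aliases.
trip = {"origin_id": 1, "destination_id": 2}

origin_key = resolve_join_keys(trip, {"location_id": "origin_id"})
destination_key = resolve_join_keys(trip, {"location_id": "destination_id"})

# The same feature is read twice, once per alias, under distinct output names.
features = {
    "origin__temperature": location_stats[origin_key["location_id"]]["temperature"],
    "destination__temperature": location_stats[destination_key["location_id"]]["temperature"],
}
print(features)
```

The same feature table is joined twice against one entity row, which is the situation that a renamed copy of a feature view with its own join key map is meant to handle.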
A feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made in an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month.
Features are defined as part of feature views. Since Feast does not transform data, a feature is essentially a schema that only contains a name and a type:
Together with data sources, they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using feature references.
Feature names must be unique within a feature view.
On demand feature views allow users to use existing features and request time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both historical retrieval and online retrieval paths:
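The key property is that one piece of transformation logic serves both retrieval paths. The sketch below illustrates that idea in plain Python without using Feast's on demand feature view API. The function add_request_value and the names conv_rate, val_to_add, and conv_rate_plus_val are hypothetical.

```python
from typing import Dict, List, Tuple


def add_request_value(features: Dict[str, float], request: Dict[str, float]) -> Dict[str, float]:
    """Combine a stored feature with a value that is only known at request time."""
    return {"conv_rate_plus_val": features["conv_rate"] + request["val_to_add"]}


# Historical path: the transformation is applied row by row to a batch.
historical_rows: List[Tuple[Dict[str, float], Dict[str, float]]] = [
    ({"conv_rate": 0.5}, {"val_to_add": 1.0}),
    ({"conv_rate": 0.25}, {"val_to_add": 2.0}),
]
historical = [add_request_value(f, r) for f, r in historical_rows]

# Online path: the very same function is applied to a single request.
online = add_request_value({"conv_rate": 0.5}, {"val_to_add": 1.0})

print(historical)
print(online)
```

Because both paths call the same function, a value computed during training matches the value computed at serving time for the same inputs.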
Feast (Feature Store) is a customizable operational data system that re-uses existing infrastructure to manage and serve machine learning features to realtime models.
Feast allows ML platform teams to:
Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (for serving pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
Note: Feast today primarily addresses timestamped structured data.
Feast helps ML platform teams with DevOps experience productionize real-time models. Feast can also help these teams build towards a feature platform that improves collaboration between engineers and data scientists.
Feast is likely not the right tool if you
are in an organization that’s just getting started with ML and is not yet sure what the business impact of ML is
rely primarily on unstructured data
need very low latency feature retrieval (e.g. p99 feature retrieval << 10ms)
have a small team to support a large number of use cases
Feast is not:
a data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and integrations with tools like Airflow to make features consistently available.
a data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
a database: Feast is not a database, but helps manage data stored in other systems (e.g. BigQuery, Snowflake, DynamoDB, Redis) to make features consistently available at training / serving time
Feast does not fully solve:
batch + streaming feature engineering: Feast primarily processes already transformed feature values (though it offers experimental light-weight transformations). Users usually integrate Feast with upstream systems (e.g. existing ETL/ELT pipelines). Tecton is a more fully featured feature platform which addresses these needs.
native streaming feature integration: Feast enables users to push streaming features, but does not pull from streaming sources or manage streaming pipelines. Tecton is a more fully featured feature platform which orchestrates end to end streaming pipelines.
feature sharing: Feast has experimental functionality to enable discovery and cataloguing of feature metadata with a Feast web UI (alpha). Feast also has community contributed plugins with DataHub and Amundsen. Tecton also more robustly addresses these needs.
lineage: Feast helps tie feature values to model versions, but is not a complete solution for capturing end-to-end lineage from raw data sources to model versions. Feast also has community contributed plugins with DataHub and Amundsen. Tecton captures more end-to-end lineage by also managing feature transformations.
data quality / drift detection: Feast has experimental integrations with Great Expectations, but is not purpose built to solve data drift / data quality issues. This requires more sophisticated monitoring across data pipelines, served feature values, labels, and model versions.
Many companies have used Feast to power real-world ML use cases such as:
Personalizing online recommendations by leveraging pre-computed historical user or item features.
Online fraud detection, using features that compare against (pre-computed) historical transaction patterns
Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch
Credit scoring, using pre-computed historical features to compute probability of default
The best way to learn Feast is to use it. Head over to our Quickstart and try it out!
Explore the following resources to get started with Feast:
Quickstart is the fastest way to get started with Feast
Concepts describes all important Feast API concepts
Architecture describes Feast's overall architecture.
Tutorials shows full examples of using Feast in machine learning applications.
Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.
Reference contains detailed API and design documents.
Contributing contains resources for anyone who wants to contribute to Feast.
A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.
Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.
Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.
A feature service is an object that represents a logical group of features from one or more feature views. Feature services allow features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.
Feature services are used during
The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now())
Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.
Applying a feature service does not result in an actual service being deployed.
Feature services enable referencing all or some features from a feature view.
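As a rough mental model, a feature service can be thought of as a named selection that expands into feature references. The plain-Python sketch below illustrates "all or some features from a feature view". It is not the Feast FeatureService API, and the helper expand_service and the feature names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical registry: feature view name -> the features it defines.
feature_views: Dict[str, List[str]] = {
    "driver_hourly_stats": ["conv_rate", "acc_rate", "avg_daily_trips"],
}


def expand_service(
    selection: Dict[str, Optional[List[str]]],
    views: Dict[str, List[str]],
) -> List[str]:
    """Expand a selection into feature references; None means every feature of the view."""
    refs: List[str] = []
    for view_name, wanted in selection.items():
        available = views[view_name]
        for feature in (available if wanted is None else wanted):
            if feature not in available:
                raise KeyError(f"{view_name} has no feature named {feature}")
            refs.append(f"{view_name}:{feature}")
    return refs


# A service that uses the whole view, and one that uses only part of it.
all_features = expand_service({"driver_hourly_stats": None}, feature_views)
some_features = expand_service({"driver_hourly_stats": ["conv_rate"]}, feature_views)
print(all_features)
print(some_features)
```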
Retrieving from the online store with a feature service
Retrieving from the offline store with a feature service
This mechanism of retrieving features is only recommended as you're experimenting. Once you want to launch experiments or serve models, feature services are recommended.
Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>
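The string form is simple enough to split by hand. The helper below is a hypothetical sketch, not part of the Feast SDK, that separates a reference into its feature view and feature parts and rejects malformed input.

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    feature_view: str
    feature: str


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split '<feature_view>:<feature>' into its two parts."""
    feature_view, separator, feature = ref.partition(":")
    if not separator or not feature_view or not feature or ":" in feature:
        raise ValueError(f"expected '<feature_view>:<feature>', got {ref!r}")
    return FeatureRef(feature_view, feature)


ref = parse_feature_ref("driver_trips:average_daily_rides")
print(ref.feature_view, ref.feature)
```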
Feature references are used for the retrieval of features from Feast:
It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.
Note: if you're using feature views without entities, those features can be added here without additional entity values in the entity_rows.
The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.
Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.
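A point-in-time lookup can be sketched with the standard library alone. The function below is an illustration of the idea, not Feast's join logic: for one entity row, it picks the most recent feature row whose event timestamp is not later than the entity row's timestamp. The row layout and values are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def point_in_time_lookup(
    feature_rows: List[Dict[str, Any]],
    driver_id: int,
    as_of: datetime,
) -> Optional[Dict[str, Any]]:
    """Return the latest row for driver_id with event_timestamp <= as_of, or None."""
    candidates = [
        row
        for row in feature_rows
        if row["driver_id"] == driver_id and row["event_timestamp"] <= as_of
    ]
    return max(candidates, key=lambda row: row["event_timestamp"], default=None)


feature_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 1, 10), "average_daily_rides": 12},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 1, 12), "average_daily_rides": 15},
]

# An entity row stamped 11:00 must see the 10:00 value, never the later 12:00 one.
row = point_in_time_lookup(feature_rows, 1001, datetime(2021, 4, 1, 11))
print(row["average_daily_rides"])
```

Ignoring rows with later timestamps is what keeps future feature values from leaking into a training dataset.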
Feast uses offline stores as storage and compute systems. Offline stores store historic time-series feature values. Feast does not generate these features, but instead uses the offline store as the interface for querying existing features in your organization.
Offline stores are used primarily for two reasons:
Building training datasets from time-series features.
Materializing (loading) features from the offline store into an online store in order to serve those features at low latency for prediction.
Offline stores are configured through the feature_store.yaml. When building training datasets or materializing features into an online store, Feast will use the configured offline store along with the data sources you have defined as part of feature views to execute the necessary data operations.
It is not possible to query all data sources from all offline stores, and only a single offline store can be used at a time. For example, it is not possible to query a BigQuery table from a File offline store, nor is it possible for a BigQuery offline store to query files from your local file system.
Please see the Offline Stores reference for more details on configuring offline stores.
Please see the Push Source reference for how to push features directly to the offline store in your feature store.
In this tutorial we will
Deploy a local feature store with a Parquet file offline store and SQLite online store.
Build a training dataset using our time series features from our Parquet files.
Materialize feature values from the offline store into the online store.
Read the latest features from the online store for inference.
You can run this tutorial in Google Colab or run it on your localhost, following the guided steps below.
In this tutorial, we use feature stores to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:
Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Feast alerts users to offline / online skew with data quality monitoring
Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.
Feature reusability and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).
Feast enables feature transformation so users can re-use transformation logic across online / offline use cases and across models.
Install the Feast SDK and CLI using pip:
In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see Running Feast with Snowflake/GCP/AWS
Bootstrap a new feature repository using feast init from the command line.
Let's take a look at the resulting demo repo itself. It breaks down into:
data/ contains raw demo parquet data
example.py contains demo feature definitions
feature_store.yaml contains a demo setup configuring where data sources are
The key line defining the overall architecture of the feature store is the provider. This defines where the raw data exists (for generating training data & feature values for serving), and where to materialize feature values to in the online store (for serving).
Valid values for provider in feature_store.yaml are:
local: use file source with SQLite/Redis
gcp: use BigQuery/Snowflake with Google Cloud Datastore/Redis
aws: use Redshift/Snowflake with DynamoDB/Redis
Note that there are many other sources Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported data sources.
A custom setup can also be made by following the Adding a custom provider guide.
The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.
The apply command scans Python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example.py (shown again below for convenience) and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by using the local provider in feature_store.yaml.
To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values).
The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation. In many cases, Feast will also intelligently join relevant tables to create the relevant feature vectors.
Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).
At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.
You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below. More information can be found here.
The driver_activity feature service pulls all features from the driver_hourly_stats feature view:
View all registered features, data sources, entities, and feature services with the Web UI.
One of the ways to view this is with the feast ui command.
Read the Concepts page to understand the Feast data model.
Read the Architecture page.
Check out our Tutorials section for more examples on how to use Feast.
Follow our Running Feast with Snowflake/GCP/AWS guide for a more in-depth tutorial on using Feast.
Join other Feast users and contributors in Slack and become part of the community!
This workshop aims to teach users about Feast.
We explain concepts & best practices by example, and also showcase how to address common use cases.
This workshop assumes you have the following installed:
A local development environment that supports running Jupyter notebooks (e.g. VSCode with Jupyter plugin)
Python 3.7+
Java 11 (for Spark, e.g. brew install java11)
pip
Docker & Docker Compose (e.g. brew install docker docker-compose)
AWS CLI
Since we'll be learning how to leverage Feast in CI/CD, you'll also need to fork this workshop repository.
These are meant mostly to be done in order, with examples building on previous concepts.
See https://github.com/feast-dev/feast-workshop
Don't see your question?
Feast expects that each version of a model corresponds to a different feature service.
Once feature views are used by a feature service, they are intended to be immutable and should not be deleted (until the feature service is removed). In the future, feast plan and feast apply will throw errors if they see this kind of behavior.
Yes, this is possible. For example, you can use BigQuery as an offline store and Redis as an online store.
get_historical_features without providing an entity dataframe?
Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).
It is a good idea, though, to lock down the registry file so only the CI/CD pipeline can modify it. That way data scientists and other users cannot accidentally modify the registry and lose other teams' data.
There are several kinds of transformations:
These transformations are Pandas transformations run on batch data when you call get_historical_features and at online serving time when you call get_online_features.
Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well.
These will include SQL + PySpark based transformations on batch data sources.
Streaming transformations (RFC in progress)
A feature view can be defined with multiple entities. Since each entity has a unique join_key, using multiple entities will achieve the effect of a composite key.
Yes. Specifically:
Simple lists / dense embeddings:
BigQuery supports list types natively
Redshift does not support list types, so you'll need to serialize these features into strings (e.g. json or protocol buffers)
Sparse embeddings (e.g. one hot encodings)
Yes. Using a GCP or AWS provider in feature_store.yaml primarily sets default offline / online stores and configures where the remote registry file can live (using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.
Yes. For example, the Postgres connector can be used as both an offline and online store (as well as the registry).
Yes. There are two ways to use S3 in Feast:
Using the s3_endpoint_override in a FileSource data source. This endpoint is more suitable for quick proof of concepts that won't necessarily scale for production use cases.
Terraform
An AWS account setup with credentials via aws configure
(e.g see )
M1 Macbook development is untested with this flow. See also .
Windows development has only been tested with WSL. You will need to follow this to have Docker play nicely.
See also: ,
We encourage you to ask questions on or . Even better, once you get an answer, add the answer to this FAQ via a !
The is the easiest way to learn about Feast. For more detailed tutorials, please check out the page.
No, there are .
The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see and for more details.
Feast does not provide a way to do this right now. This is an area we're actively interested in contributions for. See
Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support . Streaming transformations are actively being worked on.
On demand transformations (See )
Batch transformations (WIP, see )
Yes. See .
Please see a detailed comparison of Feast vs. Tecton . For another comparison, please see .
Feast is designed to work at scale and support low latency online serving. See our for details.
Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types (see )
One way to do this efficiently is to have a protobuf or string representation of
The list of supported offline and online stores can be found and , respectively. The indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about custom providers .
Please follow the instructions .
Using Redshift as a data source via Spectrum (), and then continuing with the guide. See a we did on this at our apply() meetup.
Feast supports ingestion via Spark (See ) does not support Spark natively. However, you can create a that will support Spark, which can help with more scalable materialization and ingestion.
Please see the .
For more details on contributing to the Feast community, see and this .
Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this for more details.
Please see this . If you have any questions or suggestions, feel free to leave a comment on the document!
Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the .
Time (min) | Description | Module
30-45 | Setting up Feast projects & CI/CD + powering batch predictions | Module 0
15-20 | Streaming ingestion & online feature retrieval with Kafka, Spark, Redis | Module 1
10-15 | Real-time feature engineering with on demand transformations | Module 2
TBD | Feature server deployment (embed, as a service, AWS Lambda) | TBD
TBD | Versioning features / models in Feast | TBD
TBD | Data quality monitoring in Feast | TBD
TBD | Batch transformations | TBD
TBD | Stream transformations | TBD
Install Feast using pip:
Install Feast with Snowflake dependencies (required when using Snowflake):
Install Feast with GCP dependencies (required when using BigQuery or Firestore):
Install Feast with AWS dependencies (required when using Redshift or DynamoDB):
Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):
Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.
When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.
In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.
This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.
This end-to-end tutorial will take you through the following steps:
Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features
Deploying Redshift as the interface Feast uses to build training datasets
Registering your features with Feast and configuring DynamoDB for online serving
Building a training dataset with Feast to train your credit scoring model
Loading feature values from S3 into DynamoDB
Making online predictions with your credit scoring model using features from DynamoDB
To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:
Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.
If you need to clean up the infrastructure created by feast apply, use the teardown command.
Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!
Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.
Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature views. The only requirement is that the feature views that make up the feature references have the same entity (or composite entity), and that they are located in the same offline store.
3. Create an entity dataframe
An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature views onto. All entities found in feature views that are being joined onto the entity dataframe must be found as a column on the entity dataframe.
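The column requirement above is easy to check before launching a retrieval. The helper below is a hypothetical pre-flight check written in plain Python. It is not a Feast function, and the column and join key names are illustrative only.

```python
from typing import Iterable, List


def missing_entity_df_columns(columns: Iterable[str], join_keys: Iterable[str]) -> List[str]:
    """List the required columns (event_timestamp plus every join key) that are absent."""
    required = ["event_timestamp", *join_keys]
    present = set(columns)
    return [name for name in required if name not in present]


# A complete entity dataframe: nothing is missing.
ok = missing_entity_df_columns(["driver_id", "event_timestamp", "label"], ["driver_id"])

# An incomplete one: the timestamp and one join key are absent.
incomplete = missing_entity_df_columns(["driver_id", "label"], ["driver_id", "customer_id"])

print(ok)
print(incomplete)
```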
It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.
Pandas:
In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.
SQL (Alternative):
Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
4. Launch historical retrieval
Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().
The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.
Online stores only maintain the current state of features, i.e., the latest feature values. No historical data is stored or served.
Please ensure that you have materialized (loaded) your feature values into the online store before starting.
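The "latest values only" behaviour can be pictured as a key-value map that is overwritten on each newer write. The class below is a toy model built on that assumption, not one of Feast's online store implementations. LatestValueStore and the example values are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, Optional, Tuple


class LatestValueStore:
    """Toy online store: keeps one row per entity key, the one with the newest timestamp."""

    def __init__(self) -> None:
        self._rows: Dict[int, Tuple[datetime, Dict[str, Any]]] = {}

    def write(self, entity_key: int, event_timestamp: datetime, values: Dict[str, Any]) -> None:
        current = self._rows.get(entity_key)
        # Older writes never replace a newer value that is already stored.
        if current is None or event_timestamp >= current[0]:
            self._rows[entity_key] = (event_timestamp, values)

    def read(self, entity_key: int) -> Optional[Dict[str, Any]]:
        row = self._rows.get(entity_key)
        return None if row is None else row[1]


store = LatestValueStore()
store.write(1001, datetime(2021, 4, 1, 12), {"average_daily_rides": 15})
store.write(1001, datetime(2021, 4, 1, 10), {"average_daily_rides": 12})  # late, older row

print(store.read(1001))
```

A read returns only the newest value per entity, so any history has to come from the offline store instead.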
Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.
Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.
Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.
Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.
The materialize command allows users to materialize features over a specific historical time range into the online store.
The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.
It is also possible to materialize specific feature views by using the -v / --views argument.
The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
For simplicity, Feast also provides a materialize-incremental command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.
The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).
The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.
Unlike materialize, materialize-incremental automatically determines the start time from which to load features from the batch sources of each feature view. The first time materialize-incremental is executed, it will set the start time to the oldest timestamp of each data source and the end time to the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.
Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e., different feature views may have different periods that have been materialized into the online store.
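The start-time bookkeeping described above can be pictured with a small sketch. It is an illustration only and not Feast's implementation: the dictionary stands in for the state kept in the registry, and the function name incremental_range and the feature view names are assumptions.

```python
from datetime import datetime

# Illustrative stand-in for the state kept in the registry: the end time of
# the most recent materialization run, tracked per feature view.
last_materialized = {}

def incremental_range(feature_view, oldest_source_ts, end):
    """Return the (start, end) window to load and record the new end time."""
    start = last_materialized.get(feature_view, oldest_source_ts)
    last_materialized[feature_view] = end
    return start, end

oldest = datetime(2021, 1, 1)

# First run: starts at the oldest timestamp in the data source.
first = incremental_range("driver_hourly_stats", oldest, datetime(2021, 4, 8))
# Second run: starts where the previous run for this feature view ended.
second = incremental_range("driver_hourly_stats", oldest, datetime(2021, 4, 9))
# A different feature view has its own, independent history.
other = incremental_range("customer_stats", oldest, datetime(2021, 4, 9))

print(first[0], second[0], other[0])
```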
The easiest way to create a new feature repository is to use the feast init command:
The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:
Enter the directory:
You can now use this feature repository for development. You can try the following:
Run feast apply to apply these definitions to Feast.
Edit the example feature definitions in example.py and run feast apply again to change feature definitions.
Initialize a git repository in the same directory and check the feature repository into version control.
The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.
Here we'll be using the example repository we created in the previous guide. You can re-create it by running feast init in a new directory.
At this point, no data has been materialized to your online store. feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See the guide on loading data into the online store for more details.
A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML), and it is highly recommended that teams track it centrally using git. See the feature repository documentation for a detailed explanation of feature repositories.
All Feast operations execute through a provider. This includes operations like materializing data from the offline to the online store, updating infrastructure like databases, launching streaming ingestion jobs, building training datasets, and reading features from the online store.
Custom providers allow Feast users to extend Feast to execute any custom logic. Examples include:
Launching custom streaming ingestion jobs (Spark, Beam)
Launching custom batch ingestion (materialization) jobs (Spark, Beam)
Adding custom validation to feature repositories during feast apply
Adding custom infrastructure setup logic which runs during feast apply
Extending Feast commands with in-house metrics, logging, or tracing
Feast comes with built-in providers, e.g., LocalProvider, GcpProvider, and AwsProvider. However, users can develop their own providers by creating a class that implements the contract in the Provider class.
This guide also comes with a fully functional custom provider demo repository. Please have a look at the repository for a representative example of what a custom provider looks like, or fork the repository when creating your own provider.
The fastest way to add custom logic to Feast is to extend an existing provider. The most generic provider is the LocalProvider, which contains no cloud-specific logic. The guide that follows will extend the LocalProvider with operations that print text to the console. It is up to you as a developer to add your custom code to the provider methods, but the guide below will provide the necessary scaffolding to get you started.
The first step is to define a custom provider class. We've created the MyCustomProvider below.
Notice how in the above provider we have only overridden two of the methods on the LocalProvider, namely update_infra and materialize_single_feature_view. These two methods are convenient to replace if you are planning to launch custom batch or streaming jobs. update_infra can be used for launching idempotent streaming jobs, and materialize_single_feature_view can be used for launching batch ingestion jobs.
It is possible to override all the methods on the provider class. In fact, it isn't even necessary to subclass an existing provider like LocalProvider. The only requirement for the provider class is that it follows the Provider contract.
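The override pattern described here can be sketched without Feast installed. In the sketch below, StubLocalProvider is a stand-in and not Feast's LocalProvider, and the simplified method signatures are assumptions made for illustration; a real provider has to follow the signatures defined in the Provider class.

```python
# Stand-in for a built-in provider. This is NOT Feast's LocalProvider, and
# the simplified method signatures are assumptions for illustration.
class StubLocalProvider:
    def update_infra(self, tables_to_keep, tables_to_delete):
        return f"default infra update: keep={tables_to_keep}, delete={tables_to_delete}"

    def materialize_single_feature_view(self, feature_view, start, end):
        return f"default materialization of {feature_view}"

    def get_historical_features(self, features):
        return f"default retrieval of {features}"


class MyCustomProvider(StubLocalProvider):
    def update_infra(self, tables_to_keep, tables_to_delete):
        # Custom logic (for example, launching a streaming job) runs first,
        # then the inherited behaviour is reused.
        print("Launching custom streaming job")
        return super().update_infra(tables_to_keep, tables_to_delete)

    def materialize_single_feature_view(self, feature_view, start, end):
        print("Launching custom batch materialization job")
        return super().materialize_single_feature_view(feature_view, start, end)


provider = MyCustomProvider()
provider.update_infra(["driver_hourly_stats"], [])
# get_historical_features is not overridden, so the inherited version runs.
print(provider.get_historical_features(["driver_hourly_stats:conv_rate"]))
```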
Configure your feature_store.yaml file to point to your new provider class:
Notice how the provider field above points to the module and class where your provider can be found.
Now you should be able to use your provider by running a Feast command:
It may also be necessary to add the module root path to your PYTHONPATH as follows:
That's it. You should now have a fully functional custom provider!
Have a look at the custom provider demo repository for a fully functional example of a custom provider. Feel free to fork it when creating your own custom provider!
Feast makes adding support for a new offline store (database) easy. Developers can simply implement the OfflineStore interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and BigQuery).
In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative for adding support for any new offline store.
The full working code for this guide can be found at feast-dev/feast-custom-offline-store-demo.
The process for using a custom offline store consists of 6 steps:
Defining an OfflineStore class.
Defining an OfflineStoreConfig class.
Defining a RetrievalJob class for this offline store.
Defining a DataSource class for the offline store.
Referencing the OfflineStore in a feature repo's feature_store.yaml file.
Testing the OfflineStore class.
OfflineStore class names must end with the OfflineStore suffix!
The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.
There are two methods that deal with reading data from the offline store: get_historical_features and pull_latest_from_table_or_query.
pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.
get_historical_features is invoked when reading values from the offline store using the FeatureStore.get_historical_features() method. Typically, this method is used to retrieve features when training ML models.
In addition, pull_all_from_table_or_query is a method that pulls all the data from an offline store from a specified start date to a specified end date.
Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.
To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined here.
The FeastConfigBaseModel is a pydantic class, which parses YAML configuration into Python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.
This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.
Additionally, the name of the config class must be the same as the OfflineStore class, with the Config suffix.
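The two naming rules and the type field can be illustrated with plain Python. This is only a sketch: real config classes derive from FeastConfigBaseModel (a pydantic model), whereas a dataclass is used here so that the example runs without Feast. The class names and the module path my_module are hypothetical.

```python
from dataclasses import dataclass

# Stand-ins only: real Feast config classes derive from FeastConfigBaseModel.
# The class names and the module path "my_module" are hypothetical.
class CustomFileOfflineStore:
    pass


@dataclass
class CustomFileOfflineStoreConfig:
    # Fully qualified class name of the corresponding OfflineStore class.
    type: str = "my_module.CustomFileOfflineStore"


def follows_naming_convention(store_cls, config_cls):
    """Check the OfflineStore suffix rule and the Config suffix rule."""
    return (
        store_cls.__name__.endswith("OfflineStore")
        and config_cls.__name__ == store_cls.__name__ + "Config"
    )


config = CustomFileOfflineStoreConfig()
print(follows_naming_convention(CustomFileOfflineStore, CustomFileOfflineStoreConfig))  # True
print(config.type.rsplit(".", 1)[-1] == CustomFileOfflineStore.__name__)                # True
```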
An example of the config class for the custom file offline store:
This configuration can be specified in the feature_store.yaml as follows:
This configuration information is available to the methods of the OfflineStore via the config: RepoConfig parameter, which is passed into the methods of the OfflineStore interface, specifically at the config.offline_store field of the config parameter.
The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob instance, which represents the execution of the actual query against the underlying store.
Custom offline stores may need to implement their own instances of the RetrievalJob interface.
The RetrievalJob interface exposes two methods: to_df and to_arrow. The expectation is for the retrieval job to be able to return the rows read from the offline store as a pandas DataFrame or as an Arrow table, respectively.
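Lazy execution can be sketched as follows. The class below is an illustration and not Feast's RetrievalJob: the method to_rows stands in for to_df and to_arrow, and plain dictionaries stand in for dataframe rows.

```python
# Illustrative stand-in, not Feast's RetrievalJob: the query is wrapped in a
# callable and only runs when a result is requested.
class LazyRetrievalJob:
    def __init__(self, query_fn):
        self._query_fn = query_fn
        self.executed = False

    def to_rows(self):
        # Plays the role of to_df()/to_arrow(): this is where the read happens.
        self.executed = True
        return self._query_fn()


def read_offline_store():
    return [{"driver_id": 1001, "conv_rate": 0.5}]


job = LazyRetrievalJob(read_offline_store)  # nothing has been read yet
print(job.executed)                         # False
rows = job.to_rows()                        # the read happens here
print(job.executed)                         # True
```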
Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource base class needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.
The data source class should implement two methods: from_proto and to_proto.
For custom offline stores that are not being implemented in the main feature repo, the custom_options field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto method and reading the value back from bytes in the from_proto method.
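One simple way to do this round trip is JSON encoded as UTF-8. The sketch below shows only the bytes conversion, not the protobuf handling, and the option names are assumptions chosen for illustration.

```python
import json

def options_to_bytes(options):
    """What a to_proto implementation might do before storing the options."""
    return json.dumps(options, sort_keys=True).encode("utf-8")

def options_from_bytes(raw):
    """What a from_proto implementation might do to restore them."""
    return json.loads(raw.decode("utf-8"))

# Hypothetical data source options; the field names are assumptions.
options = {
    "path": "data/driver_stats.parquet",
    "event_timestamp_column": "event_timestamp",
}

raw = options_to_bytes(options)
restored = options_from_bytes(raw)
print(type(raw).__name__)   # bytes
print(restored == options)  # True
```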
After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the offline_store field. The value specified should be the fully qualified class name of the OfflineStore.
As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.
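Resolving a fully qualified class name at runtime can be done with the standard library. The sketch below shows the general technique rather than Feast's own loader, and the helper name load_class is hypothetical.

```python
import importlib

def load_class(fully_qualified_name):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, class_name = fully_qualified_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# A standard-library class is used here so the sketch runs anywhere. In a
# feature repo the string would name your OfflineStore class instead.
cls = load_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

If the module cannot be found on the Python path, import_module raises ModuleNotFoundError, which is why the class has to be available in the Python environment.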
To use our custom file offline store, we can use the following feature_store.yaml:
If additional configuration for the offline store is not required, then we can omit the other fields and only specify the type of the offline store class as the value for the offline_store field.
Finally, the custom data source class can be used in the feature repo to define a data source, which can then be referenced in a feature view definition.
Even if you have created the OfflineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. In the Feast submodule, we can run all the unit tests with:
The universal tests, which are integration tests specifically intended to test offline and online stores, can be run with:
The unit tests should succeed, but the universal tests will likely fail. The tests are parametrized based on the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py. To override these configurations, you can create your own file that contains a FULL_REPO_CONFIGS variable, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE. The main challenge there will be to write a DataSourceCreator for the offline store. In this repo, the file that overrides FULL_REPO_CONFIGS is feast_custom_offline_store/feast_tests.py, so you would run
to test the offline store against the Feast universal tests. You should notice that some of the tests actually fail; this indicates that there is a mistake in the implementation of this offline store!
Let's examine the Feast codebase. This analysis is accurate as of Feast 0.23.
The Python SDK lives in sdk/python/feast. The majority of Feast logic lives in these Python files:
The core Feast objects (entities, feature views, data sources, etc.) are defined in their respective Python files, such as entity.py, feature_view.py, and data_source.py.
The FeatureStore class is defined in feature_store.py and the associated configuration object (the Python representation of the feature_store.yaml file) is defined in repo_config.py.
The CLI and other core feature store logic are defined in cli.py and repo_operations.py.
The type system that is used to manage conversion between Feast types and external typing systems is managed in type_map.py.
The Python feature server (the server that is started through the feast serve command) is defined in feature_server.py.
There are also several important submodules:
infra/ contains all the infrastructure components, such as the provider, offline store, online store, batch materialization engine, and registry.
dqm/ covers data quality monitoring, such as the dataset profiler.
diff/ covers the logic for determining how to apply infrastructure changes upon feature repo changes (e.g. the output of feast plan and feast apply).
embedded_go/ covers the Go feature server.
ui/ contains the embedded Web UI, to be launched on the feast ui command.
Of these submodules, infra/ is the most important. It contains the interfaces for the provider, offline store, online store, batch materialization engine, and registry, as well as all of their individual implementations.
The tests for the Python SDK are contained in sdk/python/tests. For more details, see this overview of the test suite.
feast apply
Let's walk through how feast apply works by tracking its execution across the codebase.
All CLI commands are in cli.py. Most of these commands are backed by methods in repo_operations.py. The feast apply command triggers apply_total_command, which then calls apply_total in repo_operations.py.
With a FeatureStore object (from feature_store.py) that is initialized based on the feature_store.yaml in the current working directory, apply_total first parses the feature repo with parse_repo and then calls either FeatureStore.apply or FeatureStore._apply_diffs to apply those changes to the feature store.
Let's examine FeatureStore.apply. It splits the objects based on class (e.g. Entity, FeatureView, etc.) and then calls the appropriate registry method to apply or delete the object. For example, it might call self._registry.apply_entity to apply an entity. If the default file-based registry is used, this logic can be found in infra/registry/registry.py.
Then the feature store must update its cloud infrastructure (e.g. online store tables) to match the new feature repo, so it calls Provider.update_infra, which can be found in infra/provider.py.
Assuming the provider is a built-in provider (e.g. one of the local, GCP, or AWS providers), it will call PassthroughProvider.update_infra in infra/passthrough_provider.py.
This delegates to the online store and batch materialization engine. For example, if the feature store is configured to use the Redis online store, then the update method from infra/online_stores/redis.py will be called. And if the local materialization engine is configured, then the update method from infra/materialization/local_engine.py will be called.
At this point, the feast apply command is complete.
feast materialize
Let's walk through how feast materialize works by tracking its execution across the codebase.
The feast materialize command triggers materialize_command in cli.py, which then calls FeatureStore.materialize from feature_store.py.
This then calls Provider.materialize_single_feature_view, which can be found in infra/provider.py.
As with feast apply, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.materialize_single_feature_view will be called.
This delegates to the underlying batch materialization engine. Assuming that the local engine has been configured, LocalMaterializationEngine.materialize from infra/materialization/local_engine.py will be called.
Since materialization involves reading features from the offline store and writing them to the online store, the local engine will delegate to both the offline store and online store. Specifically, it will call OfflineStore.pull_latest_from_table_or_query and OnlineStore.online_write_batch. These two calls will be routed to the offline store and online store that have been configured.
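The two-step shape of materialization (pull the latest row per entity from the offline store, then write the batch to the online store) can be sketched with in-memory stand-ins. The functions pull_latest and write_batch below are simplified illustrations and not the Feast methods named above, and the sample rows are invented for the example.

```python
from datetime import datetime

# In-memory stand-ins for the two stores. The names and data are illustrative.
offline_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 7, 9), "trips_today": 3},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 7, 15), "trips_today": 8},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 7, 11), "trips_today": 5},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 9, 11), "trips_today": 9},
]
online_store = {}

def pull_latest(rows, join_key, start, end):
    """Keep the most recent row per entity key within [start, end]."""
    latest = {}
    for row in rows:
        if start <= row["event_timestamp"] <= end:
            key = row[join_key]
            if key not in latest or row["event_timestamp"] > latest[key]["event_timestamp"]:
                latest[key] = row
    return latest

def write_batch(store, latest):
    """Overwrite the stored value for each entity key."""
    store.update(latest)

window = (datetime(2021, 4, 7), datetime(2021, 4, 8))
write_batch(online_store, pull_latest(offline_rows, "driver_id", *window))

print(online_store[1001]["trips_today"])  # 8
print(online_store[1002]["trips_today"])  # 5
```

The row for driver 1002 on 2021-04-09 falls outside the requested window, so it is not loaded.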
get_historical_features
Let's walk through how get_historical_features works by tracking its execution across the codebase.
We start with FeatureStore.get_historical_features in feature_store.py. This method does some internal preparation, and then delegates the actual execution to the underlying provider by calling Provider.get_historical_features, which can be found in infra/provider.py.
As with feast apply, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.get_historical_features will be called.
That call simply delegates to OfflineStore.get_historical_features. So if the feature store is configured to use Snowflake as the offline store, SnowflakeOfflineStore.get_historical_features will be executed.
The java/ directory contains the Java serving component. See here for more details on how the repo is structured.
The go/ directory contains the Go feature server. Most of the files here have logic to help with reading features from the online store. Within go/, the internal/feast/ directory contains most of the core logic:
onlineserving/ covers the core serving logic.
model/ contains the implementations of the Feast objects (entity, feature view, etc.).
For example, entity.go is the Go equivalent of entity.py. It contains a very simple Go implementation of the entity object.
registry/ covers the registry.
Currently only the file-based registry is supported (the SQL-based registry is unsupported). Additionally, the file-based registry only supports a file-based registry store, not the GCS or S3 registry stores.
onlinestore/ covers the online stores (currently only Redis and SQLite are supported).
Feast uses protobuf to store serialized versions of the core Feast objects. The protobuf definitions are stored in protos/feast.
The ui/ directory contains the Web UI. See here for more details on the structure of the Web UI.
A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.
Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:
See here for an example of how to use stream feature views.
Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. Data Quality Monitoring was the primary motivation for creating the dataset concept.
A dataset's metadata is stored in the Feast registry, and its raw data (features, entities, additional input keys and timestamp) is stored in the offline store.
A dataset can be created from:
Results of historical retrieval
[planned] Logging request (including input for on demand transformation) and response during feature serving
[planned] Logging features during writing to online store (from batch source or stream)
To create a saved dataset from historical features for later retrieval or analysis, a user needs to call the get_historical_features method first and then pass the returned retrieval job to the create_saved_dataset method. create_saved_dataset will trigger the provided retrieval job (by calling .persist() on it) to store the data using the specified storage. The storage type must be the same as the globally configured offline store (e.g., it's impossible to persist data to Redshift with a BigQuery source). create_saved_dataset will also create a SavedDataset object with all related metadata and write it to the registry.
A saved dataset can later be retrieved using the get_saved_dataset method:
Check out our tutorial on validating historical features to see how this concept can be applied in a real-world use case.
Slack: Feel free to ask questions or say hello!
Mailing list: We have both a user and developer mailing list.
Feast users should join feast-discuss@googlegroups.com group by clicking here.
Feast developers should join feast-dev@googlegroups.com group by clicking here.
Community Calendar: Includes community calls and design meetings.
Google Folder: This folder is used as a central repository for all Feast resources. For example:
Design proposals in the form of Request for Comments (RFC).
User surveys and meeting minutes.
Slide decks of conferences our contributors have spoken at.
Feast GitHub Repository: Find the complete Feast codebase on GitHub.
Feast Linux Foundation Wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.
Slack: Need to speak to a human? Come ask a question in our Slack channel (link above).
GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.
StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to StackOverflow.
We have a user and contributor community call every two weeks (US & EU friendly).
Please join the above Feast user groups in order to see calendar invites to the community calls.
Tuesday 10:00 am to 10:30 am PST
Meeting notes (incl recordings): https://bit.ly/feast-notes
The data source refers to raw underlying data (e.g. a table in BigQuery).
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store.
Below is an example data source with a single entity (driver) and two features (trips_today and rating).
The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).
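Resource namespacing can be pictured with a one-line helper. The exact naming scheme depends on the online store in use, so the pattern below (project name, underscore, feature view name) is an assumption made for illustration.

```python
# Assumed naming scheme for illustration: "<project>_<feature view name>".
def namespaced_table(project, feature_view):
    return f"{project}_{feature_view}"

# The same feature view maps to a different table in each project, so
# projects never read or overwrite each other's data.
print(namespaced_table("staging", "driver_hourly_stats"))  # staging_driver_hourly_stats
print(namespaced_table("prod", "driver_hourly_stats"))     # prod_driver_hourly_stats
```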
Projects are currently being supported for backward compatibility reasons. Projects may change in the future as we simplify the Feast API.
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
Entities are typically defined as part of feature views. Entity name is used to reference the entity from a feature view definition and join key is used to identify the physical primary key on which feature values should be stored and retrieved. These keys are used during the lookup of feature values from the online store and the join process in point-in-time joins. It is possible to define composite entities (more than one entity object) in a feature view. It is also possible for feature views to have zero entities. See feature view for more details.
Entities should be reused across feature views.
A related concept is an entity key. These are one or more entity values that uniquely describe a feature view record. In the case of an entity (like a driver) that only has a single entity field, the entity is an entity key. However, it is also possible for an entity key to consist of multiple entity values. For example, a feature view with the composite entity of (customer, country) might have an entity key of (1001, 5).
Entity keys act as primary keys. They are used during the lookup of features from the online store, and they are also used to match feature rows across feature views during point-in-time joins.
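A composite entity key behaves like a tuple-valued primary key. The sketch below uses a dictionary as a stand-in for an online store; the feature name total_orders and its values are invented for the example.

```python
# Illustrative in-memory online store keyed by (customer, country) tuples.
# The feature name and values are hypothetical.
online_features = {
    (1001, 5): {"total_orders": 12},
    (1001, 7): {"total_orders": 3},
}

def entity_key(row, join_keys):
    """Build the entity key from the join key columns, in a fixed order."""
    return tuple(row[k] for k in join_keys)

request = {"customer": 1001, "country": 5}
key = entity_key(request, ["customer", "country"])

print(key)                   # (1001, 5)
print(online_features[key])  # {'total_orders': 12}
```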
Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.
Kinesis sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kinesis streams to a batch data source such as a data warehouse table. When using a Kinesis source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.
Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:
Raw events come in (stream 1)
Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)
Write stream 2 values to an offline store as a historical log for training (optional)
Write stream 2 values to an online store for low latency feature serving
Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance
Note that the Kinesis source has a batch source.
The Kinesis source can be used in a stream feature view.
Snowflake tables and views are allowed as sources.
All joins happen within Snowflake.
Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be uploaded to Snowflake in order to complete join operations.
A SnowflakeRetrievalJob is returned when calling get_historical_features().
This allows you to call
to_snowflake to save the dataset into Snowflake
to_sql to get the SQL query that would execute on to_df
This Spark offline store still does not achieve full test coverage and continues to fail some integration tests when integrating with the Feast universal test suite. Please do NOT assume complete stability of the API.
Spark tables and views are allowed as sources that are loaded in from some Spark store (e.g., in Hive or in memory).
Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be converted to a Spark dataframe and processed as a temporary view.
A SparkRetrievalJob is returned when calling get_historical_features().
This allows you to call
to_df to retrieve the pandas dataframe.
to_arrow to retrieve the dataframe as a pyarrow Table.
to_spark_df to retrieve the dataframe as a Spark dataframe.
The Feast registry is where all applied Feast objects (e.g. feature views, entities, etc.) are stored. The registry exposes methods to apply, list, retrieve and delete these objects. The registry is an abstraction, with multiple possible implementations.
By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS).
However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires rewriting the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or it bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
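The data-loss risk is the classic lost update. The sketch below uses a dictionary as a stand-in for the serialized registry file; the feature view names and the ttl_hours field are invented for the example.

```python
import copy

# The whole registry is read and written as one unit, like a serialized file.
registry_file = {
    "driver_stats": {"ttl_hours": 2},
    "customer_stats": {"ttl_hours": 24},
}

# Two writers each read the full file before either one writes.
writer_a = copy.deepcopy(registry_file)
writer_b = copy.deepcopy(registry_file)

writer_a["driver_stats"]["ttl_hours"] = 4     # writer A changes one field
writer_b["customer_stats"]["ttl_hours"] = 48  # writer B changes another

registry_file = writer_a  # A rewrites the whole file
registry_file = writer_b  # B rewrites the whole file, discarding A's change

print(registry_file["driver_stats"]["ttl_hours"])    # 2 (A's update is lost)
print(registry_file["customer_stats"]["ttl_hours"])  # 48
```

Avoiding this requires the writers to take turns, which is the write bottleneck mentioned above.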
Feature values in Feast are modeled as time-series records. Below is an example of a driver feature view with two feature columns (trips_today and earnings_today):
The above table can be registered with Feast through the following feature view:
Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.
Given the following entity dataframe, imagine a user would like to join the above driver_hourly_stats feature view onto it, while preserving the trip_success column:
The timestamps within the entity dataframe above are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:
For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time.
Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).
Below is the resulting joined training dataframe. It contains both the original entity rows and joined feature values:
Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.
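The join rule (scan backward from each entity timestamp, no further than the TTL) can be sketched in plain Python. This is an illustration of the rule and not Feast's implementation. The sample rows, the 2-hour TTL, and the choice to treat both ends of the TTL window as inclusive are assumptions.

```python
from datetime import datetime, timedelta

def point_in_time_join(entity_rows, feature_rows, join_key, feature, ttl):
    """For each entity row, attach the latest feature value whose timestamp
    lies in [event_timestamp - ttl, event_timestamp]; None if there is none."""
    joined = []
    for entity in entity_rows:
        ts = entity["event_timestamp"]
        candidates = [
            f for f in feature_rows
            if f[join_key] == entity[join_key] and ts - ttl <= f["event_timestamp"] <= ts
        ]
        best = max(candidates, key=lambda f: f["event_timestamp"], default=None)
        joined.append({**entity, feature: best[feature] if best else None})
    return joined

# Invented sample data for the sketch.
features = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "trips_today": 4},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "trips_today": 6},
]
entities = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 7)},   # before any feature row
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 9)},   # matches the 08:00 row
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 21)},  # 11 hours after the last row
]

result = point_in_time_join(entities, features, "driver_id", "trips_today", ttl=timedelta(hours=2))
print([row["trips_today"] for row in result])  # [None, 4, None]
```

The 09:00 entity row receives the 08:00 value and not the 10:00 value, because a feature row from the future must never leak into a training example.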
Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
Feast Apply: The user (or CI) publishes version-controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.
Feast Materialize: The user (or scheduler) executes feast materialize, which loads features from the offline store into the online store.
Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset and trains a model.
Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
Prediction: A backend system makes a request for a prediction from the model serving service.
Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
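The materialization and online retrieval steps above can be sketched with the Python SDK. In practice these two steps usually run in different systems (a scheduler and a model serving service); they are combined here only for illustration. The store is passed in as a parameter, and the feature reference and entity key in the usage comment are hypothetical.

```python
from datetime import datetime, timezone


def refresh_and_fetch(store, feature_refs, entity_rows):
    """Load recent offline values into the online store, then read them back.

    `store` is expected to be a feast.FeatureStore.
    """
    # Materialize everything since the previous materialization up to now.
    store.materialize_incremental(end_date=datetime.now(timezone.utc))
    # Look up the latest values for the requested entities.
    response = store.get_online_features(features=feature_refs, entity_rows=entity_rows)
    return response.to_dict()


# Typical usage (requires the feast package and an applied repository):
#   from feast import FeatureStore
#   values = refresh_and_fetch(
#       FeatureStore(repo_path="."),
#       ["driver_hourly_stats:conv_rate"],   # hypothetical feature reference
#       [{"driver_id": 1001}],               # hypothetical entity key
#   )
```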
A complete Feast deployment contains the following components:
Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
Feast Python SDK/CLI: The primary user facing SDK. Used to:
Manage version controlled feature definitions.
Materialize (load) feature values into the online store.
Build and retrieve training datasets from the offline store.
Retrieve online features.
Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. Feast does not manage the offline store directly, but runs queries against it.
Java and Go Clients are also available for online feature retrieval.
Feast users use Feast to manage two important sets of configuration:
Configuration about how to run Feast on your infrastructure
Feature definitions
With Feast, the above configuration can be written declaratively and stored as code in a central location. This central location is called a feature repository. The feature repository is the declarative source of truth for what the desired state of a feature store should be.
The Feast CLI uses the feature repository to configure, deploy, and manage your feature store.
An example structure of a feature repository is shown below:
The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.
Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports three different backends:
Local: Used as a local backend for storing the registry during development
S3: Used as a centralized backend for storing the registry on AWS
GCS: Used as a centralized backend for storing the registry on GCP
The feature registry is updated during different operations when using Feast. More specifically, objects within the registry (entities, feature views, feature services) are updated when running apply from the Feast CLI, but metadata about objects can also be updated during operations like materialization.
Users interact with a feature registry through the Feast SDK. Listing all feature views:
Or retrieving a specific feature view:
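A minimal sketch of both registry lookups, assuming a feast.FeatureStore instance is passed in. The view name in the usage comment is taken from the driver_hourly_stats example and may differ in your repository.

```python
def describe_registry(store, view_name):
    """Return the names of all registered feature views plus one view by name."""
    # Listing all feature views known to the registry.
    all_names = [view.name for view in store.list_feature_views()]
    # Retrieving a specific feature view.
    selected = store.get_feature_view(view_name)
    return all_names, selected


# Typical usage (requires the feast package and an applied repository):
#   from feast import FeatureStore
#   names, view = describe_registry(FeatureStore(repo_path="."), "driver_hourly_stats")
```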
Kinesis sources allow users to register Kinesis streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kinesis. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store. A tutorial shows how to launch such a job with Spark to ingest from Kafka; by using a different plugin, the example can be adapted to Kinesis. Feast can also write to the offline store using write_to_offline_store.
See the tutorial for an example of how to ingest data from a Kafka source into Feast. The approach used in the tutorial can be easily adapted to work for Kinesis as well.
The Snowflake offline store provides support for reading .
to_arrow_chunks to get the result in batches
Configuration options are available in .
The Spark offline store is an offline store currently in alpha development that provides support for reading .
Alternatively, a can be used for a more scalable registry.
Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is populated by materialization jobs and from .
For more details, see the reference.
The feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.
A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).
Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
If the built-in providers are not sufficient, you can create your own custom provider. Please see this guide for more details.
Please see feature_store.yaml for configuring providers.
Feast users can choose to retrieve features from a feature server, as opposed to through the Python SDK.
These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.
Fraud prediction is a common use case in machine learning. This tutorial builds an end-to-end, production-ready fraud prediction system that predicts in real time whether a transaction made by a user is fraudulent.
Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.
Our end-to-end example will perform the following workflows:
Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model
Making online predictions from feature data
Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):
The Feast project logs anonymous usage statistics and errors in order to inform our planning. Several client methods are tracked, beginning in Feast 0.9. Users are assigned a UUID which is sent along with the name of the method, the Feast version, the OS (using sys.platform), and the current time.
To disable usage logging, set the environment variable FEAST_USAGE to False.
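When Feast is driven from a Python process rather than a shell, the variable can be set in code. A minimal sketch; setting it before Feast is imported is the cautious assumption here, since the point at which Feast reads the variable is not stated above.

```python
import os

# Opt out of anonymous usage statistics for this process and its children.
# Assumption: this runs before any `import feast` statement in the program.
os.environ["FEAST_USAGE"] = "False"
```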
Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.
In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.
Try it and let us know what you think!
We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.
Don't see your offline store or online store of choice here? Check out our guides to make a custom one!
In order for a plugin integration to be highlighted, it must meet the following requirements:
The plugin must have some basic documentation on how it should be used.
The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).
In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:
The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.
There is documentation and a tutorial on how to use the integration.
The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.
If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.
Initial demonstration of Snowflake as an offline store with Feast, using the Snowflake demo template.
In the steps below, we will set up a sample Feast project that leverages Snowflake as an offline store.
Starting with data in a Snowflake table, we will register that table to the feature store and define features associated with the columns in that table. From there, we will generate historical training data based on those feature definitions and then materialize the latest feature values into the online store. Lastly, we will retrieve the materialized feature values.
Our template will generate new data containing driver statistics. From there, we will show you code snippets that will call to the offline store for generating training datasets, and then the code for calling the online store to serve you the latest feature values to serve models in production.
The following files will automatically be created in your project folder:
feature_store.yaml -- This is your main configuration file
driver_repo.py -- This is your main feature definition file
test.py -- This is a file to test your feature store configuration
feature_store.yaml
Here you will see the information that you entered. This template will use Snowflake as an offline store and SQLite as the online store. The main thing to remember is that, by default, Snowflake objects have ALL CAPS names unless lower case was specified.
test.py
In this tutorial, we will use the public dataset of Chicago taxi trips to present data validation capabilities of Feast.
The original dataset is stored in BigQuery and consists of raw data for each taxi trip (one row per trip) since 2013.
We will generate several training datasets (aka historical features in Feast) for different periods and evaluate expectations made on one dataset against another.
Types of features we're ingesting and generating:
Features that aggregate raw data with daily intervals (e.g., trips per day, average fare or speed for a specific day).
Features using SQL while pulling data from BigQuery (like total trip time or total miles travelled).
Features calculated on the fly when requested using Feast's on-demand transformations
Our plan:
Prepare environment
Pull data from BigQuery (optional)
Declare & apply features and feature views in Feast
Generate reference dataset
Develop & test profiler function
Run validation on different dataset using reference dataset & profiler
Install Feast Python SDK and great expectations:
You can skip this step if you don't have a GCP account. In that case, use the parquet files that come with this tutorial instead.
Running some basic aggregations while pulling data from BigQuery. Grouping by taxi_id and day:
Generating range of timestamps with daily frequency:
Cross merge (aka relation multiplication) produces entity dataframe with each taxi_id repeated for each timestamp:
156984 rows × 2 columns
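The timestamp generation and cross merge described above can be sketched with the standard library alone. This is an illustration rather than the tutorial's own code: the taxi IDs and the three-day date range are hypothetical, and in a real pipeline the rows would be loaded into a dataframe before being passed to Feast.

```python
from datetime import date, datetime, timedelta
from itertools import product


def daily_timestamps(start, end):
    """Return one midnight timestamp per day from start to end, inclusive."""
    days = (end - start).days + 1
    return [
        datetime.combine(start + timedelta(days=offset), datetime.min.time())
        for offset in range(days)
    ]


def build_entity_rows(taxi_ids, timestamps):
    """Pair every taxi_id with every timestamp (a cross join)."""
    return [
        {"taxi_id": taxi_id, "event_timestamp": ts}
        for taxi_id, ts in product(taxi_ids, timestamps)
    ]


# Hypothetical inputs: two taxis over three days yields 2 x 3 = 6 entity rows.
timestamps = daily_timestamps(date(2019, 6, 1), date(2019, 6, 3))
entity_rows = build_entity_rows(["taxi_a", "taxi_b"], timestamps)
```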
Retrieving historical features for resulting entity dataframe and persisting output as a saved dataset:
A dataset profiler is a function that accepts a dataset and generates a set of its characteristics. These characteristics are then used to evaluate (validate) subsequent datasets.
Important: datasets are not compared to each other! Feast uses a reference dataset and a profiler function to generate a reference profile. This profile is then used during validation of the tested dataset.
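The relationship between reference dataset, profiler, and validation can be illustrated without any library. The sketch below is a simplified stand-in, not the Great Expectations or Feast API: the profile is just an allowed range for each column mean, and the column names, values, and 10% tolerance are hypothetical.

```python
from statistics import mean


def build_profile(reference_rows, tolerance=0.10):
    """Profiler: derive an allowed range for each column mean from the reference dataset."""
    profile = {}
    for column in reference_rows[0]:
        center = mean(row[column] for row in reference_rows)
        margin = abs(center) * tolerance
        profile[column] = (center - margin, center + margin)
    return profile


def validate(tested_rows, profile):
    """Return the columns whose mean in the tested dataset falls outside the profile."""
    failures = []
    for column, (low, high) in profile.items():
        observed = mean(row[column] for row in tested_rows)
        if not low <= observed <= high:
            failures.append(column)
    return failures


# Hypothetical reference and tested datasets.
reference = [{"trip_count": 20, "avg_fare": 10.0}, {"trip_count": 30, "avg_fare": 12.0}]
tested = [{"trip_count": 10, "avg_fare": 15.0}, {"trip_count": 14, "avg_fare": 17.0}]

profile = build_profile(reference)
failed_columns = validate(tested, profile)
```

The tested dataset is only ever checked against the profile, never against the reference rows themselves.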
Loading saved dataset first and exploring the data:
156984 rows × 10 columns
Testing our profiler function:
Verify that all expectations that we coded in our profiler are present here. If an expectation is missing, it failed to pass on the reference dataset (by default, Great Expectations drops such expectations silently).
Now we can create validation reference from dataset and profiler function:
and test it against our existing retrieval job
Validation passed, as no exceptions were raised.
Creating new timestamps for Dec 2020:
35448 rows × 2 columns
Execute retrieval job with validation reference:
Validation failed since several expectations didn't pass:
Trip count (mean) decreased more than 10% (which is expected when comparing Dec 2020 vs June 2019)
Average Fare increased - all quantiles are higher than expected
Earn per hour (mean) increased more than 10% (most probably due to increased fare)
The is available here.
This tutorial guides you on how to use Feast with . You will learn how to:
Train a model locally (on your laptop) using data from
Test the model for online inference using (for fast iteration)
Test the model for online inference using (for production use)
See
The plugin must have tests. Ideally it would use the Feast universal tests (see this for an example), but custom tests are fine.
The original notebook and datasets for this tutorial can be found on .
Read more about feature views in
Read more about on demand feature views
Feast uses Great Expectations as a validation engine and ExpectationSuite as a dataset's profile. Hence, we need to develop a function that will generate an ExpectationSuite. This function will receive an instance of PandasDataset (a wrapper around pandas.DataFrame), so we can use both the Pandas DataFrame API and some helper functions from PandasDataset during profiling.