Loading...
Loading...
Loading...
In this tutorial we will
Deploy a local feature store with a Parquet file offline store and Sqlite online store.
Build a training dataset using our time series features from our Parquet files.
Ingest batch features ("materialization") and streaming features (via a Push API) into the online store.
Read the latest features from the offline store for batch scoring
Read the latest features from the online store for real-time inference.
Explore the (experimental) Feast UI
In this tutorial, we'll use Feast to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:
Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.
Feature and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).
(Experimental) Feast enables light-weight feature transformations so users can re-use transformation logic across online / offline use cases and across models.
Install the Feast SDK and CLI using pip:
Bootstrap a new feature repository using feast init
from the command line.
Let's take a look at the resulting demo repo itself. It breaks down into
data/
contains raw demo parquet data
example_repo.py
contains demo feature definitions
feature_store.yaml
contains a demo setup configuring where data sources are
test_workflow.py
showcases how to run all key Feast commands, including defining, retrieving, and pushing features. You can run this with python test_workflow.py
.
The feature_store.yaml
file configures the key overall architecture of the feature store.
The provider value sets default offline and online stores.
The offline store provides the compute layer to process historical data (for generating training data & feature values for serving).
The online store is a low latency store of the latest feature values (for powering real-time inference).
Valid values for provider
in feature_store.yaml
are:
local: use a SQL registry or local file registry. By default, use a file / Dask based offline store + SQLite online store
gcp: use a SQL registry or GCS file registry. By default, use BigQuery (offline store) + Google Cloud Datastore (online store)
aws: use a SQL registry or S3 file registry. By default, use Redshift (offline store) + DynamoDB (online store)
The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.
There's an included test_workflow.py
file which runs through a full sample workflow:
Register feature definitions through feast apply
Generate a training dataset (using get_historical_features
)
Generate features for batch scoring (using get_historical_features
)
Ingest batch features into an online store (using materialize_incremental
)
Fetch online features to power real time inference (using get_online_features
)
Ingest streaming features into offline / online stores (using push
)
Verify online features are updated / fresher
We'll walk through some snippets of code below and explain
The apply
command scans python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example_repo.py
and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by configuring online_store
in feature_store.yaml
.
To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.
Feast needs a list of entities (e.g. driver ids) and timestamps. Feast will intelligently join relevant tables to create the relevant feature vectors. There are two ways to generate this list:
The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation.
Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
To power a batch model, we primarily need to generate features with the get_historical_features
call, but using the current timestamp
We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental
serializes all new features since the last materialize
call).
At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features()
. These feature vectors can then be fed to the model.
The driver_activity_v1
feature service pulls all features from the driver_hourly_stats
feature view:
View all registered features, data sources, entities, and feature services with the Web UI.
One of the ways to view this is with the feast ui
command.
test_workflow.py
Take a look at test_workflow.py
again. It showcases many sample flows on how to interact with Feast. You'll see these show up in the upcoming concepts + architecture + tutorial pages as well.
Loading...
Loading...
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.
Entities are used by Feast in many contexts, as we explore below:
Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:
zero entities (e.g. a global feature like num_daily_global_transactions)
one entity (e.g. a user feature like user_age or last_5_bought_items)
multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)
Feast refers to this collection of entities for a feature view as an entity key.
Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists understand how other teams build features for the entity they are most interested in.
Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.
At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch point-in-time correct features for to generate a training dataset.
At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).
Q: Can I retrieve features for all entities?
Kind of.
In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an open GitHub issue that welcomes contribution to make this a more intuitive API.
For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.
Note: Feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps.
Feature views consist of:
a name to uniquely identify this feature view in the project.
(optional, but recommended) metadata (for example, description, or other free-form metadata via tags
)
(optional) a TTL, which limits how far back Feast will look when generating historical datasets
Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.
Feature views are used during
The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.
If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only timestamps are needed for this feature view).
If the schema
parameter is not specified in the creation of the feature view, Feast will infer the features during feast apply
by creating a Field
for each column in the underlying data source except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
"Entity aliases" can be specified to join entity_dataframe
columns that do not match the column names in the source table of a FeatureView.
This could be used if a user has no control over these column names or if there are multiple entities are a subclass of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.
It is suggested that you dynamically specify the new FeatureView name using .with_name
and join_key_map
override using .with_join_key_map
instead of needing to register each new copy.
A field or feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer
entity could be the number of transactions they have made on an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month. Supported types for fields in Feast can be found in sdk/python/feast/types.py
.
Fields are defined as part of feature views. Since Feast does not transform data, a field is essentially a schema that only contains a name and a type:
On demand feature views allows data scientists to use existing features and request time data (features only available at request time) to transform and create new features. Users define python transformation logic which is executed in both the historical retrieval and online retrieval paths.
Currently, these transformations are executed locally. This is fine for online serving, but does not scale well to offline retrieval.
This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could
Call get_historical_features
to generate a training dataframe
Iterate in notebook on feature engineering in Pandas
Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository
Verify with get_historical_features
(on a small dataset) that the transformation gives expected output over historical data
Verify with get_online_features
on dev branch that the transformation correctly outputs online features
Submit a pull request to the staging / prod branches which impact production traffic
A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.
Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:
Loading...
Feature values in Feast are modeled as time-series records. Below is an example of a driver feature view with two feature columns (trips_today
, and earnings_today
):
The above table can be registered with Feast through the following feature view:
Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.
Given the following entity dataframe, imagine a user would like to join the above driver_hourly_stats
feature view onto it, while preserving the trip_success
column:
The timestamps within the entity dataframe above are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:
For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time specified.
Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).
Below is the resulting joined training dataframe. It contains both the original entity rows and joined feature values:
Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see
Note that there are many other offline / online stores Feast works with, including Spark, Azure, Hive, Trino, and PostgreSQL via community plugins. See for all supported data sources.
A custom setup can also be made by following .
The user can also query that table with a SQL query which pulls entities. See the documentation on for details
You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below. More information can be found .
Join the to get new updates on Feast / feature stores.
Read the page to understand the Feast data model.
Read the page.
Check out our section for more examples on how to use Feast.
Follow our guide for a more in-depth tutorial on using Feast.
Join other Feast users and contributors in and become part of the community!
A feature view is an object that represents a logical group of time-series feature data as it is found in a . Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see ).
a
zero or more
If the features are not related to a specific object, the feature view might not have entities; see below.
(optional, but recommended) a schema specifying one or more (without this, Feast will infer the schema by reading from the data source)
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from .
Together with , they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using .
Feature names must be unique within a .
Each field can have additional metadata associated with it, specified as key-value .
See for a example of how to use stream feature views to register your own streaming data pipelines in Feast.