Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider
that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.
Here we'll be using the example repository we created in the previous guide, Create a feature store. You can re-create it by running feast init
in a new directory.
To have Feast deploy your infrastructure, run feast apply
from your command line while inside a feature repository:
Depending on whether the feature repository is configured to use a local
provider or one of the cloud providers like GCP
or AWS
, it may take from a couple of seconds to a minute to run to completion.
At this point, no data has been materialized to your online store. Feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize
. See Load data into the online store for more details.
If you need to clean up the infrastructure created by feast apply
, use the teardown
command.
Warning: teardown
is an irreversible command and will remove all feature store infrastructure. Proceed with caution!
****
Install Feast using pip:
Install Feast with GCP dependencies (required when using BigQuery or Firestore):
A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See Feature Repository for a detailed explanation of feature repositories.
The easiest way to create a new feature repository to use feast init
command:
The init
command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:
Enter the directory:
You can now use this feature repository for development. You can try the following:
Run feast apply
to apply these definitions to Feast.
Edit the example feature definitions in example.py
and run feast apply
again to change feature definitions.
Initialize a git repository in the same directory and checking the feature repository into version control.
Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production.
Models need consistent access to data: ML systems built on traditional data infrastructure are often coupled to databases, object stores, streams, and files. A result of this coupling, however, is that any change in data infrastructure may break dependent ML systems. Another challenge is that dual implementations of data retrieval for training and serving can lead to inconsistencies in data, which in turn can lead to training-serving skew.
Feast decouples your models from your data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval. Feast also provides a consistent means of referencing feature data for retrieval, and therefore ensures that models remain portable when moving from training to serving.
Deploying new features into production is difficult: Many ML teams consist of members with different objectives. Data scientists, for example, aim to deploy features into production as soon as possible, while engineers want to ensure that production systems remain stable. These differing objectives can create an organizational friction that slows time-to-market for new features.
Feast addresses this friction by providing both a centralized registry to which data scientists can publish features, and a battle-hardened serving layer. Together, these enable non-engineering teams to ship features into production with minimal oversight.
Models need point-in-time correct data: ML models in production require a view of data consistent with the one on which they are trained, otherwise the accuracy of these models could be compromised. Despite this need, many data science projects suffer from inconsistencies introduced by future feature values being leaked to models during training.
Feast solves the challenge of data leakage by providing point-in-time correct feature retrieval when exporting feature datasets for model training.
Features aren't reused across projects: Different teams within an organization are often unable to reuse features across projects. The siloed nature of development and the monolithic design of end-to-end ML systems contribute to duplication of feature creation and usage across teams and projects.
Feast addresses this problem by introducing feature reuse through a centralized system (a registry). This registry enables multiple teams working on different projects not only to contribute features, but also to reuse these same features. With Feast, data scientists can start new ML projects by selecting previously engineered features from a centralized registry, and are no longer required to develop new features for each project.
Feature engineering: We aim for Feast to support light-weight feature engineering as part of our API.
Feature discovery: We also aim for Feast to include a first-class user interface for exploring and discovering entities and features.
Feature validation: We additionally aim for Feast to improve support for statistics generation of feature data and subsequent validation of these statistics. Current support is limited.
ETL or ELT system: Feast is not (and does not plan to become) a general purpose data transformation or pipelining system. Feast plans to include a light-weight feature engineering toolkit, but we encourage teams to integrate Feast with upstream ETL/ELT systems that are specialized in transformation.
Data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
Data catalog: Feast is not a general purpose data catalog for your organization. Feast is purely focused on cataloging features for use in ML pipelines or systems, and only to the extent of facilitating the reuse of features.
The best way to learn Feast is to use it. Head over to our Quickstart and try it out!
Explore the following resources to get started with Feast:
Quickstart is the fastest way to get started with Feast
Getting started provides a step-by-step guide to using Feast.
Concepts describes all important Feast API concepts.
Reference contains detailed API and design documents.
Contributing contains resources for anyone who wants to contribute to Feast.
Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.
Start by defining the feature references (e.g., driver_trips:average_daily_rides
) for the features that you would like to retrieve from the offline store. These features can come from multiple feature tables. The only requirement is that the feature tables that make up the feature references have the same entity (or composite entity), and that they aren't located in the same offline store.
3. Create an entity dataframe
An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp
and all entities (primary keys) necessary to join feature tables onto. All entities found in feature views that are being joined onto the entity dataframe must be found as column on the entity dataframe.
It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.
Pandas:
In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp
column and a driver_id
entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.
SQL (Alternative):
Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
4. Launch historical retrieval
Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features()
. This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df()
.
In this tutorial we will
Deploy a local feature store with a Parquet file offline store and Sqlite online store.
Build a training dataset using our time series features from our Parquet files.
Materialize feature values from the offline store into the online store.
Read the latest features from the online store for inference.
Install the Feast SDK and CLI using pip:
Bootstrap a new feature repository using feast init
from the command line:
The apply
command registers all the objects in your feature repository and deploys a feature store:
The apply
command builds a training dataset based on the time-series features defined in the feature repository:
The materialize
command loads the latest feature values from your feature views into your online store:
The top-level namespace within Feast is a . Users define one or more within a project. Each feature view contains one or more that relate to a specific . A feature view must always have a , which in turn is used during the generation of training and when materializing feature values into the online store.
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev
, staging
, prod
).
Projects are currently being supported for backward compatibility reasons. Projects may change in the future as we simplify the Feast API.
Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.
Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.
The materialize command allows users to materialize features over a specific historical time range into the online store.
The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.
It is also possible to materialize for specific feature views by using the -v / --views
argument.
The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
For simplicity, Feast also provides a materialize command that will only ingest new data that has arrived in the offline store. Unlike materialize
, materialize-incremental
will track the state of previous ingestion runs inside of the feature registry.
The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00
).
The materialize-incremental
command functions similarly to materialize
in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.
Unlike materialize
, materialize-incremental
automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental
is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental
, the end timestamp will be tracked.
Subsequent runs of materialize-incremental
will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e, different feature views may have different periods that have been materialized into the online store.
The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.
Online stores only maintain the current state of features, i.e latest feature values. No historical data is stored or served.
Please ensure that you have materialized (loaded) your feature values into the online store before starting
Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.
Next, we will create a feature store object and call get_online_features()
which reads the relevant feature values directly from the online store.