Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
GitHub Repository: Find the complete Feast codebase on GitHub.
Slack: Feel free to ask questions or say hello! This is the main place where maintainers and contributors brainstorm and where users ask questions or discuss best practices.
Feast users should join #feast-general
or #feast-beginners
to ask questions
Feast developers / contributors should join #feast-development
Mailing list: We have both a user and developer mailing list.
Feast users should join feast-discuss@googlegroups.com group by clicking here.
Feast developers / contributors should join feast-dev@googlegroups.com group by clicking here.
Community Calendar: Includes community calls and design meetings.
Google Folder: This folder is used as a central repository for all Feast resources. For example:
Design proposals in the form of Request for Comments (RFC).
User surveys and meeting minutes.
Slide decks of conferences our contributors have spoken at.
Feast Linux Foundation Wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.
Slack: Need to speak to a human? Come ask a question in our Slack channel (link above).
GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.
StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to StackOverflow.
We have a user and contributor community call every two weeks (US & EU friendly).
Please join the above Feast user groups in order to see calendar invites to the community calls
Tuesday 10:00 am to 10:30 am PST
Meeting notes (incl recordings): https://bit.ly/feast-notes
We also have a #feast-development
community call every two weeks, where we discuss contributions + brainstorm best practices.
Tuesday 8:00 am to 8:30 am PST
Meeting notes (incl recordings): Feast Development Biweekly
The list below contains the functionality that contributors are planning to develop for Feast.
We welcome contribution to all items in the roadmap!
Have questions about the roadmap? Go to the Slack channel to ask on #feast-development.
Data Sources
Offline Stores
Online Stores
Feature Engineering
Streaming
Deployments
Feature Serving
Data Quality Management (See RFC)
Feature Discovery and Governance
Feast uses a registry to store all applied Feast objects (e.g. Feature views, entities, etc). The registry exposes methods to apply, list, retrieve and delete these objects, and is an abstraction with multiple implementations.
By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS, or Azure).
The quickstart guides that use feast init
will use a registry on a local file system. To allow Feast to configure a remote file registry, you need to create a GCS / S3 bucket that Feast can understand:
However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
Alternatively, a SQL Registry can be used for a more scalable registry.
This supports any SQLAlchemy compatible database as a backend. The exact schema can be seen in sql.py
We recommend users store their Feast feature definitions in a version controlled repository, which then via CI/CD automatically stays synced with the registry. Users will often also want multiple registries to correspond to different environments (e.g. dev vs staging vs prod), with staging and production registries with locked down write access since they can impact real user traffic. See Running Feast in Production for details on how to set this up.
Users can specify the registry through a feature_store.yaml
config file, or programmatically. We often see teams preferring the programmatic approach because it makes notebook driven development very easy:
feature_store.yaml
fileInstantiating a FeatureStore
object can then point to this:
In this tutorial we will
Deploy a local feature store with a Parquet file offline store and Sqlite online store.
Build a training dataset using our time series features from our Parquet files.
Ingest batch features ("materialization") and streaming features (via a Push API) into the online store.
Read the latest features from the offline store for batch scoring
Read the latest features from the online store for real-time inference.
Explore the (experimental) Feast UI
In this tutorial, we'll use Feast to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:
Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.
Feature and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).
(Experimental) Feast enables light-weight feature transformations so users can re-use transformation logic across online / offline use cases and across models.
Install the Feast SDK and CLI using pip:
In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see Running Feast with Snowflake/GCP/AWS
Bootstrap a new feature repository using feast init
from the command line.
Let's take a look at the resulting demo repo itself. It breaks down into
data/
contains raw demo parquet data
example_repo.py
contains demo feature definitions
feature_store.yaml
contains a demo setup configuring where data sources are
test_workflow.py
showcases how to run all key Feast commands, including defining, retrieving, and pushing features. You can run this with python test_workflow.py
.
The feature_store.yaml
file configures the key overall architecture of the feature store.
The provider value sets default offline and online stores.
The offline store provides the compute layer to process historical data (for generating training data & feature values for serving).
The online store is a low latency store of the latest feature values (for powering real-time inference).
Valid values for provider
in feature_store.yaml
are:
local: use a SQL registry or local file registry. By default, use a file / Dask based offline store + SQLite online store
gcp: use a SQL registry or GCS file registry. By default, use BigQuery (offline store) + Google Cloud Datastore (online store)
aws: use a SQL registry or S3 file registry. By default, use Redshift (offline store) + DynamoDB (online store)
Note that there are many other offline / online stores Feast works with, including Spark, Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported data sources.
A custom setup can also be made by following Customizing Feast.
The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.
There's an included test_workflow.py
file which runs through a full sample workflow:
Register feature definitions through feast apply
Generate a training dataset (using get_historical_features
)
Generate features for batch scoring (using get_historical_features
)
Ingest batch features into an online store (using materialize_incremental
)
Fetch online features to power real time inference (using get_online_features
)
Ingest streaming features into offline / online stores (using push
)
Verify online features are updated / fresher
We'll walk through some snippets of code below and explain
The apply
command scans python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example_repo.py
and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by configuring online_store
in feature_store.yaml
.
To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.
Feast needs a list of entities (e.g. driver ids) and timestamps. Feast will intelligently join relevant tables to create the relevant feature vectors. There are two ways to generate this list:
The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation.
The user can also query that table with a SQL query which pulls entities. See the documentation on feature retrieval for details
Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
To power a batch model, we primarily need to generate features with the get_historical_features
call, but using the current timestamp
We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental
serializes all new features since the last materialize
call).
At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features()
. These feature vectors can then be fed to the model.
You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below. More information can be found here.
The driver_activity_v1
feature service pulls all features from the driver_hourly_stats
feature view:
View all registered features, data sources, entities, and feature services with the Web UI.
One of the ways to view this is with the feast ui
command.
test_workflow.py
Take a look at test_workflow.py
again. It showcases many sample flows on how to interact with Feast. You'll see these show up in the upcoming concepts + architecture + tutorial pages as well.
Read the Concepts page to understand the Feast data model.
Read the Architecture page.
Check out our Tutorials section for more examples on how to use Feast.
Follow our Running Feast with Snowflake/GCP/AWS guide for a more in-depth tutorial on using Feast.
Join other Feast users and contributors in Slack and become part of the community!
A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features.
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.
Below is an example data source with a single entity column (driver
) and two feature columns (trips_today
, and rating
).
Feast supports primarily time-stamped tabular data as data sources. There are many kinds of possible data sources:
Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
[Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
(Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on-demand feature views, which allow light-weight feature engineering and combining features across sources.
Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).
A key command to use in Feast is the materialize_incremental
command, which fetches the latest values for all entities in the batch source and ingests these values into the online store.
Materialization can be called programmatically or through the CLI:
If the schema
parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply
. The way it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box this inference is performed by inspecting the schema of the table in the cloud data warehouse, or if a query is provided to the source, by running the query with a LIMIT
clause and inspecting the result.
Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.
To push data into the offline or online stores: see push sources for details.
(experimental) To use a contrib Spark processor to ingest from a topic, see Tutorial: Building streaming features
The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev
, staging
, prod
).
For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.
For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline / online. We explore this more in the next concept page (Data ingestion)
Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).
Feast supports several patterns of feature retrieval.
Training data generation
Fetching user and item features for (user, item) pairs when training a production recommendation model
get_historical_features
Offline feature retrieval for batch predictions
Predicting user churn for all users on a daily basis
get_historical_features
Online feature retrieval for real-time model predictions
Fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent
get_online_features
Dataset can be created from:
Results of historical retrieval
[planned] Logging features during writing to online store (from batch source or stream)
To create a saved dataset from historical features for later retrieval or analysis, a user needs to call get_historical_features
method first and then pass the returned retrieval job to create_saved_dataset
method. create_saved_dataset
will trigger the provided retrieval job (by calling .persist()
on it) to store the data using the specified storage
behind the scenes. Storage type must be the same as the globally configured offline store (e.g it's impossible to persist data to a different offline source). create_saved_dataset
will also create a SavedDataset
object with all of the related metadata and will write this object to the registry.
Saved dataset can be retrieved later using the get_saved_dataset
method in the feature store:
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.
Entities are used by Feast in many contexts, as we explore below:
Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:
zero entities (e.g. a global feature like num_daily_global_transactions)
one entity (e.g. a user feature like user_age or last_5_bought_items)
multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)
Feast refers to this collection of entities for a feature view as an entity key.
Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists understand how other teams build features for the entity they are most interested in.
Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.
At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).
Q: Can I retrieve features for all entities?
Kind of.
For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.
Generally, Feast supports several patterns of feature retrieval:
Training data generation (via feature_store.get_historical_features(...)
)
Offline feature retrieval for batch scoring (via feature_store.get_historical_features(...)
)
Online feature retrieval for real-time model predictions
via the SDK: feature_store.get_online_features(...)
via deployed feature server endpoints: requests.post('http://localhost:6566/get-online-features', data=json.dumps(online_request))
Each of these retrieval mechanisms accept:
some way of specifying entities (to fetch features for)
For code examples of how the below work, inspect the generated repository from feast init -t [YOUR TEMPLATE]
(gcp
, snowflake
, and aws
are the most fully fleshed).
Before diving into how to retrieve features, we need to understand some high level concepts in Feast.
Feature services are used during
The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now()
)
Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.
Applying a feature service does not result in an actual service being deployed.
Feature services enable referencing all or some features from a feature view.
Retrieving from the online store with a feature service
Retrieving from the offline store with a feature service
This mechanism of retrieving features is only recommended as you're experimenting. Once you want to launch experiments or serve models, feature services are recommended.
Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>
Feature references are used for the retrieval of features from Feast:
It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.
The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.
Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.
A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.
Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.
Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.
Feast abstracts away point-in-time join complexities with the get_historical_features
API.
We go through the major steps, and also show example code. Note that the quickstart templates generally have end-to-end working examples for all these cases.
Feast accepts either:
Feast accepts either a Pandas dataframe as the entity dataframe (including entity keys and timestamps) or a SQL query to generate the entities.
Both approaches must specify the full entity key needed as well as the timestamps. Feast then joins features onto this dataframe.
You can also pass a SQL string to generate the above dataframe. This is useful for getting all entities in a timeframe from some data source.
Feast will ensure the latest feature values for registered features are available. At retrieval time, you need to supply a list of entities and the corresponding features to be retrieved. Similar to get_historical_features
, we recommend using feature services as a mechanism for grouping features in a model version.
Note: unlike get_historical_features
, the entity_rows
do not need timestamps since you only want one feature value per entity key.
There are several options for retrieving online features: through the SDK, or through a feature server
Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. was the primary motivation for creating dataset concept.
Dataset's metadata is stored in the Feast registry and raw data (features, entities, additional input keys and timestamp) is stored in the .
[planned] Logging request (including input for ) and response during feature serving
Check out our to see how this concept can be applied in a real-world use case.
At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch correct features for to generate a training dataset.
In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an that welcomes contribution to make this a more intuitive API.
some way to specify the features to fetch (either via , which group features needed for a model version, or )
Before beginning, you need to instantiate a local FeatureStore
object that knows how to parse the registry (see )
A feature service is an object that represents a logical group of features from one or more . Feature Services allows features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.
Note, if you're using , then those features can be added here without additional entity values in the entity_rows
parameter.
, which group features needed for a model version
This approach requires you to deploy a feature server (see ).
A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.
A materialization engine abstracts over specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (which is the default LocalMaterializationEngine), or delegates the materialization to seperate components (e.g. AWS Lambda, as implemented by the the LambdaMaterializaionEngine).
If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see this guide for more details.
Please see feature_store.yaml for configuring engines.
The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.
Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports four different backends.
Local
: Used as a local backend for storing the registry during development
S3
: Used as a centralized backend for storing the registry on AWS
GCS
: Used as a centralized backend for storing the registry on GCP
[Alpha] Azure
: Used as centralized backend for storing the registry on Azure Blob storage.
The feature registry is updated during different operations when using Feast. More specifically, objects within the registry (entities, feature views, feature services) are updated when running apply
from the Feast CLI, but metadata about objects can also be updated during operations like materialization.
Users interact with a feature registry through the Feast SDK. Listing all feature views:
Or retrieving a specific feature view:
Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize
command.
Here is an example batch data source:
Once the above data source is materialized into Feast (using feast materialize
), the feature values will be stored as follows:
A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).
Offline stores are primarily used for two reasons:
Building training datasets from time-series features.
Materializing (loading) features into an online store to serve those features at low-latency in a production setting.
Only a single offline store can be used at a time. Moreover, offline stores are not compatible with all data sources; for example, the BigQuery
offline store cannot be used to query a file-based data source.
Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
Feast Apply: The user (or CI) publishes versioned controlled feature definitions using feast apply
. This CLI command updates infrastructure and persists definitions in the object store registry.
Feast Materialize: The user (or scheduler) executes feast materialize
which loads features from the offline store into the online store.
Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset that can be used for training models.
Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
Prediction: A backend system makes a request for a prediction from the model serving service.
Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
A complete Feast deployment contains the following components:
Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
Feast Python SDK/CLI: The primary user facing SDK. Used to:
Manage version controlled feature definitions.
Materialize (load) feature values into the online store.
Build and retrieve training datasets from the offline store.
Retrieve online features.
Stream Processor: The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
Java and Go Clients are also available for online feature retrieval.
We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.
Don't see your offline store or online store of choice here? Check out our guides to make a custom one!
In order for a plugin integration to be highlighted, it must meet the following requirements:
The plugin must have some basic documentation on how it should be used.
The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).
In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:
The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.
There is documentation and a tutorial on how to use the integration.
The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.
If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.
The feature registry is a of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.
The storage schema of features within the online store mirrors that of the original data source. One key difference is that for each , only the latest feature values are stored. No historical values are stored.
Features can also be written directly to the online store via .
Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp
provider supports as an offline store and as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local
, gcp
, and aws
) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp
provider but use Redis as the online store instead of Datastore.
If the built-in providers are not sufficient, you can create your own custom provider. Please see for more details.
Please see for configuring providers.
An offline store is an interface for working with historical time-series feature values that are stored in . The OfflineStore
interface has several different implementations, such as the BigQueryOfflineStore
, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see .
Offline stores are configured through the . When building training datasets or materializing features into an online store, Feast will use the configured offline store with your configured data sources to execute the necessary data operations.
Please see for more details on how to push features directly to the offline store in your feature store.
Create Stream Features: Stream features are created from streaming services such as Kafka or Kinesis, and can be pushed directly into Feast via the .
Batch Materialization Engine: The component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through .
See
The plugin must have tests. Ideally it would use the Feast universal tests (see this for an example), but custom tests are fine.
Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.
In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.
This tutorial guides you on how to use Feast with Scikit-learn. You will learn how to:
Train a model locally (on your laptop) using data from BigQuery
Test the model for online inference using SQLite (for fast iteration)
Test the model for online inference using Firestore (for production use)
Try it and let us know what you think!
Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.
When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.
In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.
This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.
This end-to-end tutorial will take you through the following steps:
Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features
Deploying Redshift as the interface Feast uses to build training datasets
Registering your features with Feast and configuring DynamoDB for online serving
Building a training dataset with Feast to train your credit scoring model
Loading feature values from S3 into DynamoDB
Making online predictions with your credit scoring model using features from DynamoDB
These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.
A common use case in machine learning, this tutorial is an end-to-end, production-ready fraud prediction system. It predicts in real-time whether a transaction made by a user is fraudulent.
Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.
Our end-to-end example will perform the following workflows:
Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model
Making online predictions from feature data
Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):
Feast (Feature Store) is a customizable operational data system that re-uses existing infrastructure to manage and serve machine learning features to realtime models.
Feast allows ML platform teams to:
Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensure that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
Note: Feast today primarily addresses timestamped structured data.
Feast helps ML platform teams with DevOps experience productionize real-time models. Feast can also help these teams build towards a feature platform that improves collaboration between engineers and data scientists.
Feast is likely not the right tool if you
are in an organization that’s just getting started with ML and is not yet sure what the business impact of ML is
rely primarily on unstructured data
need very low latency feature retrieval (e.g. p99 feature retrieval << 10ms)
have a small team to support a large number of use cases
a data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and integrations with tools like Airflow to make features consistently available.
a data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
a database: Feast is not a database, but helps manage data stored in other systems (e.g. BigQuery, Snowflake, DynamoDB, Redis) to make features consistently available at training / serving time
batch + streaming feature engineering: Feast primarily processes already transformed feature values (though it offers experimental light-weight transformations). Users usually integrate Feast with upstream systems (e.g. existing ETL/ELT pipelines). Tecton is a more fully featured feature platform which addresses these needs.
native streaming feature integration: Feast enables users to push streaming features, but does not pull from streaming sources or manage streaming pipelines. Tecton is a more fully featured feature platform which orchestrates end to end streaming pipelines.
feature sharing: Feast has experimental functionality to enable discovery and cataloguing of feature metadata with a Feast web UI (alpha). Feast also has community contributed plugins with DataHub and Amundsen. Tecton also more robustly addresses these needs.
lineage: Feast helps tie feature values to model versions, but is not a complete solution for capturing end-to-end lineage from raw data sources to model versions. Feast also has community contributed plugins with DataHub and Amundsen. Tecton captures more end-to-end lineage by also managing feature transformations.
data quality / drift detection: Feast has experimental integrations with Great Expectations, but is not purpose built to solve data drift / data quality issues. This requires more sophisticated monitoring across data pipelines, served feature values, labels, and model versions.
Many companies have used Feast to power real-world ML use cases such as:
Personalizing online recommendations by leveraging pre-computed historical user or item features.
Online fraud detection, using features that compare against (pre-computed) historical transaction patterns
Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch
Credit scoring, using pre-computed historical features to compute probability of default
The best way to learn Feast is to use it. Join our Slack channel and head over to our Quickstart and try it out!
Explore the following resources to get started with Feast:
Quickstart is the fastest way to get started with Feast
Concepts describes all important Feast API concepts
Architecture describes Feast's overall architecture.
Tutorials shows full examples of using Feast in machine learning applications.
Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.
Reference contains detailed API and design documents.
Contributing contains resources for anyone who wants to contribute to Feast.
Tutorial on how to use the SQL registry for scalable registry updates
By default, the registry Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS).
However, there's inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
PostgreSQL
MySQL
Sqlite
Feast can use the SQL Registry via a config change in the feature_store.yaml file. An example of how to configure this would be:
There are some things to note about how the SQL registry works:
Once instantiated, the Registry ensures the tables needed to store data exist, and creates them if they do not.
Upon tearing down the feast project, the registry ensures that the tables are dropped from the database.
The schema for how data is laid out in tables can be found . It is intentionally simple, storing the serialized protobuf versions of each Feast object keyed by its name.
The SQL Registry should be used when materializing feature views concurrently to ensure correctness of data in the registry. This can be achieved by simply running feast materialize or feature_store.materialize multiple times using a correctly configured feature_store.yaml. This will make each materialization process talk to the registry database concurrently, and ensure the metadata updates are serialized.
An alternative to the file-based registry is the which ships with Feast. This implementation stores the registry in a relational database, and allows for changes to individual objects atomically. Under the hood, the SQL Registry implementation uses to abstract over the different databases. Consequently, any by SQLAlchemy can be used by the SQL Registry. The following databases are supported and tested out of the box:
Specifically, the registry_type needs to be set to sql in the registry config block. On doing so, the path should refer to the for the database to be used, as expected by SQLAlchemy. No other additional commands are currently needed to configure this registry.
Install Feast using pip:
Install Feast with Snowflake dependencies (required when using Snowflake):
Install Feast with GCP dependencies (required when using BigQuery or Firestore):
Install Feast with AWS dependencies (required when using Redshift or DynamoDB):
Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):
Initial demonstration of Snowflake as an offline+online store with Feast, using the Snowflake demo template.
In the steps below, we will set up a sample Feast project that leverages Snowflake as an offline store + materialization engine + online store.
Starting with data in a Snowflake table, we will register that table to the feature store and define features associated with the columns in that table. From there, we will generate historical training data based on those feature definitions and then materialize the latest feature values into the online store. Lastly, we will retrieve the materialized feature values.
Our template will generate new data containing driver statistics. From there, we will show you code snippets that will call to the offline store for generating training datasets, and then the code for calling the online store to serve you the latest feature values to serve models in production.
The following files will automatically be created in your project folder:
feature_store.yaml -- This is your main configuration file
driver_repo.py -- This is your main feature definition file
test.py -- This is a file to test your feature store configuration
feature_store.yaml
Here you will see the information that you entered. This template will use Snowflake as the offline store, materialization engine, and the online store. The main thing to remember is by default, Snowflake objects have ALL CAPS names unless lower case was specified.
test.py
test.py
Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.
Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.
The materialize command allows users to materialize features over a specific historical time range into the online store.
The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.
It is also possible to materialize for specific feature views by using the -v / --views
argument.
The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
For simplicity, Feast also provides a materialize command that will only ingest new data that has arrived in the offline store. Unlike materialize
, materialize-incremental
will track the state of previous ingestion runs inside of the feature registry.
The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00
).
The materialize-incremental
command functions similarly to materialize
in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.
Unlike materialize
, materialize-incremental
automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental
is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental
, the end timestamp will be tracked.
Subsequent runs of materialize-incremental
will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e, different feature views may have different periods that have been materialized into the online store.
Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap such components with more performant components, at the cost of Feast depending on additional infrastructure.
However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale for large data sets, since it's executed on a single-process.
Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.
The default Feast is a file-based registry. Any changes to the feature repo, or materializing data into the online store, results in a mutation to the registry.
The recommended solution in this case is to use the , which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL, Postgres, etc).
Feast supports pluggable , that allow the materialization process to be scaled up. Aside from the local process, Feast supports a , and a .
Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.
Start by defining the feature references (e.g., driver_trips:average_daily_rides
) for the features that you would like to retrieve from the offline store. These features can come from multiple feature tables. The only requirement is that the feature tables that make up the feature references have the same entity (or composite entity), and that they aren't located in the same offline store.
3. Create an entity dataframe
An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp
and all entities (primary keys) necessary to join feature tables onto. All entities found in feature views that are being joined onto the entity dataframe must be found as column on the entity dataframe.
It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.
Pandas:
In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp
column and a driver_id
entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.
SQL (Alternative):
Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
4. Launch historical retrieval
Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features()
. This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df()
.
The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.
Online stores only maintain the current state of features, i.e latest feature values. No historical data is stored or served.
Please ensure that you have materialized (loaded) your feature values into the online store before starting
Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.
Next, we will create a feature store object and call get_online_features()
which reads the relevant feature values directly from the online store.
A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set). Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.
There are three common ways teams approach having separate environments
Have separate git branches for each environment
Have separate feature_store.yaml
files and separate Feast object definitions that correspond to each environment
Have separate feature_store.yaml
files per environment, but share the Feast object definitions
To keep a clear separation of the feature repos, teams may choose to have multiple long-lived branches in their version control system, one for each environment. In this approach, with CI/CD setup, changes would first be made to the staging branch, and then copied over manually to the production branch once verified in the staging environment.
feature_store.yaml
files and separate Feast object definitionsThe contents of this repository are shown below:
The repository contains three sub-folders:
staging/
: This folder contains the staging feature_store.yaml
and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.
production/
: This folder contains the production feature_store.yaml
and Feast objects. Typically users would first test changes in staging before copying the feature definitions into the production folder, before committing the changes.
.github
: This folder is an example of a CI system that applies the changes in either the staging
or production
repositories using feast apply
. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.
The feature_store.yaml
contains the following:
Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply
are tracked in the registry.db
. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.
It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.
If your organization consists of many independent data science teams or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment:
feature_store.yaml
filesThis approach is very similar to the previous approach, but instead of having feast objects duplicated and having to copy over changes, it may be possible to share the same Feast object definitions and have different feature_store.yaml
configuration.
An example of how such a repository would be structured is as follows:
Users can then apply the applying them to each environment in this way:
This setup has the advantage that you can share the feature definitions entirely, which may prevent issues with copy-pasting code.
In summary, once you have set up a Git based repository with CI that runs feast apply
on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.
For this approach, we have created an example repository () which contains two Feast projects, one per environment.
The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider
that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.
Here we'll be using the example repository we created in the previous guide, Create a feature store. You can re-create it by running feast init
in a new directory.
To have Feast deploy your infrastructure, run feast apply
from your command line while inside a feature repository:
Depending on whether the feature repository is configured to use a local
provider or one of the cloud providers like GCP
or AWS
, it may take from a couple of seconds to a minute to run to completion.
At this point, no data has been materialized to your online store. Feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize
. See Load data into the online store for more details.
If you need to clean up the infrastructure created by feast apply
, use the teardown
command.
Warning: teardown
is an irreversible command and will remove all feature store infrastructure. Proceed with caution!
****
After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise (e.g. on Kubernetes).
A typical production architecture looks like:
Important note: Feast is highly customizable and modular.
Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.
For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.
Additionally, please check the how-to guide for some specific recommendations on how to scale Feast.
In this guide we will show you how to:
Deploy your feature store and keep your infrastructure in sync with your feature repository
Keep the data in your online store up to date (from batch and stream sources)
Use Feast for model training and serving
The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running feast apply
commits feature definitions to a registry, which users can then read elsewhere.
Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available here.
We recommend typically setting up CI/CD to automatically run feast plan
and feast apply
when pull requests are opened / merged.
A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set).
Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.
Different options are presented in the how-to guide.
To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.
Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.
This approach may not scale to large amounts of data, which users of Feast may be dealing with in production. In this case, we recommend using one of the more scalable materialization engines, such as the Bytewax Materialization Engine, or the Snowflake Materialization Engine. Users may also need to write a custom materialization engine to work on their existing infrastructure.
The Bytewax materialization engine can run materialization on an existing Kubernetes cluster. An example configuration of this in a feature_store.yaml
is as follows:
See also data ingestion for code snippets
It is up to you to orchestrate and schedule runs of materialization.
Feast keeps the history of materialization in its registry so that the choice could be as simple as a unix cron util. Cron util should be sufficient when you have just a few materialization jobs (it's usually one materialization job per feature view) triggered infrequently.
However, the amount of work can quickly outgrow the resources of a single machine. That happens because the materialization job needs to repackage all rows before writing them to an online store. That leads to high utilization of CPU and memory. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.
If you are using Airflow as a scheduler, Feast can be invoked through a PythonOperator after the Python SDK has been installed into a virtual environment and your feature repo has been synced:
Important note: Airflow worker must have read and write permissions to the registry file on GCS / S3 since it pulls configuration and updates materialization history.
See more details at data ingestion, which shows how to ingest streaming features or 3rd party feature data via a push API.
This supports pushing feature values into Feast to both online or offline stores.
For more details, see feature retrieval
After we've defined our features and data sources in the repository, we can generate training datasets. We highly recommend you use a FeatureService
to version the features that go into a specific model version.
The first thing we need to do in our training code is to create a FeatureStore
object with a path to the registry.
One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml
to those pipelines. This feature_store.yaml
file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.
Then, you need to generate an entity dataframe. You have two options
Create an entity dataframe manually and pass it in
Use a SQL query to dynamically generate lists of entities (e.g. all entities within a time range) and timestamps to pass into Feast
Then, training data can be retrieved as follows:
The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the feature service name and the model versions have some established convention.
For example, in MLflow:
It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.
Once you have successfully loaded data from batch / streaming sources into the online store, you can start consuming features for model inference.
This approach is the most convenient to keep your infrastructure as minimalistic as possible and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.
To integrate online retrieval into your service use the following code:
To deploy a Feast feature server on Kubernetes, you can use the included helm chart + tutorial (which also has detailed instructions and an example tutorial).
Basic steps
Add the Feast Helm repository and download the latest charts:
Run Helm Install
This will deploy a single service. The service must have read access to the registry file on cloud storage. It will keep a copy of the registry in their memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
You might want to dynamically set parts of your configuration from your environment. For instance to deploy Feast to production and development with the same configuration, but a different server. Or to inject secrets without exposing them in your git repo. To do this, it is possible to use the ${ENV_VAR}
syntax in your feature_store.yaml
file. For instance:
It is possible to set a default value if the environment variable is not set, with ${ENV_VAR:"default"}
. For instance:
In summary, the overall architecture in production may look like:
Feast SDK is being triggered by CI (eg, Github Actions). It applies the latest changes from the feature repo to the Feast database-backed registry
Data ingestion
Batch data: Airflow manages materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
If your offline and online workloads are in Snowflake, the Snowflake materialization engine is likely the best option.
If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax materialization engine is likely the best option.
If none of these engines suite your needs, you may continue using the in-process engine, or write a custom engine (e.g with Spark or Ray).
Stream data: The Feast Push API is used within existing Spark / Beam pipelines to push feature values to offline / online stores
Online features are served via the Python feature server over HTTP, or consumed using the Feast Python SDK.
Feast Python SDK is called locally to generate a training dataset
Feast is highly pluggable and configurable:
One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built in options. See reference documentation for details.
The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them.
Below are some guides on how to add new custom components:
Feast batch materialization operations (materialize
and materialize-incremental
) execute through a BatchMaterializationEngine
.
Custom batch materialization engines allow Feast users to extend Feast to customize the materialization process. Examples include:
Setting up custom materialization-specific infrastructure during feast apply
(e.g. setting up Spark clusters or Lambda Functions)
Launching custom batch ingestion (materialization) jobs (Spark, Beam, AWS Lambda)
Tearing down custom materialization-specific infrastructure during feast teardown
(e.g. tearing down Spark clusters, or deleting Lambda Functions)
The fastest way to add custom logic to Feast is to extend an existing materialization engine. The most generic engine is the LocalMaterializationEngine
which contains no cloud-specific logic. The guide that follows will extend the LocalProvider
with operations that print text to the console. It is up to you as a developer to add your custom code to the engine methods, but the guide below will provide the necessary scaffolding to get you started.
The first step is to define a custom materialization engine class. We've created the MyCustomEngine
below.
Notice how in the above engine we have only overwritten two of the methods on the LocalMaterializatinEngine
, namely update
and materialize
. These two methods are convenient to replace if you are planning to launch custom batch jobs.
Notice how the batch_engine
field above points to the module and class where your engine can be found.
Now you should be able to use your engine by running a Feast command:
It may also be necessary to add the module root path to your PYTHONPATH
as follows:
That's it. You should now have a fully functional custom engine!
This guide will go over:
how Feast tests are setup
how to extend the test suite to test new functionality
how to use the existing test suite to test a new custom offline / online store
Unit tests are contained in sdk/python/tests/unit
. Integration tests are contained in sdk/python/tests/integration
. Let's inspect the structure of sdk/python/tests/integration
:
feature_repos
has setup files for most tests in the test suite.
The tests are organized by which Feast component(s) they test.
The universal feature repo refers to a set of fixtures (e.g. environment
and universal_data_sources
) that can be parametrized to cover various combinations of offline stores, online stores, and providers. This allows tests to run against all these various combinations without requiring excess code. The universal feature repo is constructed by fixtures in conftest.py
with help from the various files in feature_repos
.
Tests in Feast are split into integration and unit tests. If a test requires external resources (e.g. cloud resources on GCP or AWS), it is an integration test. If a test can be run purely locally (where locally includes Docker resources), it is a unit test.
Integration tests test non-local Feast behavior. For example, tests that require reading data from BigQuery or materializing data to DynamoDB are integration tests. Integration tests also tend to involve more complex Feast functionality.
Unit tests test local Feast behavior. For example, tests that only require registering feature views are unit tests. Unit tests tend to only involve simple Feast functionality.
E2E tests
E2E tests test end-to-end functionality of Feast over the various codepaths (initialize a feature store, apply, and materialize).
The main codepaths include:
basic e2e tests for offline stores
test_universal_e2e.py
go feature server
test_go_feature_server.py
python http server
test_python_feature_server.py
usage tracking
test_usage_e2e.py
data quality monitoring feature validation
test_validation.py
Offline and Online Store Tests
Offline and online store tests mainly test for the offline and online retrieval functionality.
The various specific functionalities that are tested include: