A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features.
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.
Below is an example data source with a single entity column (driver) and two feature columns (trips_today and rating).
Feast supports primarily time-stamped tabular data as data sources. There are many kinds of possible data sources:
Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
[Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
(Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on-demand feature views, which allow light-weight feature engineering and combining features across sources.
Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).
A key command to use in Feast is the materialize_incremental command, which fetches the latest values for all entities in the batch source and ingests these values into the online store.
Materialization can be called programmatically or through the CLI:
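For example, a minimal sketch of triggering it through the Python SDK (the CLI equivalent is feast materialize-incremental <END_TIMESTAMP>):

```python
from datetime import datetime

from feast import FeatureStore

# Assumes a feature repository with a feature_store.yaml in the current directory.
store = FeatureStore(repo_path=".")

# Ingest the latest feature values (up to now) from batch sources into the online store.
store.materialize_incremental(end_date=datetime.utcnow())
```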
If the schema parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply. How it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box, this inference is performed by inspecting the schema of the table in the cloud data warehouse, or, if a query is provided to the source, by running the query with a LIMIT clause and inspecting the result.
Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.
To push data into the offline or online stores: see push sources for details.
(experimental) To use a contrib Spark processor to ingest from a topic, see Tutorial: Building streaming features
In this tutorial we will:
Deploy a local feature store with a Parquet file offline store and Sqlite online store.
Build a training dataset using our time series features from our Parquet files.
Ingest batch features ("materialization") and streaming features (via a Push API) into the online store.
Read the latest features from the offline store for batch scoring
Read the latest features from the online store for real-time inference.
Explore the (experimental) Feast UI
In this tutorial, we'll use Feast to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:
Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.
Feature and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).
(Experimental) Feast enables light-weight feature transformations so users can re-use transformation logic across online / offline use cases and across models.
Install the Feast SDK and CLI using pip:
In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see Running Feast with Snowflake/GCP/AWS
Bootstrap a new feature repository using feast init from the command line.
Let's take a look at the resulting demo repo itself. It breaks down into:
data/ contains raw demo parquet data
example_repo.py contains demo feature definitions
feature_store.yaml contains a demo setup configuring where data sources are
test_workflow.py showcases how to run all key Feast commands, including defining, retrieving, and pushing features. You can run this with python test_workflow.py.
The feature_store.yaml file configures the key overall architecture of the feature store.
The provider value sets default offline and online stores.
The offline store provides the compute layer to process historical data (for generating training data & feature values for serving).
The online store is a low latency store of the latest feature values (for powering real-time inference).
Valid values for provider in feature_store.yaml are:
local: use a SQL registry or local file registry. By default, use a file / Dask based offline store + SQLite online store
gcp: use a SQL registry or GCS file registry. By default, use BigQuery (offline store) + Google Cloud Datastore (online store)
aws: use a SQL registry or S3 file registry. By default, use Redshift (offline store) + DynamoDB (online store)
Note that there are many other offline / online stores Feast works with, including Spark, Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported data sources.
A custom setup can also be made by following Customizing Feast.
The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.
There's an included test_workflow.py file which runs through a full sample workflow:
Register feature definitions through feast apply
Generate a training dataset (using get_historical_features)
Generate features for batch scoring (using get_historical_features)
Ingest batch features into an online store (using materialize_incremental)
Fetch online features to power real time inference (using get_online_features)
Ingest streaming features into offline / online stores (using push)
Verify online features are updated / fresher
We'll walk through some snippets of code below and explain them.
The apply command scans python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example_repo.py and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by configuring online_store in feature_store.yaml.
To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.
Feast needs a list of entities (e.g. driver ids) and timestamps. Feast will intelligently join relevant tables to create the relevant feature vectors. There are two ways to generate this list:
The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation.
The user can also query that table with a SQL query which pulls entities. See the documentation on feature retrieval for details
Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
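A minimal sketch of this flow (the feature names follow the demo driver_hourly_stats feature view; the label column is illustrative):

```python
from datetime import datetime

import pandas as pd

from feast import FeatureStore

# Entity dataframe: entity keys + event timestamps (and, typically, your labels).
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 13, 10, 59, 42),
        ],
        # Illustrative label column pulled from your label table / survey results.
        "label_driver_reported_satisfaction": [1, 5, 3],
    }
)

store = FeatureStore(repo_path=".")

# Point-in-time correct join of feature values onto the entity dataframe.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:trips_today", "driver_hourly_stats:rating"],
).to_df()
```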
To power a batch model, we primarily need to generate features with the get_historical_features call, but using the current timestamp.
We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).
At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.
You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below. More information can be found here.
The driver_activity_v1 feature service pulls all features from the driver_hourly_stats feature view:
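A sketch of such a feature service definition (driver_hourly_stats_view is assumed to be the FeatureView object defined in example_repo.py):

```python
from feast import FeatureService

driver_activity_v1 = FeatureService(
    name="driver_activity_v1",
    # Assumed: the FeatureView object registered in example_repo.py.
    features=[driver_hourly_stats_view],
)
```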
View all registered features, data sources, entities, and feature services with the Web UI.
One of the ways to view this is with the feast ui command.
test_workflow.py
Take a look at test_workflow.py again. It showcases many sample flows on how to interact with Feast. You'll see these show up in the upcoming concepts + architecture + tutorial pages as well.
Join the email newsletter to get new updates on Feast / feature stores.
Read the Concepts page to understand the Feast data model.
Read the Architecture page.
Check out our Tutorials section for more examples on how to use Feast.
Follow our Running Feast with Snowflake/GCP/AWS guide for a more in-depth tutorial on using Feast.
Join other Feast users and contributors in Slack and become part of the community!
Feast (Feature Store) is a customizable operational data system that re-uses existing infrastructure to manage and serve machine learning features to realtime models.
Feast allows ML platform teams to:
Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
Note: Feast today primarily addresses timestamped structured data.
Feast helps ML platform teams with DevOps experience productionize real-time models. Feast can also help these teams build towards a feature platform that improves collaboration between engineers and data scientists.
Feast is likely not the right tool if you
are in an organization that’s just getting started with ML and is not yet sure what the business impact of ML is
rely primarily on unstructured data
need very low latency feature retrieval (e.g. p99 feature retrieval << 10ms)
have a small team to support a large number of use cases
a data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
a database: Feast is not a database, but helps manage data stored in other systems (e.g. BigQuery, Snowflake, DynamoDB, Redis) to make features consistently available at training / serving time
Many companies have used Feast to power real-world ML use cases such as:
Personalizing online recommendations by leveraging pre-computed historical user or item features.
Online fraud detection, using features that compare against (pre-computed) historical transaction patterns
Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch
Credit scoring, using pre-computed historical features to compute probability of default
Explore the following resources to get started with Feast:
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).
For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.
Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).
Feast supports several patterns of feature retrieval.
Feast users should join #feast-general or #feast-beginners to ask questions
Feast developers / contributors should join #feast-development
Design proposals in the form of Request for Comments (RFC).
User surveys and meeting minutes.
Slide decks of conferences our contributors have spoken at.
Slack: Need to speak to a human? Come ask a question in our Slack channel (link above).
We have a user and contributor community call every two weeks (US & EU friendly).
Please join the above Feast user groups in order to see calendar invites to the community calls
Tuesday 10:00 am to 10:30 am PST
We also have a #feast-development community call every two weeks, where we discuss contributions + brainstorm best practices.
Tuesday 8:00 am to 8:30 am PST
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.
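For example, a driver entity might be defined as follows (a minimal sketch; the join key name is illustrative, and on older Feast versions the parameter is the singular join_key):

```python
from feast import Entity

driver = Entity(
    name="driver",            # unique name, e.g. shown in the web UI
    join_keys=["driver_id"],  # physical primary key used to join feature values
)
```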
Entities are used by Feast in many contexts, as we explore below:
Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:
zero entities (e.g. a global feature like num_daily_global_transactions)
one entity (e.g. a user feature like user_age or last_5_bought_items)
multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)
Feast refers to this collection of entities for a feature view as an entity key.
Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists to understand how other teams build features for the entity they are most interested in.
Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.
At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).
Q: Can I retrieve features for all entities?
Kind of.
For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.
Generally, Feast supports several patterns of feature retrieval:
Training data generation (via feature_store.get_historical_features(...))
Offline feature retrieval for batch scoring (via feature_store.get_historical_features(...))
Online feature retrieval for real-time model predictions
via the SDK: feature_store.get_online_features(...)
via deployed feature server endpoints: requests.post('http://localhost:6566/get-online-features', data=json.dumps(online_request))
Each of these retrieval mechanisms accept:
some way of specifying entities (to fetch features for)
For code examples of how the below work, inspect the generated repository from feast init -t [YOUR TEMPLATE] (gcp, snowflake, and aws are the most fully fleshed out).
Before diving into how to retrieve features, we need to understand some high level concepts in Feast.
Feature services are used during
The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now())
Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.
Applying a feature service does not result in an actual service being deployed.
Feature services enable referencing all or some features from a feature view.
Retrieving from the online store with a feature service
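A sketch, assuming a feature service named driver_activity_v1 has been applied and that driver_id is the entity join key:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")
feature_service = store.get_feature_service("driver_activity_v1")

# entity_rows keys are the join keys of the entities in the requested feature views.
online_features = store.get_online_features(
    features=feature_service,
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
).to_dict()
```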
Retrieving from the offline store with a feature service
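And a sketch of the offline equivalent (the entity dataframe columns are illustrative):

```python
from datetime import datetime

import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)]}
)

# get_historical_features also accepts a FeatureService in place of a feature list.
batch_df = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("driver_activity_v1"),
).to_df()
```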
This mechanism of retrieving features is only recommended as you're experimenting. Once you want to launch experiments or serve models, feature services are recommended.
Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>
Feature references are used for the retrieval of features from Feast:
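For example, referencing two features from the driver_hourly_stats feature view described earlier:

```python
# Feature references take the form "<feature_view>:<feature>"
features = [
    "driver_hourly_stats:trips_today",
    "driver_hourly_stats:rating",
]
```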
It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.
The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.
Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.
A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.
Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.
Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.
Feast abstracts away point-in-time join complexities with the get_historical_features API.
We go through the major steps, and also show example code. Note that the quickstart templates generally have end-to-end working examples for all these cases.
Feast accepts either a Pandas dataframe as the entity dataframe (including entity keys and timestamps) or a SQL query to generate the entities. Both approaches must specify the full entity key needed as well as the timestamps. Feast then joins features onto this dataframe.
You can also pass a SQL string to generate the above dataframe. This is useful for getting all entities in a timeframe from some data source.
Feast will ensure the latest feature values for registered features are available. At retrieval time, you need to supply a list of entities and the corresponding features to be retrieved. Similar to get_historical_features, we recommend using feature services as a mechanism for grouping features in a model version.
Note: unlike get_historical_features, the entity_rows do not need timestamps since you only want one feature value per entity key.
There are several options for retrieving online features: through the SDK, or through a feature server
an ETL / ELT system: Feast is not (and does not plan to become) a general purpose data transformation or pipelining system. Users often leverage dedicated data transformation tools to manage upstream data transformations.
a data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and on integrations with orchestration tools (e.g. Airflow) to make features consistently available.
reproducible model training / model backtesting / experiment management: Feast captures feature and model metadata, but does not version-control datasets / labels or manage train / test splits. Dedicated experiment management and data versioning tools are better suited for this.
batch + streaming feature engineering: Feast primarily processes already transformed feature values (though it offers experimental light-weight transformations). Users usually integrate Feast with upstream systems (e.g. existing ETL/ELT pipelines). A fully managed feature platform is a better fit if end-to-end feature engineering is required.
native streaming feature integration: Feast enables users to push streaming features, but does not pull from streaming sources or manage streaming pipelines. A feature platform that orchestrates end-to-end streaming pipelines is a better fit for this.
feature sharing: Feast has experimental functionality to enable discovery and cataloguing of feature metadata with its web UI. Feast also has community contributed plugins for DataHub and Amundsen. Dedicated data catalogs address these needs more robustly.
lineage: Feast helps tie feature values to model versions, but is not a complete solution for capturing end-to-end lineage from raw data sources to model versions. The community contributed DataHub and Amundsen plugins can help here, and platforms that also manage feature transformations can capture more end-to-end lineage.
data quality / drift detection: Feast has an experimental data quality monitoring integration, but is not purpose built to solve data drift / data quality issues. This requires more sophisticated monitoring across data pipelines, served feature values, labels, and model versions.
The best way to learn Feast is to use it. Join our Slack and head over to the Quickstart to try it out!
Quickstart is the fastest way to get started with Feast.
Concepts describes all important Feast API concepts.
Architecture describes Feast's overall architecture.
Tutorials shows full examples of using Feast in machine learning applications.
Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.
Reference contains detailed API and design documents.
Contributing contains resources for anyone who wants to contribute to Feast.
The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.
For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline / online. We explore this more in the next concept page.
GitHub repository: Find the complete Feast codebase on GitHub.
Governance: See the governance model of Feast, including who the maintainers are and how decisions are made.
Slack: Feel free to ask questions or say hello! This is the main place where maintainers and contributors brainstorm and where users ask questions or discuss best practices.
Mailing lists: We have both a user and a developer mailing list.
Feast users should join the user mailing list group.
Feast developers / contributors should join the developer mailing list group.
Community calendar: Includes community calls and design meetings.
Google Drive folder: A central repository for all Feast resources (e.g. design proposals, user surveys, meeting minutes, and slide decks).
LFAI wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.
GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.
StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to questions on StackOverflow.
At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch correct features for to generate a training dataset.
In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an open GitHub issue that welcomes contributions to make this a more intuitive API.
some way to specify the features to fetch (either via feature services, which group features needed for a model version, or feature references)
Before beginning, you need to instantiate a local FeatureStore object that knows how to parse the registry.
A feature service is an object that represents a logical group of features from one or more feature views. Feature services allow features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.
Note: if you're using feature views without entities, then those features can be added here without additional entity values in the entity_rows parameter.
feature services, which group features needed for a model version
This approach requires you to deploy a feature server (see the feature server documentation).
Training data generation (e.g. fetching user and item features for (user, item) pairs when training a production recommendation model): get_historical_features
Offline feature retrieval for batch predictions (e.g. predicting user churn for all users on a daily basis): get_historical_features
Online feature retrieval for real-time model predictions (e.g. fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent): get_online_features
Note: Feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps.
A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see [Alpha] On demand feature views).
Feature views consist of:
zero or more entities
If the features are not related to a specific object, the feature view might not have entities; see feature views without entities below.
a name to uniquely identify this feature view in the project.
(optional, but recommended) a schema specifying one or more features (without this, Feast will infer the schema by reading from the data source)
(optional, but recommended) metadata (for example, description, or other free-form metadata via tags)
(optional) a TTL, which limits how far back Feast will look when generating historical datasets
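Putting these pieces together, a sketch of a feature view over the demo driver data (the data path and tag values are illustrative):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    name="driver_hourly_stats_source",
    path="data/driver_stats.parquet",   # illustrative path to the parquet data source
    timestamp_field="event_timestamp",
)

driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",          # unique name within the project
    entities=[driver],
    ttl=timedelta(days=1),               # how far back Feast looks during historical retrieval
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="rating", dtype=Float32),
    ],
    source=driver_stats_source,
    tags={"team": "driver_performance"}, # free-form metadata
)
```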
Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.
Feature views are used during
The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from stream sources.
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.
If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only timestamps are needed for this feature view).
If the schema parameter is not specified in the creation of the feature view, Feast will infer the features during feast apply by creating a Field for each column in the underlying data source, except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
"Entity aliases" can be specified to join entity_dataframe
columns that do not match the column names in the source table of a FeatureView.
This could be used if a user has no control over these column names or if there are multiple entities are a subclass of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.
It is suggested that you dynamically specify the new FeatureView name using .with_name and the join_key_map override using .with_join_key_map instead of needing to register each new copy.
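A sketch of this pattern, assuming a location_stats_fv feature view keyed on a location entity with join key location_id:

```python
# Reuse one feature view under two aliases instead of registering two copies.
# location_stats_fv is assumed to be an existing FeatureView object.
origin_stats = location_stats_fv.with_name("origin_stats").with_join_key_map(
    {"location_id": "origin_id"}
)
destination_stats = location_stats_fv.with_name("destination_stats").with_join_key_map(
    {"location_id": "destination_id"}
)
```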
A field or feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made on an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month. Supported types for fields in Feast can be found in sdk/python/feast/types.py.
Fields are defined as part of feature views. Since Feast does not transform data, a field is essentially a schema that only contains a name and a type:
Together with data sources, they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using feature references.
Feature names must be unique within a feature view.
Each field can have additional metadata associated with it, specified as key-value tags.
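For example (a minimal sketch; the tag keys are illustrative, and tag support on Field may depend on the Feast version):

```python
from feast import Field
from feast.types import Int64

trips_today = Field(
    name="trips_today",
    dtype=Int64,
    tags={"owner": "driver-team", "description": "Completed trips so far today"},
)
```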
On demand feature views allows data scientists to use existing features and request time data (features only available at request time) to transform and create new features. Users define python transformation logic which is executed in both the historical retrieval and online retrieval paths.
Currently, these transformations are executed locally. This is fine for online serving, but does not scale well to offline retrieval.
This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could
Call get_historical_features to generate a training dataframe
Iterate in notebook on feature engineering in Pandas
Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository
Verify with get_historical_features (on a small dataset) that the transformation gives expected output over historical data
Verify with get_online_features on dev branch that the transformation correctly outputs online features
Submit a pull request to the staging / prod branches which impact production traffic
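A sketch of an on demand feature view that combines a feature view with request-time data (driver_hourly_stats_view and the column names are assumed from the earlier examples; exact imports can vary slightly across Feast versions):

```python
import pandas as pd

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# Request-time input: a value only known when the prediction request arrives.
vals_to_add = RequestSource(
    name="vals_to_add",
    schema=[Field(name="val_to_add", dtype=Int64)],
)

@on_demand_feature_view(
    sources=[driver_hourly_stats_view, vals_to_add],  # feature view assumed defined elsewhere
    schema=[Field(name="rating_plus_val", dtype=Float64)],
)
def transformed_rating(inputs: pd.DataFrame) -> pd.DataFrame:
    # Executed in both the historical and online retrieval paths.
    out = pd.DataFrame()
    out["rating_plus_val"] = inputs["rating"] + inputs["val_to_add"]
    return out
```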
A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.
Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:
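A heavily abridged sketch of such a definition (the Kafka connection details, the driver entity, and the batch source are assumptions, and exact parameters can vary by Feast version):

```python
from datetime import timedelta

from feast import Field, KafkaSource
from feast.data_format import JsonFormat
from feast.stream_feature_view import stream_feature_view
from feast.types import Float64

driver_stats_stream_source = KafkaSource(
    name="driver_stats_stream",
    kafka_bootstrap_servers="localhost:9092",   # assumed broker
    topic="drivers",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch_source,     # assumed batch source for backfills
    message_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double"
    ),
    watermark_delay_threshold=timedelta(minutes=5),
)

@stream_feature_view(
    entities=[driver],                          # assumed driver entity
    ttl=timedelta(days=1),
    mode="spark",                               # transformation runs on the Spark contrib processor
    schema=[Field(name="conv_percentage", dtype=Float64)],
    timestamp_field="event_timestamp",
    online=True,
    source=driver_stats_stream_source,
)
def driver_stats_stream_view(df):
    from pyspark.sql.functions import col

    return df.withColumn("conv_percentage", col("conv_rate") * 100.0)
```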
See here for an example of how to use stream feature views to register your own streaming data pipelines in Feast.
A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.
A materialization engine abstracts over specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (which is the default LocalMaterializationEngine), or to delegate the materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).
If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see this guide for more details.
Please see feature_store.yaml for configuring engines.
A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).
Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
If the built-in providers are not sufficient, you can create your own custom provider. Please see this guide for more details.
Please see feature_store.yaml for configuring providers.
We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.
Don't see your offline store or online store of choice here? Check out our guides to make a custom one!
In order for a plugin integration to be highlighted, it must meet the following requirements:
The plugin must have tests. Ideally it would use the Feast universal tests (see this guide for an example), but custom tests are fine.
The plugin must have some basic documentation on how it should be used.
The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).
In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:
The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.
There is documentation and a tutorial on how to use the integration.
The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.
If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.
Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize command.
The storage schema of features within the online store mirrors that of the original data source. One key difference is that for each entity key, only the latest feature values are stored. No historical values are stored.
Consider, for example, a batch data source with the driver entity column and the trips_today and rating feature columns shown earlier. Once this data source is materialized into Feast (using feast materialize), only the latest trips_today and rating values for each driver will be stored in the online store.
Features can also be written directly to the online store via push sources.
Feast uses a registry to store all applied Feast objects (e.g. Feature views, entities, etc). The registry exposes methods to apply, list, retrieve and delete these objects, and is an abstraction with multiple implementations.
By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS, or Azure).
The quickstart guides that use feast init will use a registry on a local file system. To allow Feast to configure a remote file registry, you need to create a GCS / S3 bucket that Feast can understand.
However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
Users can specify the registry through a feature_store.yaml config file, or programmatically. We often see teams preferring the programmatic approach because it makes notebook driven development very easy. Instantiating a FeatureStore object can then point to this configuration:
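A sketch of the programmatic approach (the project name and registry path are placeholders, and exact RepoConfig fields can vary slightly by Feast version):

```python
from feast import FeatureStore, RepoConfig

repo_config = RepoConfig(
    project="my_project",                         # placeholder project name
    provider="local",
    registry="gs://my-feast-bucket/registry.db",  # placeholder GCS path; S3 works similarly
    online_store={"type": "sqlite", "path": "online.db"},
    offline_store={"type": "file"},
)

store = FeatureStore(config=repo_config)
```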
The list below contains the functionality that contributors are planning to develop for Feast.
We welcome contribution to all items in the roadmap!
Have questions about the roadmap? Go to the Slack channel to ask on #feast-development.
Data Sources
Offline Stores
Online Stores
Feature Engineering
Streaming
Deployments
Feature Serving
Feature Discovery and Governance
Initial demonstration of Snowflake as an offline+online store with Feast, using the Snowflake demo template.
In the steps below, we will set up a sample Feast project that leverages Snowflake as an offline store + materialization engine + online store.
Starting with data in a Snowflake table, we will register that table to the feature store and define features associated with the columns in that table. From there, we will generate historical training data based on those feature definitions and then materialize the latest feature values into the online store. Lastly, we will retrieve the materialized feature values.
Our template will generate new data containing driver statistics. From there, we will show you code snippets that will call to the offline store for generating training datasets, and then the code for calling the online store to serve you the latest feature values to serve models in production.
The following files will automatically be created in your project folder:
feature_store.yaml -- This is your main configuration file
driver_repo.py -- This is your main feature definition file
test.py -- This is a file to test your feature store configuration
feature_store.yaml
Here you will see the information that you entered. This template will use Snowflake as the offline store, materialization engine, and the online store. The main thing to remember is by default, Snowflake objects have ALL CAPS names unless lower case was specified.
test.py
Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
Feast Apply: The user (or CI) publishes version controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.
Feast Materialize: The user (or scheduler) executes feast materialize, which loads features from the offline store into the online store.
Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset that can be used for training models.
Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
Prediction: A backend system makes a request for a prediction from the model serving service.
Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
A complete Feast deployment contains the following components:
Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
Feast Python SDK/CLI: The primary user facing SDK. Used to:
Manage version controlled feature definitions.
Materialize (load) feature values into the online store.
Build and retrieve training datasets from the offline store.
Retrieve online features.
Stream Processor: The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
Java and Go Clients are also available for online feature retrieval.
Alternatively, a SQL registry can be used for a more scalable registry.
This supports any SQLAlchemy compatible database as a backend. The exact schema can be seen in the SQL registry implementation.
We recommend users store their Feast feature definitions in a version controlled repository, which then via CI/CD automatically stays synced with the registry. Users will often also want multiple registries to correspond to different environments (e.g. dev vs staging vs prod), with staging and production registries having locked down write access since they can impact real user traffic. See the guide on running Feast in production for details on how to set this up.
Kafka / Kinesis sources (via the Push API)
On-demand Transformations (Alpha release)
Streaming Transformations (Alpha release)
Batch transformation (In progress)
AWS Lambda (Alpha release)
Kubernetes
Data Quality Management
Amundsen integration
DataHub integration
Feast Web UI (Beta release)
Create Stream Features: Stream features are created from streaming services such as Kafka or Kinesis, and can be pushed directly into Feast via the Push API.
Batch Materialization Engine: The component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through push sources.
Install Feast using pip:
Install Feast with Snowflake dependencies (required when using Snowflake):
Install Feast with GCP dependencies (required when using BigQuery or Firestore):
Install Feast with AWS dependencies (required when using Redshift or DynamoDB):
Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):
Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.
Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature tables. The only requirement is that the feature tables that make up the feature references have the same entity (or composite entity), and that they are located in the same offline store.
3. Create an entity dataframe
An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature tables onto. All entities found in feature views that are being joined onto the entity dataframe must be found as a column on the entity dataframe.
It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.
Pandas:
In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.
SQL (Alternative):
Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
4. Launch historical retrieval
Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().
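A minimal end-to-end sketch of these steps, assuming a repository containing a driver_trips feature view keyed on driver_id:

```python
from datetime import datetime

import pandas as pd

from feast import FeatureStore

# Feature references to retrieve (may span multiple feature views).
features = ["driver_trips:average_daily_rides"]

# Entity dataframe: one row per (entity key, event_timestamp) to join features onto.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)],
    }
)

store = FeatureStore(repo_path=".")

retrieval_job = store.get_historical_features(entity_df=entity_df, features=features)
training_df = retrieval_job.to_df()  # convert the job result into a Pandas dataframe
```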
The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.
Online stores only maintain the current state of features, i.e. the latest feature values. No historical data is stored or served.
Please ensure that you have materialized (loaded) your feature values into the online store before starting
Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.
Next, we will create a feature store object and call get_online_features(), which reads the relevant feature values directly from the online store.
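A sketch of this, reusing the driver_trips:average_daily_rides feature reference from above:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=["driver_trips:average_daily_rides"],
    entity_rows=[{"driver_id": 1001}],  # no timestamps: only the latest value per entity is returned
).to_dict()
```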
Feast is highly pluggable and configurable:
One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built in options. See reference documentation for details.
The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them.
Below are some guides on how to add new custom components:
Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.
Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.
The materialize command allows users to materialize features over a specific historical time range into the online store, for example feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00. This command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.
It is also possible to materialize for specific feature views by using the -v / --views argument.
The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
For simplicity, Feast also provides a materialize command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.
The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).
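The CLI form of that example is feast materialize-incremental 2021-04-08T00:00:00; a sketch of the same call through the Python SDK:

```python
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Load all new feature values that have arrived, up to this end date.
store.materialize_incremental(end_date=datetime(2021, 4, 8))
```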
The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.
Unlike materialize, materialize-incremental automatically determines the start time from which to load features from the batch sources of each feature view. The first time materialize-incremental is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.
Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e., different feature views may have different periods that have been materialized into the online store.
Feast batch materialization operations (materialize and materialize-incremental) execute through a BatchMaterializationEngine.
Custom batch materialization engines allow Feast users to extend Feast to customize the materialization process. Examples include:
Setting up custom materialization-specific infrastructure during feast apply (e.g. setting up Spark clusters or Lambda Functions)
Launching custom batch ingestion (materialization) jobs (Spark, Beam, AWS Lambda)
Tearing down custom materialization-specific infrastructure during feast teardown (e.g. tearing down Spark clusters, or deleting Lambda Functions)
The fastest way to add custom logic to Feast is to extend an existing materialization engine. The most generic engine is the LocalMaterializationEngine, which contains no cloud-specific logic. The guide that follows will extend the LocalMaterializationEngine with operations that print text to the console. It is up to you as a developer to add your custom code to the engine methods, but the guide below will provide the necessary scaffolding to get you started.
The first step is to define a custom materialization engine class. We've created the MyCustomEngine below.
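A sketch of what such a class could look like (the import path may vary by Feast version, and the print statements stand in for your custom logic):

```python
from feast.infra.materialization.local_engine import LocalMaterializationEngine


class MyCustomEngine(LocalMaterializationEngine):
    """Sketch of a custom engine that wraps the default local behavior with extra steps."""

    def update(self, *args, **kwargs):
        # Set up any custom materialization infrastructure here
        # (e.g. provisioning a Spark cluster or a Lambda function).
        print("Launching custom infrastructure for materialization")
        return super().update(*args, **kwargs)

    def materialize(self, *args, **kwargs):
        # Launch your own batch job here, or delegate to the local engine as below.
        print("Launching custom batch materialization job")
        return super().materialize(*args, **kwargs)
```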
Notice how in the above engine we have only overwritten two of the methods on the LocalMaterializationEngine, namely update and materialize. These two methods are convenient to replace if you are planning to launch custom batch jobs.
The batch_engine field in feature_store.yaml points to the module and class where your engine can be found (e.g. a path of the form my_module.MyCustomEngine).
Now you should be able to use your engine by running a Feast command:
It may also be necessary to add the module root path to your PYTHONPATH.
That's it. You should now have a fully functional custom engine!
Feast comes with built-in materialization engines, e.g., the LocalMaterializationEngine, and an experimental LambdaMaterializationEngine. However, users can develop their own materialization engines by creating a class that implements the contract defined by the BatchMaterializationEngine base class.
Then configure your feature_store.yaml file to point to your new engine class.
This guide will go over:
how Feast tests are setup
how to extend the test suite to test new functionality
how to use the existing test suite to test a new custom offline / online store
Unit tests are contained in sdk/python/tests/unit. Integration tests are contained in sdk/python/tests/integration. Let's inspect the structure of sdk/python/tests/integration:
feature_repos has setup files for most tests in the test suite.
conftest.py (in the parent directory) contains the most common fixtures, which are designed as an abstraction on top of specific offline/online stores, so tests do not need to be rewritten for different stores. Individual test files also contain more specific fixtures.
The tests are organized by which Feast component(s) they test.
The universal feature repo refers to a set of fixtures (e.g. environment and universal_data_sources) that can be parametrized to cover various combinations of offline stores, online stores, and providers. This allows tests to run against all these various combinations without requiring excess code. The universal feature repo is constructed by fixtures in conftest.py with help from the various files in feature_repos.
Tests in Feast are split into integration and unit tests. If a test requires external resources (e.g. cloud resources on GCP or AWS), it is an integration test. If a test can be run purely locally (where locally includes Docker resources), it is a unit test.
Integration tests test non-local Feast behavior. For example, tests that require reading data from BigQuery or materializing data to DynamoDB are integration tests. Integration tests also tend to involve more complex Feast functionality.
Unit tests test local Feast behavior. For example, tests that only require registering feature views are unit tests. Unit tests tend to only involve simple Feast functionality.
E2E tests
E2E tests test end-to-end functionality of Feast over the various codepaths (initialize a feature store, apply, and materialize).
The main codepaths include:
basic e2e tests for offline stores: test_universal_e2e.py
go feature server: test_go_feature_server.py
python http server: test_python_feature_server.py
usage tracking: test_usage_e2e.py
data quality monitoring feature validation: test_validation.py
Offline and Online Store Tests
Offline and online store tests mainly test for the offline and online retrieval functionality.
The various specific functionalities that are tested include:
push API tests: test_push_features_to_offline_store.py, test_push_features_to_online_store.py, test_offline_write.py
historical retrieval tests: test_universal_historical_retrieval.py
online retrieval tests: test_universal_online.py
data quality monitoring feature logging tests: test_feature_logging.py
online store tests: test_universal_online.py
Registration Tests
The registration folder contains all of the registry tests and some universal cli tests. This includes:
CLI apply and materialize tests run against the universal test suite
Data type inference tests
Registry tests
Miscellaneous Tests
AWS Lambda Materialization Tests (currently do not work): test_lambda.py
Registry Diff Tests
These are tests for the infrastructure and registry diff functionality that Feast uses to determine whether changes to the registry or infrastructure are needed.
Local CLI Tests and Local Feast Tests
These tests exercise all of the CLI commands against the local file offline store.
Infrastructure Unit Tests
DynamoDB tests with dynamo mocked out
Repository configuration tests
Schema inference unit tests
Key serialization tests
Basic provider unit tests
Feature Store Validation Tests
These tests mainly contain class-level validation such as hashing tests, protobuf and class serialization, and error and warning handling.
Data source unit tests
Feature service unit tests
Feature service, feature view, and feature validation tests
Protobuf/json tests for Feast ValueTypes
Serialization tests
Type mapping
Feast types
Serialization tests due to this issue
Feast usage tracking unit tests
Docstring tests are primarily smoke tests to make sure imports and setup functions can be executed without errors.
Let's look at a sample test using the universal repo:
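The following is a condensed, illustrative sketch rather than a copy of an actual test; the helper imports (construct_universal_feature_views, driver, customer), the entity_df attribute, and the feature view name are assumptions based on the utilities in feature_repos and may differ between Feast versions:

```python
import pytest

from tests.integration.feature_repos.universal.entities import customer, driver
from tests.integration.feature_repos.universal.feature_views import (
    construct_universal_feature_views,
)


@pytest.mark.integration
@pytest.mark.universal_offline_stores
@pytest.mark.parametrize("full_feature_names", [True, False])
def test_historical_retrieval_sketch(
    environment, universal_data_sources, full_feature_names
):
    # environment wraps a fully configured FeatureStore for one
    # provider / offline store / online store combination.
    store = environment.feature_store

    # universal_data_sources provides the pre-defined driver/customer
    # entities, datasets, and data sources.
    entities, datasets, data_sources = universal_data_sources
    feature_views = construct_universal_feature_views(data_sources)

    store.apply([driver(), customer(), *feature_views.values()])

    # Retrieve historical features without worrying about which offline
    # store is actually backing the feature store.
    job = store.get_historical_features(
        entity_df=datasets.entity_df,  # assumed attribute on the datasets container
        features=["driver_stats:conv_rate"],  # assumes the universal driver view is named "driver_stats"
        full_feature_names=full_feature_names,
    )
    assert len(job.to_df()) > 0
```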
The key fixtures are the environment
and universal_data_sources
fixtures, which are defined in the feature_repos
directory and the conftest.py
file. This by default pulls in a standard dataset with driver and customer entities (that we have pre-defined), certain feature views, and feature values.
The environment
fixture sets up a feature store, parametrized by the provider and the online/offline store. It allows the test to query against that feature store without needing to worry about the underlying implementation or any setup that may be involved in creating instances of these datastores.
Each fixture creates a different integration test with its own IntegrationTestRepoConfig, which pytest uses to generate a unique test for each environment that needs to be tested.
Feast tests also use a variety of markers:
The @pytest.mark.integration
marker is used to designate integration tests which will cause the test to be run when you call make test-python-integration
.
The @pytest.mark.universal_offline_stores
marker will parametrize the test on all of the universal offline stores including file, redshift, bigquery and snowflake.
The full_feature_names
parametrization defines whether or not the test should reference features as their full feature name (fully qualified path) or just the feature name itself.
Use the same function signatures as an existing test (e.g. use environment
and universal_data_sources
as an argument) to include the relevant test fixtures.
If possible, expand an individual test instead of writing a new test, due to the cost of starting up offline / online stores.
Use the universal_offline_stores
and universal_online_store
markers to parametrize the test against different offline store and online store combinations. You can also designate specific online and offline stores to test by using the only
parameter on the marker.
Install Feast in editable mode with pip install -e
.
The core tests for offline / online store behavior are parametrized by the FULL_REPO_CONFIGS
variable defined in feature_repos/repo_configuration.py
. To overwrite this variable without modifying the Feast repo, create your own file that contains a FULL_REPO_CONFIGS
(which will require adding a new IntegrationTestRepoConfig
or two) and set the environment variable FULL_REPO_CONFIGS_MODULE
to point to that file. Then the core offline / online store tests can be run with make test-python-universal
.
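For example, a custom configuration module might look like the sketch below. The module name my_repo_configs is hypothetical, and the import path reflects the current layout of the test suite, which may shift between releases:

```python
# my_repo_configs.py
# Use it by setting: export FULL_REPO_CONFIGS_MODULE=my_repo_configs
from tests.integration.feature_repos.integration_test_repo_config import (
    IntegrationTestRepoConfig,
)

FULL_REPO_CONFIGS = [
    # Default local setup (file offline store + sqlite online store).
    IntegrationTestRepoConfig(),
    # Example: the same offline store paired with a Redis online store;
    # the connection string is a placeholder.
    IntegrationTestRepoConfig(
        online_store={"type": "redis", "connection_string": "localhost:6379,db=0"},
    ),
]
```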
See the custom offline store demo and the custom online store demo for examples.
Many problems arise when implementing your data store's type conversion to interface with Feast datatypes.
You will need to correctly update inference.py so that Feast can infer your data source schemas.
You also need to update type_map.py so that Feast knows how to convert your data store's types to Feast-recognized types in feast/types.py.
The most important functionality in Feast is historical and online retrieval. Most of the e2e and universal integration tests test this functionality in some way. Making sure this functionality works also indirectly asserts that reading from and writing to your data store works as intended.
Extend data_source_creator.py for your offline store (a sketch of a data source creator is shown after these steps).
In repo_configuration.py add a new IntegrationTestRepoConfig or two (depending on how many online stores you want to test).
Generally, you should only need to test against sqlite. However, if you need to test against a production online store, then you can also test against Redis or DynamoDB.
Run the full test suite with make test-python-integration.
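As a rough sketch, a data source creator for a hypothetical offline store could look like the following. The method names mirror the DataSourceCreator abstract class at the time of writing and may differ in your Feast version (there may also be additional abstract methods, e.g. for saved dataset destinations):

```python
import pandas as pd

from feast.data_source import DataSource
from tests.integration.feature_repos.universal.data_source_creator import (
    DataSourceCreator,
)


class MyOfflineStoreDataSourceCreator(DataSourceCreator):
    """Uploads test dataframes into a hypothetical store and returns DataSources."""

    def create_data_source(
        self,
        df: pd.DataFrame,
        destination_name: str,
        timestamp_field="ts",
        created_timestamp_column="created_ts",
        field_mapping=None,
        **kwargs,
    ) -> DataSource:
        # Upload `df` to your store under `destination_name`, then return the
        # matching DataSource subclass (e.g. a hypothetical MyOfflineStoreSource).
        raise NotImplementedError

    def create_offline_store_config(self):
        # Return the offline store config object Feast should use in the tests.
        raise NotImplementedError

    def teardown(self):
        # Clean up any tables or files created by create_data_source.
        raise NotImplementedError
```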
The contrib folder is for plugins that are officially maintained with community owners. Place the APIs in feast/infra/offline_stores/contrib/.
Extend data_source_creator.py
for your offline store and implement the required APIs.
In contrib_repo_configuration.py
add a new IntegrationTestRepoConfig
(depending on how many online stores you want to test).
Run the test suite on the contrib test suite with make test-python-contrib-universal
.
In repo_configuration.py add a new config that maps to a serialized version of the configuration you need in feature_store.yaml to set up the online store (a sketch follows these steps).
In repo_configuration.py, add new IntegrationTestRepoConfig entries for the online stores you want to test.
Run the full test suite with make test-python-integration.
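The sketch below illustrates both steps; all names are placeholders and simply mirror the shape of the existing Redis configuration in repo_configuration.py:

```python
# Additions to repo_configuration.py (illustrative names only).

# 1. A serialized version of the online store section of feature_store.yaml.
MY_ONLINE_STORE_CONFIG = {
    "type": "my_online_store",              # hypothetical online store type
    "connection_string": "localhost:1234",  # placeholder connection details
}

# 2. IntegrationTestRepoConfig entries that pair the new online store with the
#    offline stores you want to test it against; append these to FULL_REPO_CONFIGS.
MY_ONLINE_STORE_CONFIGS = [
    IntegrationTestRepoConfig(online_store=MY_ONLINE_STORE_CONFIG),
]
```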
Check test_universal_types.py
for an example of how to do this.
Install Redis on your computer. If you are a Mac user, you should be able to run brew install redis.
Running redis-server --help
and redis-cli --help
should show corresponding help menus.
Run ./infra/scripts/redis-cluster.sh start
then ./infra/scripts/redis-cluster.sh create
to start the Redis cluster locally. You should see output that looks like this:
You should be able to run the integration tests and have the Redis cluster tests pass.
If you would like to run your own Redis cluster, you can run the above commands with your own specified ports and connect to the newly configured cluster.
To stop the cluster, run ./infra/scripts/redis-cluster.sh stop
and then ./infra/scripts/redis-cluster.sh clean
.
All Feast operations execute through a provider: operations like materializing data from the offline store into the online store, updating infrastructure such as databases, launching streaming ingestion jobs, building training datasets, and reading features from the online store.
Custom providers allow Feast users to extend Feast to execute any custom logic. Examples include:
Launching custom streaming ingestion jobs (Spark, Beam)
Launching custom batch ingestion (materialization) jobs (Spark, Beam)
Adding custom validation to feature repositories during feast apply
Adding custom infrastructure setup logic which runs during feast apply
Extending Feast commands with in-house metrics, logging, or tracing
The fastest way to add custom logic to Feast is to extend an existing provider. The most generic provider is the LocalProvider
which contains no cloud-specific logic. The guide that follows will extend the LocalProvider
with operations that print text to the console. It is up to you as a developer to add your custom code to the provider methods, but the guide below will provide the necessary scaffolding to get you started.
The first step is to define a custom provider class. We've created the MyCustomProvider
below.
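A sketch of such a class is shown below. The exact method signatures vary across Feast versions, so treat this as scaffolding modeled on the custom provider demo rather than a drop-in implementation:

```python
from typing import Sequence

from feast.entity import Entity
from feast.feature_view import FeatureView
from feast.infra.local import LocalProvider


class MyCustomProvider(LocalProvider):
    def update_infra(
        self,
        project: str,
        tables_to_delete: Sequence[FeatureView],
        tables_to_keep: Sequence[FeatureView],
        entities_to_delete: Sequence[Entity],
        entities_to_keep: Sequence[Entity],
        partial: bool,
    ):
        # Reuse the local implementation, then run custom logic, e.g. launching
        # an idempotent streaming ingestion job.
        super().update_infra(
            project,
            tables_to_delete,
            tables_to_keep,
            entities_to_delete,
            entities_to_keep,
            partial,
        )
        print("Launching custom streaming jobs is pretty easy...")

    def materialize_single_feature_view(
        self, config, feature_view, start_date, end_date, registry, project, tqdm_builder
    ):
        # Reuse the local implementation, then run custom logic, e.g. launching
        # a batch ingestion (materialization) job.
        super().materialize_single_feature_view(
            config, feature_view, start_date, end_date, registry, project, tqdm_builder
        )
        print("Launching custom batch jobs is pretty easy...")
```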
Notice how in the above provider we have only overwritten two of the methods on the LocalProvider
, namely update_infra
and materialize_single_feature_view
. These two methods are convenient to replace if you are planning to launch custom batch or streaming jobs. update_infra
can be used for launching idempotent streaming jobs, and materialize_single_feature_view
can be used for launching batch ingestion jobs.
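To use the custom provider, point the provider field of feature_store.yaml at your class. The module path below (feast_custom_provider.custom_provider.MyCustomProvider) is an example and should match wherever you placed the class:

```yaml
project: repo
registry: registry.db
provider: feast_custom_provider.custom_provider.MyCustomProvider
online_store:
  type: sqlite
  path: online_store.db
offline_store:
  type: file
```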
Notice how the provider
field above points to the module and class where your provider can be found.
Now you should be able to use your provider by running a Feast command:
It may also be necessary to add the module root path to your PYTHONPATH
as follows:
That's it. You should now have a fully functional custom provider!
Redshift data sources are Redshift tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.
Using a table name:
Using a query:
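Illustrative definitions for both cases are shown below; the table, schema, and column names are placeholders, and older Feast releases use event_timestamp_column instead of timestamp_field:

```python
from feast import RedshiftSource

# Data source backed by an existing Redshift table.
driver_stats_from_table = RedshiftSource(
    name="driver_hourly_stats",
    table="driver_hourly_stats",
    schema="public",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Data source backed by a SQL query (no performance guarantees).
driver_stats_from_query = RedshiftSource(
    name="driver_hourly_stats_query",
    query="SELECT * FROM public.driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)
```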
Feast comes with built-in providers, e.g., LocalProvider, GcpProvider, and AwsProvider. However, users can develop their own providers by creating a class that implements the provider contract.
This guide also comes with a fully functional custom provider demo repository. Please have a look at the repository for a representative example of what a custom provider looks like, or fork the repository when creating your own provider.
It is possible to override all the methods on the provider class. In fact, it isn't even necessary to subclass an existing provider like LocalProvider. The only requirement for the provider class is that it follows the provider contract.
Configure your feature_store.yaml file to point to your new provider class:
Have a look at the custom provider demo repository for a fully functional example of a custom provider. Feel free to fork it when creating your own custom provider!
The full set of configuration options is available here.
Redshift data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see the functionality matrix.
Be careful about how Snowflake handles table and column name conventions. In particular, you can read more about quoted identifiers in the Snowflake documentation.
The full set of configuration options is available here.
Snowflake data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see the functionality matrix.
Spark data sources are tables or files that can be loaded from some Spark store (e.g. Hive or in-memory). They can also be specified by a SQL query.
The Spark data source does not achieve full test coverage. Please do not assume complete stability.
Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):
Using a query:
Using a file reference:
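Illustrative definitions for the three cases above are shown below; the import path follows the contrib Spark offline store module, and all names, paths, and the timestamp_field parameter are assumptions that may need adjusting to your Feast version:

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

# 1. A table or view registered in the SparkSession (in-memory or Hive Metastore).
driver_stats_from_table = SparkSource(
    name="driver_hourly_stats",
    table="driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# 2. A SQL query evaluated by Spark.
driver_stats_from_query = SparkSource(
    name="driver_hourly_stats_query",
    query="SELECT * FROM driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# 3. A file reference (path plus file format).
driver_stats_from_file = SparkSource(
    name="driver_hourly_stats_file",
    path="data/driver_hourly_stats.parquet",
    file_format="parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)
```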
The full set of configuration options is available here.
Spark data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.
The file offline store provides support for reading FileSources. It uses Dask as the compute engine.
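A minimal feature_store.yaml sketch using the file offline store (paths and project name are placeholders):

```yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: file
online_store:
  type: sqlite
  path: data/online_store.db
```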
The full set of configuration options is available in FileOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the file offline store.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: no
export to data lake (S3, GCS, etc.): no
export to data warehouse: no
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
The Redshift offline store provides support for reading RedshiftSources.
All joins happen within Redshift.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. Pandas dataframes will be uploaded to Redshift temporarily in order to complete join operations.
In order to use this offline store, you'll need to run pip install 'feast[aws]'. You can get started by then running feast init -t aws.
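A feature_store.yaml sketch for the Redshift offline store is shown below; every identifier (cluster, database, user, bucket, IAM role) is a placeholder, and the field names follow RedshiftOfflineStoreConfig:

```yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: feast-cluster
  database: feast-database
  user: redshift-user
  s3_staging_location: s3://feast-bucket/redshift
  iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role
online_store:
  type: dynamodb
  region: us-west-2
```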
The full set of configuration options is available in RedshiftOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redshift offline store.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
Below is a matrix indicating which functionality is supported by RedshiftRetrievalJob.
export to dataframe: yes
export to arrow table: yes
export to arrow batches: yes
export to SQL: yes
export to data lake (S3, GCS, etc.): no
export to data warehouse: yes
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
To compare this set of functionality against other offline stores, please see the full functionality matrix.
Feast requires the following permissions in order to execute commands for the Redshift offline store:
Command: Apply
Permissions: redshift-data:DescribeTable, redshift:GetClusterCredentials
Resources: arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>
Command: Materialize
Permissions: redshift-data:ExecuteStatement
Resources: arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>
Command: Materialize
Permissions: redshift-data:DescribeStatement
Resources: *
Command: Materialize
Permissions: s3:ListBucket, s3:GetObject, s3:DeleteObject
Resources: arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/*
Command: Get Historical Features
Permissions: redshift-data:ExecuteStatement, redshift:GetClusterCredentials
Resources: arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>
Command: Get Historical Features
Permissions: redshift-data:DescribeStatement
Resources: *
Command: Get Historical Features
Permissions: s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject
Resources: arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/*
The following inline policy can be used to grant Feast the necessary permissions:
In addition, the Redshift offline store requires an IAM role that will be used by Redshift itself to interact with S3. More concretely, Redshift has to use this IAM role to run UNLOAD and COPY commands. Once created, this IAM role needs to be configured in the feature_store.yaml file as offline_store: iam_role.
The following inline policy can be used to grant Redshift necessary permissions to access S3:
The following trust relationship is necessary to make sure that Redshift, and only Redshift, can assume this role:
The Trino offline store provides support for reading TrinoSources.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. Pandas dataframes will be uploaded to Trino as a table in order to complete join operations.
The Trino offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[trino]'. You can then run feast init, then swap out feature_store.yaml with the below example to connect to Trino.
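A sketch of such a feature_store.yaml is shown below; the host, port, catalog, and connector values are placeholders, and field names may differ slightly between versions of the Trino plugin:

```yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: trino
  host: localhost
  port: 8080
  catalog: memory
  connector:
    type: memory
online_store:
  type: sqlite
  path: data/online_store.db
```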
The full set of configuration options is available in TrinoOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Trino offline store.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no
Below is a matrix indicating which functionality is supported by TrinoRetrievalJob.
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): no
export to data warehouse: no
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: no
preview the query plan before execution: yes
read partitioned data: yes
To compare this set of functionality against other offline stores, please see the full functionality matrix.