Feast (Feature Store) is an open-source feature store that helps teams operate production ML systems at scale by allowing them to define, manage, validate, and serve features for production AI/ML.
Feast's feature store is composed of two foundational components: (1) an offline store for historical feature extraction used in model training and (2) an online store for serving features at low latency to production systems and applications.
Feast is a configurable operational data system that reuses existing infrastructure to manage and serve machine learning features to real-time models. For more details, please review our architecture.
Concretely, Feast provides:
A Python SDK for programmatically defining features, entities, sources, and (optionally) transformations
A Python SDK for reading and writing features to configured offline and online data stores
An optional feature server for reading and writing features (useful for non-Python languages)
A UI for viewing and exploring information about features defined in the project
A CLI tool for viewing and updating feature information
Feast allows ML platform teams to:
Make features consistently available for training and low-latency serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
Note: Feast today primarily addresses timestamped structured data.
Note: Feast uses a push model for online serving. This means that the feature store pushes feature values to the online store, which reduces the latency of feature retrieval. This is more efficient than a pull model, where the model serving system must make a request to the feature store to retrieve feature values. See this document for a more detailed discussion.
Feast helps ML platform/MLOps teams with DevOps experience productionize real-time models. Feast also helps these teams build a feature platform that improves collaboration between data engineers, software engineers, machine learning engineers, and data scientists.
Feast is likely not the right tool if you
are in an organization that’s just getting started with ML and is not yet sure what the business impact of ML is
Feast is not:
an ETL / ELT system: Feast is not a general-purpose data pipelining system. Users often leverage tools like dbt to manage upstream data transformations. Feast does support some transformations.
a data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and integrations with tools like Airflow to make features consistently available.
a data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
a database: Feast is not a database, but helps manage data stored in other systems (e.g. BigQuery, Snowflake, DynamoDB, Redis) to make features consistently available at training / serving time
Feast also does not fully solve:
batch feature engineering: Feast supports on demand and streaming transformations. Feast is also investing in supporting batch transformations.
native streaming feature integration: Feast enables users to push streaming features, but does not pull from streaming sources or manage streaming pipelines.
data quality / drift detection: Feast has experimental integrations with Great Expectations, but is not purpose built to solve data drift / data quality issues. This requires more sophisticated monitoring across data pipelines, served feature values, labels, and model versions.
Many companies have used Feast to power real-world ML use cases such as:
Personalizing online recommendations by leveraging pre-computed historical user or item features.
Online fraud detection, using features that compare against (pre-computed) historical transaction patterns
Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch
Credit scoring, using pre-computed historical features to compute probability of default
The best way to learn Feast is to use it. Head over to our Quickstart and try it out!
Explore the following resources to get started with Feast:
Quickstart is the fastest way to get started with Feast
Concepts describes all important Feast API concepts
Architecture describes Feast's overall architecture.
Tutorials shows full examples of using Feast in machine learning applications.
Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.
Reference contains detailed API and design documents.
Contributing contains resources for anyone who wants to contribute to Feast.
Use Python to serve your features.
Python has emerged as the primary language for machine learning, and this extends to feature serving. There are five main reasons Feast recommends using a microservice written in Python.
You should meet your users where they are. Python's popularity in the machine learning community is undeniable. Its simplicity and readability make it an ideal language for writing and understanding complex algorithms. Python boasts a rich ecosystem of libraries such as TensorFlow, PyTorch, XGBoost, and scikit-learn, which provide robust support for developing and deploying machine learning models, and we want Feast to be part of this ecosystem.
Precomputing features is the recommended optimal path to ensure low latency performance. Reducing feature serving to a lightweight database lookup is the ideal pattern, which means the marginal overhead of Python should be tolerable. Precomputation ensures product experiences for downstream services are also fast. Slow user experiences are bad user experiences. Precompute and persist data as much as you can.
Ensuring that features used during model training (offline serving) and online serving are available in production to make real-time predictions is critical. When features are initially developed, they are typically written in Python. This is due to the convenience and efficiency provided by Python's data manipulation libraries. However, in a production environment, there is often interest or pressure to rewrite these features in a different language, like Java, Go, or C++, for performance reasons. This reimplementation introduces a significant risk: training and serving skew. Note that there will always be some minor exceptions (e.g., any Time Since Last Event types of features) but this should not be the rule.
Training and serving skew occurs when there are discrepancies between the features used during model training and those used during prediction. This can lead to degraded model performance, unreliable predictions, and reduced velocity in releasing new features and new models. The process of rewriting features in another language is prone to errors and inconsistencies, which exacerbate this issue.
Rewriting features in another language is not only risky but also resource-intensive. It requires significant time and effort from engineers to ensure that the features are correctly translated. This process can introduce bugs and inconsistencies, further increasing the risk of training and serving skew. Additionally, maintaining two versions of the same feature codebase adds unnecessary complexity and overhead. More importantly, the opportunity cost of this work is high and requires twice the amount of resourcing. Reimplementing code should only be done when the performance gains are worth the investment. Features should largely be precomputed so the latency performance gains should not be the highest impact work that your team can accomplish.
Rather than switching languages, it is more efficient to optimize the performance of your feature store while keeping Python as the primary language. Optimization is a two-step process.
Use tools like cProfile to understand latency bottlenecks in your code. This will help you prioritize the biggest inefficiencies first. When we initially launched Python native transformations, profiling the code helped us identify that Pandas resulted in a 10x overhead due to type conversion.
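As an illustration, here is a minimal profiling sketch using Python's built-in cProfile; the repo path, feature reference, and entity values are hypothetical placeholders, not part of the official docs.

```python
import cProfile
import pstats

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # hypothetical local feature repo

def fetch_features():
    # Hypothetical feature reference and entity row used only for profiling
    return store.get_online_features(
        features=["driver_hourly_stats:conv_rate"],
        entity_rows=[{"driver_id": 1001}],
    ).to_dict()

# Profile the retrieval call and print the slowest functions by cumulative time
cProfile.run("fetch_features()", "feast_profile.stats")
stats = pstats.Stats("feast_profile.stats")
stats.sort_stats("cumulative").print_stats(20)
```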
As mentioned, precomputation is the recommended path. In some cases, you may want fully synchronous writes from your data producer to your online feature store, in which case you will want your feature computations and writes to be very fast. In this case, we recommend optimizing the feature calculation code first.
You should optimize your code using libraries, tools, and caching. For example, identify whether your feature calculations can be optimized through vectorized calculations in NumPy; explore tools like Numba for faster execution; and cache frequently accessed data using tools like functools.lru_cache.
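For example, here is a small sketch (with entirely hypothetical feature logic) showing a vectorized NumPy calculation combined with an lru_cache on a frequently accessed lookup:

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=1024)
def category_weight(category: str) -> float:
    # Hypothetical: an expensive lookup whose result rarely changes, so cache it
    return {"electronics": 1.3, "groceries": 0.8}.get(category, 1.0)

def normalized_spend(amounts: np.ndarray, category: str) -> np.ndarray:
    # Vectorized calculation instead of a Python loop over individual rows
    weight = category_weight(category)
    return (amounts - amounts.mean()) / (amounts.std() + 1e-9) * weight

print(normalized_spend(np.array([10.0, 25.0, 40.0]), "electronics"))
```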
Lastly, Feast will continue to optimize serving in Python and make the overall infrastructure more performant. This will better serve the community.
So we recommend focusing on optimizing your feature-specific code, reporting latency bottlenecks to the maintainers, and contributing to help the infrastructure be more performant.
By keeping features in Python and optimizing performance, you can ensure consistency between training and serving, reduce the risk of errors, and focus on launching more product experiences for your customers.
Embrace Python for feature serving, and leverage its strengths to build robust and reliable machine learning systems.
Feast uses a Push Model to serve features in real time: Data Producers push data to the feature store, and Feast stores the feature values in the online store.
In a Pull Model, Feast would pull data from the data producers at request time and store the feature values in the online store before serving them (storing them would actually be unnecessary). This approach would incur additional network latency as Feast would need to orchestrate a request to each data producer, which would mean the latency would be at least as long as your slowest call. So, in order to serve features as fast as possible, we push data to Feast and store the feature values in the online store.
The trade-off with the Push Model is that strong consistency is not guaranteed out of the box. Instead, strong consistency has to be explicitly designed for in orchestrating the updates to Feast and the client usage.
The significant advantage with this approach is that Feast is read-optimized for low-latency feature retrieval.
Implicit in the Push model are decisions about how and when to push feature values to the online store.
From a developer's perspective, there are three ways to push feature values to the online store with different tradeoffs.
They are discussed further in the Write Patterns section.
Feast (Feature Store) is an open-source feature store designed to facilitate the management and serving of machine learning features in a way that supports both batch and real-time applications.
For Data Scientists: Feast is a tool where you can easily define, store, and retrieve your features for both model development and model deployment. By using Feast, you can focus on what you do best: building features that power your AI/ML models and maximizing the value of your data.
For MLOps Engineers: Feast is a library that allows you to connect your existing infrastructure (e.g., online database, application server, microservice, analytical database, and orchestration tooling) that enables your Data Scientists to ship features for their models to production using a friendly SDK without having to be concerned with software engineering challenges that occur from serving real-time production systems. By using Feast, you can focus on maintaining a resilient system, instead of implementing features for Data Scientists.
For Data Engineers: Feast provides a centralized catalog for storing feature definitions allowing one to maintain a single source of truth for feature data. It provides the abstraction for reading and writing to many different types of offline and online data stores. Using either the provided python SDK or the feature server service, users can write data to the online and/or offline stores and then read that data out again in either low-latency online scenarios for model inference, or in batch scenarios for model training.
For more info, refer to the Introduction to Feast.
Ensure that you have Python (3.9 or above) installed.
It is recommended to create and work in a virtual environment (for example, python -m venv .venv followed by source .venv/bin/activate).
In this tutorial we will:
Deploy a local feature store with a Parquet file offline store and Sqlite online store.
Build a training dataset using our time series features from our Parquet files.
Ingest batch features ("materialization") and streaming features (via a Push API) into the online store.
Read the latest features from the offline store for batch scoring
Read the latest features from the online store for real-time inference.
Explore the (experimental) Feast UI
Note - Feast provides a python SDK as well as an optional hosted service for reading and writing feature data to the online and offline data stores. The latter might be useful when non-python languages are required.
For this tutorial, we will be using the python SDK.
In this tutorial, we'll use Feast to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:
Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.
Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.
Feature and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).
(Experimental) Feast enables light-weight feature transformations so users can re-use transformation logic across online / offline use cases and across models.
Install the Feast SDK and CLI using pip: pip install feast
In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see Running Feast with Snowflake/GCP/AWS
Bootstrap a new feature repository from the command line using feast init (for example, feast init my_project).
Let's take a look at the resulting demo repo itself. It breaks down into:
data/ contains raw demo parquet data
example_repo.py contains demo feature definitions
feature_store.yaml contains a demo setup configuring where data sources are
test_workflow.py showcases how to run all key Feast commands, including defining, retrieving, and pushing features. You can run this with python test_workflow.py.
The feature_store.yaml file configures the key overall architecture of the feature store.
The provider value sets default offline and online stores.
The offline store provides the compute layer to process historical data (for generating training data & feature values for serving).
The online store is a low latency store of the latest feature values (for powering real-time inference).
Valid values for provider in feature_store.yaml are:
local: use a SQL registry or local file registry. By default, use a file / Dask based offline store + SQLite online store
gcp: use a SQL registry or GCS file registry. By default, use BigQuery (offline store) + Google Cloud Datastore (online store)
aws: use a SQL registry or S3 file registry. By default, use Redshift (offline store) + DynamoDB (online store)
Note that there are many other offline / online stores Feast works with, including Spark, Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported data sources.
A custom setup can also be made by following Customizing Feast.
The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.
There's an included test_workflow.py file which runs through a full sample workflow:
Register feature definitions through feast apply
Generate a training dataset (using get_historical_features)
Generate features for batch scoring (using get_historical_features)
Ingest batch features into an online store (using materialize_incremental)
Fetch online features to power real-time inference (using get_online_features)
Ingest streaming features into offline / online stores (using push)
Verify online features are updated / fresher
We'll walk through some snippets of code below and explain each step.
The apply command scans Python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example_repo.py and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by configuring online_store in feature_store.yaml.
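For reference, here is a minimal sketch of what the demo feature definitions in example_repo.py typically look like; the entity, source path, and field names follow the standard driver-stats demo and may differ slightly in your generated repo.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the object our features describe
driver = Entity(name="driver", join_keys=["driver_id"])

# Batch source: the demo parquet file under data/
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Feature view: a logical group of time-series features from that source
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)
```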
To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.
Feast needs a list of entities (e.g. driver ids) and timestamps. Feast will intelligently join relevant tables to create the relevant feature vectors. There are two ways to generate this list:
The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation.
The user can also query that table with a SQL query which pulls entities. See the documentation on feature retrieval for details
Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
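For example, a training dataframe can be generated roughly as follows; the driver ids and timestamps here are illustrative.

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe: entity keys + event timestamps (often joined with labels)
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()
print(training_df.head())
```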
To power a batch model, we primarily need to generate features with the get_historical_features call, but using the current timestamp.
We now serialize the latest values of features since the beginning of time to prepare for serving. Note, materialize_incremental serializes all new features since the last materialize call, or since the time provided minus the ttl timedelta. In this case, this will be CURRENT_TIME - 1 day (ttl was set on the FeatureView instances in feature_repo/feature_repo/example_repo.py).
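A sketch of this step from the Python SDK (the end date below simply uses the current time):

```python
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Serialize the latest feature values up to now into the online store
store.materialize_incremental(end_date=datetime.utcnow())
```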
At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.
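A minimal sketch of online retrieval (the driver ids are illustrative):

```python
from pprint import pprint

from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
).to_dict()

pprint(feature_vector)
```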
You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below. More information can be found here.
The driver_activity_v1 feature service pulls all features from the driver_hourly_stats feature view:
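A sketch of such a feature service definition, assuming the driver_hourly_stats feature view defined in example_repo.py:

```python
from feast import FeatureService

# Group the features a model version needs behind a single named service
driver_activity_v1 = FeatureService(
    name="driver_activity_v1",
    features=[driver_hourly_stats],  # the feature view defined in example_repo.py
)
```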
View all registered features, data sources, entities, and feature services with the Web UI.
One of the ways to view this is with the feast ui command.
test_workflow.py
Take a look at test_workflow.py again. It showcases many sample flows on how to interact with Feast. You'll see these show up in the upcoming concepts + architecture + tutorial pages as well.
Read the Concepts page to understand the Feast data model.
Read the Architecture page.
Check out our Tutorials section for more examples on how to use Feast.
Follow our Running Feast with Snowflake/GCP/AWS guide for a more in-depth tutorial on using Feast.
Role-Based Access Control (RBAC) is a security mechanism that restricts access to resources based on the roles of individual users within an organization. In the context of Feast, RBAC ensures that only authorized users or groups can access or modify specific resources, thereby maintaining data security and operational integrity.
The RBAC implementation in Feast is designed to:
Assign Permissions: Allow administrators to assign permissions for various operations and resources to users or groups based on their roles.
Seamless Integration: Integrate smoothly with existing business code without requiring significant modifications.
Backward Compatibility: Maintain support for non-authorized models as the default to ensure backward compatibility.
The primary business goals of implementing RBAC in Feast are:
Feature Sharing: Enable multiple teams to share the feature store while ensuring controlled access. This allows for collaborative work without compromising data security.
Access Control Management: Prevent unauthorized access to team-specific resources and spaces, governing the operations that each user or group can perform.
Feast operates as a collection of connected services, each enforcing authorization permissions. The architecture is designed as a distributed microservices system with the following key components:
Service Endpoints: These enforce authorization permissions, ensuring that only authorized requests are processed.
Client Integration: Clients authenticate with feature servers by attaching an authorization token to each request.
Service-to-Service Communication: This is always granted.
The RBAC system in Feast uses a permission model that defines the following concepts:
Resource: An object within Feast that needs to be secured against unauthorized access.
Action: A logical operation performed on a resource, such as Create, Describe, Update, Delete, Read, or write operations.
Policy: A set of rules that enforce authorization decisions on resources. The default implementation uses role-based policies.
The authorization architecture in Feast is built with the following components:
Token Extractor: Extracts the authorization token from the request header.
Token Parser: Parses the token to retrieve user details.
Policy Enforcer: Validates the secured endpoint against the retrieved user details.
Token Injector: Adds the authorization token to each secured request header.
Note: this ML Infrastructure diagram highlights an orchestration pattern that is driven by a client application. This is not the only approach that can be taken and different patterns will result in different trade-offs.
Production machine learning systems can choose from four approaches to serving machine learning predictions (the output of model inference):
Online model inference with online features
Offline model inference without online features
Online model inference with online features and cached predictions
Online model inference without features
Note: online features can be sourced from batch, streaming, or request data sources.
These approaches have different tradeoffs but, in general, have significant implementation differences.
Online model inference with online features is a powerful approach to serving data-driven machine learning applications. This requires a feature store to serve online features and a model server to serve model predictions (e.g., KServe). This approach is particularly useful for applications where request-time data is required to run inference.
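A sketch of this pattern, combining Feast online retrieval with a request to a model server; the endpoint URL, feature names, and entity values are hypothetical placeholders.

```python
import requests
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# 1. Fetch precomputed online features for the entity in the request
features = store.get_online_features(
    features=["user_txn_stats:txn_count_7d", "user_txn_stats:avg_txn_amount"],
    entity_rows=[{"user_id": 123}],
).to_dict()

# 2. Call the model server (e.g., a KServe-style predict endpoint) with the feature vector
response = requests.post(
    "http://model-server/v1/models/fraud:predict",  # hypothetical endpoint
    json={"instances": [features]},
    timeout=1.0,
)
prediction = response.json()
```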
Typically, Machine Learning teams find serving precomputed model predictions to be the most straightforward to implement. This approach simply treats the model predictions as a feature and serves them from the feature store using the standard Feast sdk. These model predictions are typically generated through some batch process where the model scores are precomputed. As a concrete example, the batch process can be as simple as a script that runs model inference locally for a set of users that can output a CSV. This output file could be used for materialization so that the model could be served online as shown in the code below.
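A sketch of what serving precomputed predictions as features might look like, assuming a hypothetical feature view named user_churn_predictions materialized from the batch output file:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# The "feature" here is simply the model score precomputed by the batch job
scores = store.get_online_features(
    features=["user_churn_predictions:churn_score"],  # hypothetical feature view
    entity_rows=[{"user_id": 123}],
).to_dict()

churn_score = scores["churn_score"][0]
```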
Notice that the model server is not involved in this approach. Instead, the model predictions are precomputed and materialized to the online store.
While this approach can lead to quick impact for different business use cases, it suffers from stale data as well as only serving users/entities that were available at the time of the batch computation. In some cases, this tradeoff may be tolerable.
This approach is the most sophisticated where inference is optimized for low-latency by caching predictions and running model inference when data producers write features to the online store. This approach is particularly useful for applications where features are coming from multiple data sources, the model is computationally expensive to run, or latency is a significant constraint.
Note that in this case a separate call to write_to_online_store is required when the underlying data changes and predictions change along with it.
While this requires additional writes for every data producer, this approach will result in the lowest latency for model inference.
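A sketch of such a write, assuming the same hypothetical user_churn_predictions feature view:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# When upstream data changes, recompute the prediction and write it synchronously
store.write_to_online_store(
    feature_view_name="user_churn_predictions",  # hypothetical feature view
    df=pd.DataFrame(
        {
            "user_id": [123],
            "churn_score": [0.87],
            "event_timestamp": [pd.Timestamp.utcnow()],
        }
    ),
)
```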
This approach does not require Feast. The model server can directly serve predictions without any features. This approach is common in Large Language Models (LLMs) and other models that do not require features to make predictions.
Note that generative models using Retrieval Augmented Generation (RAG) do require features where the document embeddings are treated as features, which Feast supports (this would fall under "Online Model Inference with Online Features").
Implicit in the code examples above is a design choice about how clients orchestrate calls to get features and run model inference. The examples had a Feast-centric pattern because they are inputs to the model, so the sequencing is fairly obvious. An alternative approach can be Inference-centric where a client would call an inference endpoint and the inference service would be responsible for orchestration.
As a part of the Linux Foundation, we ask community members to adhere to the Linux Foundation's Code of Conduct.
GitHub repository: Find the complete Feast codebase on GitHub.
Governance: See the governance model of Feast, including who the maintainers are and how decisions are made.
Community folder: This folder is used as a central repository for all Feast resources. For example:
Design proposals in the form of Request for Comments (RFC).
User surveys and meeting minutes.
Slide decks of conferences our contributors have spoken at.
LFAI wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.
GitHub Issues: Found a bug or need a feature? Open an issue on GitHub.
A feature transformation is a function that takes some set of input data and returns some set of output data. Feature transformations can happen on either raw data or derived data.
Feature transformations can be executed by three types of "transformation engines":
The Feast Feature Server
An Offline Store (e.g., Snowflake, BigQuery, DuckDB, Spark, etc.)
A Stream processor (e.g., Flink or Spark Streaming)
The three transformation engines are coupled with the communication pattern used for writes.
Importantly, this implies that different feature transformation code may be used under different transformation engines, so understanding the tradeoffs of when to use which transformation engine/communication pattern is extremely critical to the success of your implementation.
In general, we recommend transformation engines and network calls to be chosen by aligning it with what is most appropriate for the data producer, feature/model usage, and overall product.
Feast's architecture is designed to be flexible and scalable. It is composed of several components that work together to provide a feature store that can be used to serve features for training and inference.
Feast uses a push model to ingest data from different sources and store feature values in the online store. This allows Feast to serve features in real-time with low latency.
Feast supports feature transformations for On Demand and Streaming data sources and will support Batch transformations in the future. For Streaming and Batch data sources, Feast requires a separate transformation engine (in the batch case, this is typically your Offline Store). We are exploring adding a default streaming engine to Feast.
Domain expertise is recommended when integrating a data source with Feast, in order to understand the tradeoffs of different write patterns for your application.
We recommend using Python for your Feature Store microservice. As mentioned above, precomputing features is the recommended optimal path to ensure low-latency performance. Reducing feature serving to a lightweight database lookup is the ideal pattern, which means the marginal overhead of Python should be tolerable. Because of this, we believe the pros of Python outweigh the costs, as reimplementing feature logic is undesirable. Java and Go clients are also available for online feature retrieval.
Role-Based Access Control (RBAC) is a security mechanism that restricts access to resources based on the roles of individual users within an organization. In the context of Feast, RBAC ensures that only authorized users or groups can access or modify specific resources, thereby maintaining data security and operational integrity.
Feast uses a Push Model to push features to the online store.
This has two important consequences: (1) communication patterns between the Data Producer (i.e., the client) and Feast (i.e., the server) and (2) feature computation and feature value write patterns to Feast's online store.
Data Producers (i.e., services that generate data) send data to Feast so that Feast can write feature values to the online store. That data can be either raw data where Feast computes and stores the feature values or precomputed feature values.
There are two ways a client (or Data Producer) can send data to the online store:
Synchronously
Using a synchronous API call for a small number of entities or a single entity (e.g., using the SDK's push or write_to_online_store methods, or the Feature Server's push endpoint)
Asynchronously
Using an asynchronous API call for a small number of entities or a single entity (e.g., using the SDK's push or write_to_online_store methods, or the Feature Server's push endpoint, in asynchronous mode)
Using a "batch job" for a large number of entities (e.g., using a batch materialization job)
Note, in some contexts, developers may "batch" a group of entities together and write them to the online store in a single API call. This is a common pattern when writing data to the online store to reduce write loads but we would not qualify this as a batch job.
Writing feature values to the online store (i.e., the server) can be done in two ways: Precomputing the transformations client-side or Computing the transformations On Demand server-side.
In some scenarios, a combination of Precomputed and On Demand transformations may be optimal.
When selecting feature value write patterns, one must consider the specific requirements of your application, the acceptable correctness of the data, the latency tolerance, and the computational resources available. Making deliberate choices can help the performance and reliability of your service.
There are two ways the client can write feature values to the online store:
Precomputing transformations
Computing transformations On Demand
Hybrid (Precomputed + On Demand)
Precomputed transformations can happen outside of Feast (e.g., via some batch job or streaming application) or inside of the Feast feature server when writing to the online store via the push or write-to-online-store API.
On Demand transformations can only happen inside of Feast at either (1) the time of the client's request or (2) when the data producer writes to the online store.
The hybrid approach allows for precomputed transformations to happen inside or outside of Feast and have the On Demand transformations happen at client request time. This is particularly convenient for "Time Since Last" types of features (e.g., time since purchase).
When deciding between synchronous and asynchronous data writes, several tradeoffs should be considered:
Data Consistency: Asynchronous writes allow Data Producers to send data without waiting for the write operation to complete, which can lead to situations where the data in the online store is stale. This might be acceptable in scenarios where absolute freshness is not critical. However, for critical operations, such as calculating loan amounts in financial applications, stale data can lead to incorrect decisions, making synchronous writes essential.
Correctness: The risk of data being out-of-date must be weighed against the operational requirements. For instance, in a lending application, having up-to-date feature data can be crucial for correctness (depending upon the features and raw data), thus favoring synchronous writes. In less sensitive contexts, the eventual consistency offered by asynchronous writes might be sufficient.
Service Coupling: Synchronous writes result in tighter coupling between services. If a write operation fails, it can cause the dependent service operation to fail as well, which might be a significant drawback in systems requiring high reliability and independence between services.
Application Latency: Asynchronous writes typically reduce the perceived latency from the client's perspective because the client does not wait for the write operation to complete. This can enhance the user experience and efficiency in environments where operations are not critically dependent on immediate data freshness.
The table below can help guide the most appropriate data write and feature computation strategies based on specific application needs and data sensitivity.
| Data Write Type | Feature Computation | Scenario | Recommended Approach |
| --- | --- | --- | --- |
| Asynchronous | On Demand | Data-intensive applications tolerant to staleness | Opt for asynchronous writes with on-demand computation to balance load and manage resource usage efficiently. |
| Asynchronous | Precomputed | High volume, non-critical data processing | Use asynchronous batch jobs with precomputed transformations for efficiency and scalability. |
| Synchronous | On Demand | High-stakes decision making | Use synchronous writes with on-demand feature computation to ensure data freshness and correctness. |
| Synchronous | Precomputed | User-facing applications requiring quick feedback | Use synchronous writes with precomputed features to reduce latency and improve user experience. |
| Synchronous | Hybrid (Precomputed + On Demand) | High-stakes decision making that wants to optimize for latency under constraints | Use synchronous writes with precomputed features where possible and a select set of on-demand computations to reduce latency and improve user experience. |
Note: Feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps.
A feature view is defined as a collection of features.
In the online setting, this is a stateful collection of features that are read when the get_online_features method is called.
In the offline setting, this is a stateless collection of features that are created when the get_historical_features method is called.
A feature view is an object representing a logical group of time-series feature data as it is found in a data source. Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see [Beta] On demand feature views).
Feature views consist of:
zero or more entities
If the features are not related to a specific object, the feature view might not have entities; see feature views without entities below.
a name to uniquely identify this feature view in the project.
(optional, but recommended) a schema specifying one or more features (without this, Feast will infer the schema by reading from the data source)
(optional, but recommended) metadata (for example, description, or other free-form metadata via tags)
(optional) a TTL, which limits how far back Feast will look when generating historical datasets
Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.
Feature views are used during
The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from stream sources.
Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.
If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only timestamps are needed for this feature view).
If the schema parameter is not specified in the creation of the feature view, Feast will infer the features during feast apply by creating a Field for each column in the underlying data source, except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
"Entity aliases" can be specified to join entity_dataframe columns that do not match the column names in the source table of a FeatureView.
This could be used if a user has no control over these column names or if multiple entities are sub-classes of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.
It is suggested that you dynamically specify the new FeatureView name using .with_name and the join_key_map override using .with_join_key_map instead of needing to register each new copy.
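A sketch of this pattern, assuming a location entity and a location_stats feature view (the names, source path, and schema are illustrative):

```python
from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32

location = Entity(name="location", join_keys=["location_id"])

location_stats = FeatureView(
    name="location_stats",
    entities=[location],
    ttl=timedelta(days=1),
    schema=[Field(name="temperature", dtype=Float32)],
    source=FileSource(
        path="data/location_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)

# Reuse the same feature view under two aliases instead of registering two copies
trip_weather = FeatureService(
    name="trip_weather",
    features=[
        location_stats.with_name("origin_stats").with_join_key_map({"location_id": "origin_id"}),
        location_stats.with_name("destination_stats").with_join_key_map({"location_id": "destination_id"}),
    ],
)
```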
A field or feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made on an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month. Supported types for fields in Feast can be found in sdk/python/feast/types.py.
Fields are defined as part of feature views. Since Feast does not transform data, a field is essentially a schema that only contains a name and a type:
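For example, a field might be declared roughly as follows (the field name is illustrative):

```python
from feast import Field
from feast.types import Float32

# A field is just a name plus a type; it is attached to a feature view's schema
trips_today = Field(name="trips_today", dtype=Float32)
```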
Together with data sources, they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using feature references.
Feature names must be unique within a feature view.
Each field can have additional metadata associated with it, specified as key-value tags.
On demand feature views allow data scientists to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths.
Currently, these transformations are executed locally. This is fine for online serving, but does not scale well to offline retrieval.
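A minimal sketch of an on demand feature view driven only by request data; the field names are illustrative, and in practice the sources often also include existing feature views.

```python
import pandas as pd
from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# Request-time input: a value only known when the prediction request arrives
input_request = RequestSource(
    name="vals_to_add",
    schema=[Field(name="val_to_add", dtype=Int64)],
)

@on_demand_feature_view(
    sources=[input_request],
    schema=[Field(name="val_to_add_doubled", dtype=Float64)],
)
def transformed_vals(inputs: pd.DataFrame) -> pd.DataFrame:
    # The same Pandas logic runs in both historical and online retrieval paths
    df = pd.DataFrame()
    df["val_to_add_doubled"] = inputs["val_to_add"] * 2.0
    return df
```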
This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could:
Call get_historical_features to generate a training dataframe
Iterate in a notebook on feature engineering in Pandas
Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository
Verify with get_historical_features (on a small dataset) that the transformation gives expected output over historical data
Verify with get_online_features on the dev branch that the transformation correctly outputs online features
Submit a pull request to the staging / prod branches which impact production traffic
A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.
Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:
See here for an example of how to use stream feature views to register your own streaming data pipelines in Feast.
Generally, Feast supports several patterns of feature retrieval:
Training data generation (via feature_store.get_historical_features(...))
Offline feature retrieval for batch scoring (via feature_store.get_historical_features(...))
Online feature retrieval for real-time model predictions
via the SDK: feature_store.get_online_features(...)
via deployed feature server endpoints: requests.post('http://localhost:6566/get-online-features', data=json.dumps(online_request))
Each of these retrieval mechanisms accept:
some way of specifying entities (to fetch features for)
some way to specify the features to fetch (either via feature services, which group features needed for a model version, or feature references)
Before beginning, you need to instantiate a local FeatureStore object that knows how to parse the registry (see more details).
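A minimal sketch, assuming your feature_store.yaml lives in the current directory:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")
```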
For code examples of how the below work, inspect the generated repository from feast init -t [YOUR TEMPLATE] (gcp, snowflake, and aws are the most fully fleshed out).
Before diving into how to retrieve features, we need to understand some high level concepts in Feast.
A feature service is an object that represents a logical group of features from one or more feature views. Feature Services allows features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.
Feature services are used during
The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now())
Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.
Applying a feature service does not result in an actual service being deployed.
Feature services enable referencing all or some features from a feature view.
Retrieving from the online store with a feature service
Retrieving from the offline store with a feature service
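A sketch of both retrieval paths with a feature service; the service name, entity values, and timestamps are illustrative.

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")
feature_service = store.get_feature_service("driver_activity_v1")

# Online retrieval with a feature service
online_features = store.get_online_features(
    features=feature_service,
    entity_rows=[{"driver_id": 1001}],
).to_dict()

# Offline (historical) retrieval with the same feature service
entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)]}
)
historical_df = store.get_historical_features(
    entity_df=entity_df,
    features=feature_service,
).to_df()
```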
Retrieving features via individual feature references (described below) is only recommended while you're experimenting. Once you want to launch experiments or serve models, feature services are recommended.
Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>
Feature references are used for the retrieval of features from Feast:
It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.
Note, if you're using feature views without entities, then those features can be added here without additional entity values in the entity_rows parameter.
The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.
Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.
A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.
Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.
Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.
Feast abstracts away point-in-time join complexities with the get_historical_features API.
We go through the major steps, and also show example code. Note that the quickstart templates generally have end-to-end working examples for all these cases.
Feast accepts either:
feature services, which group features needed for a model version
a list of individual feature references
Feast accepts either a Pandas dataframe as the entity dataframe (including entity keys and timestamps) or a SQL query to generate the entities.
Both approaches must specify the full entity key needed as well as the timestamps. Feast then joins features onto this dataframe.
You can also pass a SQL string to generate the above dataframe. This is useful for getting all entities in a timeframe from some data source.
Feast will ensure the latest feature values for registered features are available. At retrieval time, you need to supply a list of entities and the corresponding features to be retrieved. Similar to get_historical_features, we recommend using feature services as a mechanism for grouping features in a model version.
Note: unlike get_historical_features, the entity_rows do not need timestamps since you only want one feature value per entity key.
There are several options for retrieving online features: through the SDK, or through a feature server
The Feast permissions model allows you to configure granular permission policies for all the resources defined in a feature store.
The configured permissions are stored in the Feast registry and accessible through the CLI and the registry APIs.
Permission authorization enforcement is performed when requests are executed through one of the Feast (Python) servers:
The online feature server (REST)
The offline feature server (Arrow Flight)
The registry server (gRPC)
Note that there is no permission enforcement when accessing the Feast API with a local provider.
The permission model is based on the following components:
A resource is a Feast object that we want to secure against unauthorized access. We assume that the resource has a name attribute and an optional dictionary of associated key-value tags.
An action is a logical operation executed on the secured resource, like:
create: Create an instance.
describe: Access the instance state.
update: Update the instance state.
delete: Delete an instance.
read: Read both online and offline stores.
read_online: Read the online store.
read_offline: Read the offline store.
write: Write to any store.
write_online: Write to the online store.
write_offline: Write to the offline store.
A policy identifies the rule for enforcing authorization decisions on secured resources, based on the current user.
A default implementation is provided for role-based policies, using the user roles to grant or deny access to the requested actions on the secured resources.
The Permission class identifies a single permission configured on the feature store and is identified by these attributes:
name: The permission name.
types: The list of protected resource types. Defaults to all managed types (the ALL_RESOURCE_TYPES alias). All sub-classes are included in the resource match.
name_patterns: A list of regex patterns to match resource names. If any regex matches, the Permission policy is applied. Defaults to [], meaning no name filtering is applied.
required_tags: Dictionary of key-value pairs that must match the resource tags. Defaults to None, meaning that no tag filtering is applied.
actions: The actions authorized by this permission. Defaults to ALL_VALUES, an alias defined in the action module.
policy: The policy to be applied to validate a client request.
To simplify configuration, several constants are defined to streamline the permissions setup:
In module feast.feast_object:
ALL_RESOURCE_TYPES is the list of all the FeastObject types.
ALL_FEATURE_VIEW_TYPES is the list of all the feature view types, including those not inheriting from the FeatureView type, like OnDemandFeatureView.
In module feast.permissions.action:
ALL_ACTIONS is the list of all managed actions.
READ includes all the read actions for the online and offline stores.
WRITE includes all the write actions for the online and offline stores.
CRUD includes all the state management actions to create, describe, update, or delete a Feast resource.
Given the above definitions, the feature store can be configured with granular control over each resource, enabling partitioned access by teams to meet organizational requirements for service and data sharing, and protection of sensitive information.
The feast CLI includes a new permissions command to list the registered permissions, with options to identify the matching resources for each configured permission and the existing resources that are not covered by any permission.
Note: Feast resources that do not match any of the configured permissions are not secured by any authorization policy, meaning any user can execute any action on such resources.
This permission definition grants access to the resource state and the ability to read all of the stores for any feature view or feature service to all users with the role super-reader:
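A sketch of such a definition, assuming the classes and constants from the feast.permissions modules described above:

```python
from feast import FeatureService, FeatureView
from feast.permissions.action import READ, AuthzedAction
from feast.permissions.permission import Permission
from feast.permissions.policy import RoleBasedPolicy

feature_reader = Permission(
    name="feature-reader",
    types=[FeatureView, FeatureService],
    policy=RoleBasedPolicy(roles=["super-reader"]),
    actions=[AuthzedAction.DESCRIBE, *READ],  # describe + read online/offline
)
```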
This example grants permission to write on all the data sources with risk_level tag set to high only to users with role admin or data_team:
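A sketch under the same assumptions:

```python
from feast.data_source import DataSource
from feast.permissions.action import WRITE
from feast.permissions.permission import Permission
from feast.permissions.policy import RoleBasedPolicy

risky_source_writer = Permission(
    name="risky-source-writer",
    types=[DataSource],
    required_tags={"risk_level": "high"},
    policy=RoleBasedPolicy(roles=["admin", "data_team"]),
    actions=[*WRITE],  # write to both online and offline stores
)
```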
Note: When using multiple roles in a role-based policy, the user must be granted at least one of the specified roles.
The following permission grants authorization to read the offline store of all the feature views including risky in the name, to users with role trusted:
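A sketch under the same assumptions:

```python
from feast import FeatureView
from feast.permissions.action import AuthzedAction
from feast.permissions.permission import Permission
from feast.permissions.policy import RoleBasedPolicy

risky_offline_reader = Permission(
    name="risky-offline-reader",
    types=[FeatureView],
    name_patterns=[".*risky.*"],
    policy=RoleBasedPolicy(roles=["trusted"]),
    actions=[AuthzedAction.READ_OFFLINE],
)
```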
In order to leverage the permission functionality, the auth section is needed in the feature_store.yaml configuration. Currently, Feast supports OIDC and Kubernetes RBAC authorization protocols.
The default configuration, if you don't specify the auth configuration section, is no_auth, indicating that no permission enforcement is applied.
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).
Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.
The concept of a "project" provides the following benefits:
Logical Grouping: Projects group related features together, making it easier to manage and track them.
Feature Definitions: Within a project, you can define features, including their metadata, types, and sources. This helps standardize how features are created and consumed.
Isolation: Projects provide a way to isolate different environments, such as development, testing, and production, ensuring that changes in one project do not affect others.
Collaboration: By organizing features within projects, teams can collaborate more effectively, with clear boundaries around the features they are responsible for.
Access Control: Projects can implement permissions, allowing different users or teams to access only the features relevant to their work.
A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features.
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.
Below is an example data source with a single entity column (driver) and two feature columns (trips_today and rating).
Feast supports primarily time-stamped tabular data as data sources. There are many kinds of possible data sources:
Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
[Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
(Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on-demand feature views, which allow light-weight feature engineering and combining features across sources.
Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).
A key command to use in Feast is the materialize_incremental command, which fetches the latest values for all entities in the batch source and ingests these values into the online store.
Materialization can be called programmatically or through the CLI:
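A sketch of both options (the end timestamp below is illustrative):

```python
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Programmatically: materialize everything new up to the given end date
store.materialize_incremental(end_date=datetime.utcnow())

# Or via the CLI, e.g.:
#   feast materialize-incremental 2024-01-01T00:00:00
```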
If the schema parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply. The way it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box, this inference is performed by inspecting the schema of the table in the cloud data warehouse, or, if a query is provided to the source, by running the query with a LIMIT clause and inspecting the result.
Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.
To push data into the offline or online stores: see push sources for details.
(experimental) To use a contrib Spark processor to ingest from a topic, see Tutorial: Building streaming features.
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.
Entities are used by Feast in many contexts, as we explore below:
Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:
zero entities (e.g. a global feature like num_daily_global_transactions)
one entity (e.g. a user feature like user_age or last_5_bought_items)
multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)
Feast refers to this collection of entities for a feature view as an entity key.
Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists to understand how other teams build features for the entity they are most interested in.
Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.
At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch point-in-time correct features for to generate a training dataset.
At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).
Q: Can I retrieve features for all entities?
Kind of.
In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an open GitHub issue that welcomes contribution to make this a more intuitive API.
For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.
Feature values in Feast are modeled as time-series records. Below is an example of a driver feature view with two feature columns (trips_today and earnings_today):
The above table can be registered with Feast through the following feature view:
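A sketch of such a feature view, assuming a parquet file source and the Field/schema API of recent Feast versions (paths and API details vary across releases):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    name="driver_hourly_stats_source",
    path="data/driver_stats.parquet",   # placeholder path
    timestamp_field="event_timestamp",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(hours=2),             # how far back Feast scans during point-in-time joins
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="earnings_today", dtype=Float32),
    ],
    source=driver_stats_source,
)
```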
Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.
Given the following entity dataframe, imagine a user would like to join the above driver_hourly_stats feature view onto it, while preserving the trip_success column:
The timestamps within the entity dataframe above are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:
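A sketch of that retrieval, assuming the driver_hourly_stats feature view above and illustrative entity values:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe: entity keys + event timestamps (+ any columns to preserve, e.g. labels)
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(
            ["2022-05-11 11:00:00", "2022-05-11 16:00:00"], utc=True
        ),
        "trip_success": [1, 0],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:trips_today",
        "driver_hourly_stats:earnings_today",
    ],
).to_df()
```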
For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time specified.
Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).
Below is the resulting joined training dataframe. It contains both the original entity rows and joined feature values:
Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.
Feast datasets allow for conveniently saving dataframes that include both features and entities, so they can subsequently be used for data analysis and model training. This was the primary motivation for creating the dataset concept.
A dataset's metadata is stored in the Feast registry, and the raw data (features, entities, additional input keys, and timestamps) is stored in the offline store.
A dataset can be created from:
Results of historical retrieval
[planned] Logging requests (including inputs for on-demand transformations) and responses during feature serving
[planned] Logging features during writing to online store (from batch source or stream)
To create a saved dataset from historical features for later retrieval or analysis, a user needs to call the get_historical_features method first and then pass the returned retrieval job to the create_saved_dataset method. create_saved_dataset will trigger the provided retrieval job (by calling .persist() on it) to store the data using the specified storage behind the scenes. The storage type must be the same as the globally configured offline store (e.g., it is impossible to persist data to a different offline source). create_saved_dataset will also create a SavedDataset object with all of the related metadata and will write this object to the registry.
A saved dataset can be retrieved later using the get_saved_dataset method on the feature store:
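A sketch of both steps, assuming a file-based offline store (names, paths, and entity values are illustrative):

```python
import pandas as pd
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": pd.to_datetime(["2022-05-11 11:00:00"], utc=True),
    }
)

# 1. Run historical retrieval and persist the result as a saved dataset
retrieval_job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:trips_today"],
)
store.create_saved_dataset(
    from_=retrieval_job,
    name="my_training_dataset",
    storage=SavedDatasetFileStorage(path="my_training_dataset.parquet"),
)

# 2. Retrieve the saved dataset later for analysis or validation
df = store.get_saved_dataset("my_training_dataset").to_df()
```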
An offline store is an interface for working with historical time-series feature values that are stored in data sources. The OfflineStore interface has several different implementations, such as the BigQueryOfflineStore, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see the list of supported offline stores.
Offline stores are primarily used for two reasons:
Building training datasets from time-series features.
Materializing (loading) features into an online store to serve those features at low-latency in a production setting.
Offline stores are configured through feature_store.yaml. When building training datasets or materializing features into an online store, Feast will use the configured offline store with your configured data sources to execute the necessary data operations.
Only a single offline store can be used at a time. Moreover, offline stores are not compatible with all data sources; for example, the BigQuery offline store cannot be used to query a file-based data source.
Please see the push source documentation for more details on how to push features directly to the offline store in your feature store.
Tags in Feast allow for efficient filtering of Feast objects when listing them in the UI, CLI, or querying the registry directly.
Tags are defined on Feast objects either in the definition file or directly on the object that is applied to the feature store.
For example, a Feature View (or Stream Feature View) can be defined with a tag, either in a definition file or directly in code:
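A sketch of a tagged feature view definition (other arguments are abbreviated; the tag key/value follows the example used in this section):

```python
from feast import FeatureView, Field, FileSource
from feast.types import Int64

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[],  # entities elided for brevity
    schema=[Field(name="trips_today", dtype=Int64)],
    source=FileSource(
        path="data/driver_stats.parquet",   # placeholder path
        timestamp_field="event_timestamp",
    ),
    tags={"team": "driver_performance"},    # arbitrary key/value metadata used for filtering
)
```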
Below is an example of filtering feature views with the tag team:driver_performance, followed by the same listing without tag filtering:
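A sketch of the CLI usage; the --tags option on the list command is assumed here and may differ by Feast version:

```bash
# List only feature views carrying the tag
feast feature-views list --tags team:driver_performance

# List all feature views, without tag filtering
feast feature-views list
```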
The Feast feature registry is a central catalog of all feature definitions and their related metadata. Feast uses the registry to store all applied Feast objects (e.g. Feature views, entities, etc). It allows data scientists to search, discover, and collaborate on new features. The registry exposes methods to apply, list, retrieve and delete these objects, and is an abstraction with multiple implementations.
Feast comes with built-in file-based and SQL-based registry implementations. By default, Feast uses a file-based registry, which stores the protobuf representation of the registry as a serialized file in the local file system. For more details on which registries are supported, please see the registry reference.
We recommend users store their Feast feature definitions in a version controlled repository, which then stays synced with the registry automatically via CI/CD. Users will often also want multiple registries to correspond to different environments (e.g. dev vs staging vs prod), with staging and production registries having locked-down write access since they can impact real user traffic. See the guide on structuring feature repositories for details on how to set this up.
Users can specify the registry through a feature_store.yaml config file, or programmatically. We often see teams prefer the programmatic approach because it makes notebook-driven development very easy. Instantiating a FeatureStore object can then point to the feature_store.yaml file:
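A minimal sketch, assuming a local provider and illustrative registry and store paths:

```yaml
# feature_store.yaml
project: my_project
provider: local
registry: data/registry.db        # could also point to a cloud object store path
online_store:
    type: sqlite
    path: data/online_store.db
```

```python
from feast import FeatureStore

# Point the SDK at the directory containing feature_store.yaml
store = FeatureStore(repo_path=".")
```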
A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).
Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
If the built-in providers are not sufficient, you can create your own custom provider. Please see the guide on adding a custom provider for more details.
Please see the feature_store.yaml reference for configuring providers.
Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize command.
The storage schema of features within the online store mirrors that of the original data source. One key difference is that, for each entity key, only the latest feature values are stored. No historical values are stored.
Here is an example batch data source:
Once the above data source is materialized into Feast (using feast materialize), the feature values will be stored as follows:
A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.
A materialization engine abstracts over the specific technologies or frameworks used to materialize data. It allows users to use a pure local serialized approach (the default LocalMaterializationEngine), or to delegate the materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).
If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see the guide on creating a custom materialization engine for more details.
Please see the feature_store.yaml reference for configuring engines.
Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
Create Stream Features: Stream features are created from streaming services such as Kafka or Kinesis, and can be pushed directly into Feast via the Push API.
Feast Apply: The user (or CI) publishes version-controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.
Feast Materialize: The user (or scheduler) executes feast materialize, which loads features from the offline store into the online store.
Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset that can be used for training models.
Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
Prediction: A backend system makes a request for a prediction from the model serving service.
Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
A complete Feast deployment contains the following components:
Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
Feast Python SDK/CLI: The primary user facing SDK. Used to:
Manage version controlled feature definitions.
Materialize (load) feature values into the online store.
Build and retrieve training datasets from the offline store.
Retrieve online features.
Stream Processor: The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
Authorization Manager: The authorization manager detects authentication tokens from client requests to Feast servers and uses this information to enforce permission policies on the requested services.
The auth section includes a type field specifying the actual authorization protocol, along with protocol-specific fields.
Check out our tutorials to see how this concept can be applied in a real-world use case.
The file-based feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.
Features can also be written directly to the online store via the Push API.
Batch Materialization Engine: The component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through .
We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.
Don't see your offline store or online store of choice here? Check out our guides to make a custom one!
In order for a plugin integration to be highlighted, it must meet the following requirements:
The plugin must have tests. Ideally it would use the Feast universal tests (see this guide for an example), but custom tests are fine.
The plugin must have some basic documentation on how it should be used.
The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).
In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:
The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.
There is documentation and a tutorial on how to use the integration.
The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.
If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.
Don't see your question?
We encourage you to ask questions on GitHub. Even better, once you get an answer, add the answer to this FAQ via a pull request!
The quickstart is the easiest way to learn about Feast. For more detailed tutorials, please check out the tutorials page.
No, there are feature views without entities.
Feast expects that each version of a model corresponds to a different feature service.
Feature views, once they are used by a feature service, are intended to be immutable and not deleted (until a feature service is removed). In the future, feast plan and feast apply will throw errors if they see this kind of behavior.
The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see data sources and offline store for more details.
Yes, this is possible. For example, you can use BigQuery as an offline store and Redis as an online store.
Q: Can I call get_historical_features without providing an entity dataframe?
Feast does not provide a way to do this right now. This is an area we're actively interested in contributions for. See the open GitHub issue.
Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).
It is a good idea, though, to lock down the registry file so only the CI/CD pipeline can modify it. That way data scientists and other users cannot accidentally modify the registry and lose other teams' data.
Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support push based ingestion. Feast also defines a stream processor that allows a deeper integration with stream sources.
There are several kinds of transformations:
On demand transformations (See docs)
These transformations are Pandas transformations run on batch data when you call get_historical_features and at online serving time when you call get_online_features. Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well.
Batch transformations (WIP, see RFC)
These will include SQL + PySpark based transformations on batch data sources.
Streaming transformations (RFC in progress)
Yes. See documentation.
A feature view can be defined with multiple entities. Since each entity has a unique join_key, using multiple entities will achieve the effect of a composite key.
Feast is designed to work at scale and support low latency online serving. See our benchmark blog post for details.
Yes. Specifically:
Simple lists / dense embeddings:
BigQuery supports list types natively
Redshift does not support list types, so you'll need to serialize these features into strings (e.g. json or protocol buffers)
Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types (see reference)
Sparse embeddings (e.g. one hot encodings)
One way to do this efficiently is to use a protobuf or string representation of a sparse tensor (see https://www.tensorflow.org/guide/sparse_tensor).
The list of supported offline and online stores can be found here and here, respectively. The roadmap indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about customizing Feast here.
Yes. Using a GCP or AWS provider in feature_store.yaml primarily sets default offline / online stores and configures where the remote registry file can live. You can override the offline and online stores to be in different clouds if you wish.
The data source and the offline store are closely tied, but separate concepts. The offline store controls how Feast talks to a data store for historical feature retrieval, and the data source points to a specific table (or query) within a data store. Offline stores are infrastructure-level connectors to data stores like Snowflake.
Additional differences:
Data sources may be specific to a project (e.g. feed ranking), but offline stores are agnostic and used across projects.
A Feast project may define several data sources that power different feature views, but a Feast project has a single offline store.
Feast users typically need to define data sources when using Feast, but only need to use/configure existing offline stores without creating new ones.
Please follow the instructions here.
Yes. For example, the Postgres connector can be used as both an offline and online store (as well as the registry).
Yes. There are two ways to use S3 in Feast:
Using Redshift as a data source via Spectrum (AWS tutorial), and then continuing with the Running Feast with Snowflake/GCP/AWS guide. See a presentation we did on this at our apply() meetup.
Using the s3_endpoint_override option in a FileSource data source. This endpoint is more suitable for quick proofs of concept that won't necessarily scale for production use cases.
Please see the roadmap.
For more details on contributing to the Feast community, see the contribution guides.
Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this document for more details.
Please see this document. If you have any questions or suggestions, feel free to leave a comment on the document!
Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the roadmap.
Initial demonstration of Snowflake as an offline+online store with Feast, using the Snowflake demo template.
In the steps below, we will set up a sample Feast project that leverages Snowflake as an offline store + materialization engine + online store.
Starting with data in a Snowflake table, we will register that table to the feature store and define features associated with the columns in that table. From there, we will generate historical training data based on those feature definitions and then materialize the latest feature values into the online store. Lastly, we will retrieve the materialized feature values.
Our template will generate new data containing driver statistics. From there, we will show you code snippets that call the offline store to generate training datasets, and then the code that calls the online store to serve you the latest feature values for models in production.
The following files will automatically be created in your project folder:
feature_store.yaml -- This is your main configuration file
driver_repo.py -- This is your main feature definition file
test.py -- This is a file to test your feature store configuration
feature_store.yaml
Here you will see the information that you entered. This template will use Snowflake as the offline store, materialization engine, and online store. The main thing to remember is that, by default, Snowflake objects have ALL CAPS names unless lower case was specified.
test.py
Install Feast using pip:
Install Feast with Snowflake dependencies (required when using Snowflake):
Install Feast with GCP dependencies (required when using BigQuery or Firestore):
Install Feast with AWS dependencies (required when using Redshift or DynamoDB):
Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):
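A consolidated sketch of these install commands (the extras names follow the Feast PyPI package and may change between releases):

```bash
pip install feast                  # base install
pip install 'feast[snowflake]'     # Snowflake offline/online store support
pip install 'feast[gcp]'           # BigQuery / Firestore (Datastore) support
pip install 'feast[aws]'           # Redshift / DynamoDB support
pip install 'feast[redis]'         # Redis online store support
```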
The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.
Here we'll be using the example repository we created in the previous guide, Create a feature store. You can re-create it by running feast init in a new directory.
To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:
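For example, from the root of the feature repository:

```bash
feast apply
```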
Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.
At this point, no data has been materialized to your online store. feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure, such as tables. To load data into the online store, run feast materialize. See Load data into the online store for more details.
If you need to clean up the infrastructure created by feast apply, use the teardown command.
Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!
A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See Feature Repository for a detailed explanation of feature repositories.
The easiest way to create a new feature repository is to use the feast init command. The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development. Create the repository and enter the directory:
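A sketch of these steps (the repository name is a placeholder; the exact generated file layout varies by Feast version):

```bash
feast init my_feature_repo   # creates a feature definition file, feature_store.yaml, and sample data
cd my_feature_repo
```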
You can now use this feature repository for development. You can try the following:
Run feast apply to apply these definitions to Feast.
Edit the example feature definitions in example.py and run feast apply again to change feature definitions.
Initialize a git repository in the same directory and check the feature repository into version control.
Feast supports registering streaming feature views and Kafka and Kinesis streaming sources. It also provides an interface for stream processing called the Stream Processor. An example Kafka/Spark StreamProcessor is implemented in the contrib folder. For more details, please see the RFC.
Please see here for a tutorial on how to build a versioned streaming pipeline that registers your transformations, features, and data sources in Feast.
In this tutorial, we will use the public dataset of Chicago taxi trips to present data validation capabilities of Feast.
The original dataset is stored in BigQuery and consists of raw data for each taxi trip (one row per trip) since 2013.
We will generate several training datasets (aka historical features in Feast) for different periods and evaluate expectations made on one dataset against another.
Types of features we're ingesting and generating:
Features that aggregate raw data with daily intervals (e.g., trips per day, average fare or speed for a specific day, etc.).
Features using SQL while pulling data from BigQuery (like total trips time or total miles travelled).
Features calculated on the fly when requested using Feast's on-demand transformations
Our plan:
Prepare environment
Pull data from BigQuery (optional)
Declare & apply features and feature views in Feast
Generate reference dataset
Develop & test profiler function
Run validation on different dataset using reference dataset & profiler
The original notebook and datasets for this tutorial can be found on GitHub.
Install Feast Python SDK and great expectations:
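A sketch of the install, assuming the ge extra bundles the Great Expectations dependency:

```bash
pip install 'feast[ge]'
```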
You can skip this step if you don't have a GCP account. Please use the parquet files that come with this tutorial instead.
We run some basic aggregations while pulling data from BigQuery, grouping by taxi_id and day:
Read more about feature views in Feast docs
Read more about on demand feature views here
Generating range of timestamps with daily frequency:
Cross merge (aka relation multiplication) produces entity dataframe with each taxi_id repeated for each timestamp:
156984 rows × 2 columns
Retrieving historical features for resulting entity dataframe and persisting output as a saved dataset:
A dataset profiler is a function that accepts a dataset and generates a set of its characteristics. These characteristics will then be used to evaluate (validate) subsequent datasets.
Important: datasets are not compared to each other! Feast uses a reference dataset and a profiler function to generate a reference profile. This profile will then be used during validation of the tested dataset.
Loading saved dataset first and exploring the data:
156984 rows × 10 columns
Feast uses Great Expectations as a validation engine and ExpectationSuite as a dataset's profile. Hence, we need to develop a function that will generate an ExpectationSuite. This function will receive an instance of PandasDataset (a wrapper around pandas.DataFrame), so we can utilize both the Pandas DataFrame API and some helper functions from PandasDataset during profiling.
Testing our profiler function:
Verify that all expectations that we coded in our profiler are present here. Otherwise (if you can't find some expectations), it means that they failed to pass on the reference dataset (failing silently is the default behavior of Great Expectations).
Now we can create validation reference from dataset and profiler function:
and test it against our existing retrieval job
Validation passed successfully, as no exceptions were raised.
Creating new timestamps for Dec 2020:
35448 rows × 2 columns
Execute retrieval job with validation reference:
Validation failed since several expectations didn't pass:
Trip count (mean) decreased more than 10% (which is expected when comparing Dec 2020 vs June 2019)
Average Fare increased - all quantiles are higher than expected
Earn per hour (mean) increased more than 10% (most probably due to increased fare)
These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.
This tutorial covers a common use case in machine learning: an end-to-end, production-ready fraud prediction system that predicts, in real time, whether a transaction made by a user is fraudulent.
Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.
Our end-to-end example will perform the following workflows:
Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model
Making online predictions from feature data
Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):
Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.
When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.
In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.
This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.
This end-to-end tutorial will take you through the following steps:
Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features
Deploying Redshift as the interface Feast uses to build training datasets
Registering your features with Feast and configuring DynamoDB for online serving
Building a training dataset with Feast to train your credit scoring model
Loading feature values from S3 into DynamoDB
Making online predictions with your credit scoring model using features from DynamoDB
The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.
Online stores only maintain the current state of features, i.e. the latest feature values. No historical data is stored or served.
Please ensure that you have materialized (loaded) your feature values into the online store before starting.
Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.
Next, we will create a feature store object and call get_online_features(), which reads the relevant feature values directly from the online store.
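A sketch of online retrieval, with illustrative feature references and entity values:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Feature references usually accompany the trained model binary
features = [
    "driver_hourly_stats:trips_today",
    "driver_hourly_stats:earnings_today",
]

online_features = store.get_online_features(
    features=features,
    entity_rows=[{"driver_id": 1001}],   # placeholder entity key value
).to_dict()
```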
After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise (e.g. on Kubernetes).
A typical production architecture looks like:
Important note: Feast is highly customizable and modular.
Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.
For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.
In this guide we will show you how to:
Deploy your feature store and keep your infrastructure in sync with your feature repository
Keep the data in your online store up to date (from batch and stream sources)
Use Feast for model training and serving
The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running feast apply commits feature definitions to a registry, which users can then read elsewhere.
Note: A SQL-based registry primarily works with a Python feature server. The Java feature server does not understand this registry type yet.
We typically recommend setting up CI/CD to automatically run feast plan and feast apply when pull requests are opened / merged.
A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set).
Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.
To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.
Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.
It is up to you to orchestrate and schedule runs of materialization.
However, the amount of work can quickly outgrow the resources of a single machine. That happens because the materialization job needs to repackage all rows before writing them to an online store. That leads to high utilization of CPU and memory. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.
Important note: Airflow worker must have read and write permissions to the registry file on GCS / S3 since it pulls configuration and updates materialization history.
This supports pushing feature values into Feast to both online and offline stores.
After we've defined our features and data sources in the repository, we can generate training datasets. We highly recommend you use a FeatureService to version the features that go into a specific model version. The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.
One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.
Then, you need to generate an entity dataframe. You have two options:
Create an entity dataframe manually and pass it in
Use a SQL query to dynamically generate lists of entities (e.g. all entities within a time range) and timestamps to pass into Feast
Then, training data can be retrieved as follows:
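A sketch of training-data retrieval, assuming a feature service named driver_model_v1 and a manually created entity dataframe (both names and values are illustrative):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")   # directory containing feature_store.yaml / registry reference

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(
            ["2022-05-11 11:00:00", "2022-05-11 16:00:00"], utc=True
        ),
    }
)

# Versioned set of features tied to a specific model version
feature_service = store.get_feature_service("driver_model_v1")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=feature_service,
).to_df()
```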
The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the feature service name and the model versions have some established convention.
For example, in MLflow:
It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.
Once you have successfully loaded data from batch / streaming sources into the online store, you can start consuming features for model inference.
This approach is the most convenient to keep your infrastructure as minimalistic as possible and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.
To integrate online retrieval into your service use the following code:
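A compact sketch, again assuming the hypothetical driver_model_v1 feature service:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=store.get_feature_service("driver_model_v1"),
    entity_rows=[{"driver_id": 1001}],   # placeholder entity key
).to_dict()

# feature_vector maps feature names to lists of values, ready to feed the model
```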
Basic steps
Add the Feast Helm repository and download the latest charts:
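A sketch of these commands; the chart repository URL follows the Feast Helm charts and should be verified against the current docs:

```bash
helm repo add feast-charts https://feast-helm-charts.storage.googleapis.com
helm repo update
```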
Run helm install:
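A sketch of the install, assuming the feast-feature-server chart and its feature_store_yaml_base64 value (chart and value names may differ by chart version):

```bash
helm install feast-release feast-charts/feast-feature-server \
  --set feature_store_yaml_base64=$(base64 < feature_store.yaml)
```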
You might want to dynamically set parts of your configuration from your environment. For instance, to deploy Feast to production and development with the same configuration but a different server, or to inject secrets without exposing them in your git repo. To do this, it is possible to use the ${ENV_VAR} syntax in your feature_store.yaml file. For instance:
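A sketch of such a configuration, using a Redis online store whose connection string is injected from the environment:

```yaml
project: my_project
provider: local
registry: data/registry.db
online_store:
    type: redis
    connection_string: ${REDIS_CONNECTION_STRING}   # resolved from the environment at runtime
```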
In summary, the overall architecture in production may look like:
The Feast SDK is triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast database-backed registry.
Data ingestion
Batch data: Airflow manages batch transformation jobs + materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
If your offline and online workloads are in Snowflake, the Snowflake materialization engine is likely the best option.
If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax materialization engine is likely the best option.
If none of these engines suits your needs, you may continue using the in-process engine, or write a custom engine (e.g. with Spark or Ray).
Stream data: The Feast Push API is used within existing Spark / Beam pipelines to push feature values to offline / online stores
Online features are served via the Python feature server over HTTP, or consumed using the Feast Python SDK.
Feast Python SDK is called locally to generate a training dataset
Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.
Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.
The materialize command allows users to materialize features over a specific historical time range into the online store.
The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.
It is also possible to materialize for specific feature views by using the -v / --views argument.
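A sketch of both invocations (the feature view name and start timestamp are illustrative; the end timestamp matches the incremental example below):

```bash
# Materialize all feature views over a time range
feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00

# Materialize only selected feature views
feast materialize --views driver_hourly_stats 2021-04-07T00:00:00 2021-04-08T00:00:00
```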
The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
For simplicity, Feast also provides a materialize command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry. The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).
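As described above, only the end timestamp is supplied:

```bash
feast materialize-incremental 2021-04-08T00:00:00
```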
The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.
Unlike materialize, materialize-incremental automatically determines the start time from which to load features from the batch sources of each feature view. The first time materialize-incremental is executed, it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.
Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e., different feature views may have different periods that have been materialized into the online store.
Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.
Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature tables. The only requirement is that the feature tables that make up the feature references have the same entity (or composite entity), and that they aren't located in multiple offline stores.
3. Create an entity dataframe
An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature tables onto it. All entities found in feature views that are being joined onto the entity dataframe must be found as columns on the entity dataframe.
It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.
Pandas:
In the example below we create a Pandas-based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas-based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL-based entity dataframe.
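A sketch of such an entity dataframe (the driver_id value and timestamp are illustrative):

```python
from datetime import datetime

import pandas as pd

entity_df = pd.DataFrame(
    {
        "event_timestamp": [pd.Timestamp(datetime(2021, 4, 12, 10, 59, 42), tz="UTC")],
        "driver_id": [1001],
    }
)
```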
SQL (Alternative):
Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
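A sketch of a SQL-based entity dataframe (the table name is a placeholder):

```python
# The query must return event_timestamp plus all required entity columns
entity_df = """
SELECT
    event_timestamp,
    driver_id
FROM `my_project.my_dataset.driver_orders`
"""
```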
4. Launch historical retrieval
Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().
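A sketch of launching the retrieval, using the feature reference from step 1 and the entity dataframe built above:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

retrieval_job = store.get_historical_features(
    entity_df=entity_df,                            # Pandas dataframe or SQL string from above
    features=["driver_trips:average_daily_rides"],  # feature references from step 1
)
training_df = retrieval_job.to_df()
```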
A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set). Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.
There are three common ways teams approach having separate environments:
Have separate git branches for each environment
Have separate feature_store.yaml files and separate Feast object definitions that correspond to each environment
Have separate feature_store.yaml files per environment, but share the Feast object definitions
To keep a clear separation of the feature repos, teams may choose to have multiple long-lived branches in their version control system, one for each environment. In this approach, with CI/CD setup, changes would first be made to the staging branch, and then copied over manually to the production branch once verified in the staging environment.
For the approach with separate feature_store.yaml files and separate Feast object definitions, we have created an example repository which contains two Feast projects, one per environment.
The contents of this repository are shown below:
The repository contains three sub-folders:
staging/: This folder contains the staging feature_store.yaml and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.
production/: This folder contains the production feature_store.yaml and Feast objects. Typically users would first test changes in staging before copying the feature definitions into the production folder and committing the changes.
.github: This folder is an example of a CI system that applies the changes in either the staging or production repositories using feast apply. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.
The feature_store.yaml contains the following:
Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply are tracked in the registry.db. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.
It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.
If your organization consists of many independent data science teams or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment:
The approach with separate feature_store.yaml files per environment is very similar to the previous approach, but instead of duplicating Feast objects and having to copy over changes, it may be possible to share the same Feast object definitions and have different feature_store.yaml configurations.
An example of how such a repository would be structured is as follows:
Users can then apply the shared Feast object definitions to each environment in this way:
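A sketch of this, assuming the -f / --feature-store-yaml CLI option is available to select a config file (verify against your Feast version):

```bash
feast -f staging/feature_store.yaml apply
feast -f production/feature_store.yaml apply
```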
This setup has the advantage that you can share the feature definitions entirely, which may prevent issues with copy-pasting code.
In summary, once you have set up a Git-based repository with CI that runs feast apply on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.
Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap such components with more performant components, at the cost of Feast depending on additional infrastructure.
The default Feast registry is file-based. Any changes to the feature repo, or materializing data into the online store, result in a mutation to the registry.
However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
The recommended solution in this case is to use the SQL-based registry, which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL, Postgres, etc).
The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale for large data sets, since it's executed on a single process.
Feast supports pluggable materialization engines that allow the materialization process to be scaled up. Aside from the local process, Feast supports a Lambda-based materialization engine and a Bytewax-based materialization engine.
Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.
Additionally, please check the how-to guide for some specific recommendations on how to scale Feast.
Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database; see the scaling guide for details.
Different options are presented in the reference documentation.
This approach may not scale to the large amounts of data that users of Feast may be dealing with in production. In this case, we recommend using one of the more scalable materialization engines, such as the Snowflake or Bytewax materialization engine. Users may also need to write a custom materialization engine to work with their existing infrastructure.
See also the materialization section below for code snippets.
Feast keeps the history of materialization in its registry, so the choice could be as simple as a cron job. A cron utility should be sufficient when you have just a few materialization jobs (usually one materialization job per feature view) triggered infrequently.
If you are using Airflow as a scheduler, Feast can be invoked through the PythonOperator after the Feast Python SDK has been installed into a virtual environment and your feature repo has been synced:
A more complete example is available in the Feast documentation.
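A sketch of an Airflow task, assuming the Airflow 2.x TaskFlow API (the schedule parameter name varies between Airflow versions) and a synced feature repository at a placeholder path:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2023, 1, 1), catchup=False)
def feast_materialize_dag():
    @task()
    def materialize():
        from feast import FeatureStore

        # Path to the synced feature repo containing feature_store.yaml (placeholder)
        store = FeatureStore(repo_path="/opt/airflow/feature_repo")
        store.materialize_incremental(end_date=datetime.utcnow())

    materialize()


feast_materialize_dag()
```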
See the push source documentation for more details; it shows how to ingest streaming features or 3rd-party feature data via the push API.
Feast does not orchestrate batch transformation DAGs. For this, you can rely on tools like Airflow + dbt; see the relevant guides for an example and some tips.
For more details, see the reference documentation.
To deploy a Feast feature server on Kubernetes, you can use the included Helm chart (which also has detailed instructions and an example tutorial).
Install kubectl and Helm 3.
This will deploy a single service. The service must have read access to the registry file on cloud storage and to the online store (e.g. via appropriate IAM permissions or service credentials). It will keep a copy of the registry in memory and periodically refresh it, so expect some delay in update propagation in exchange for better performance.
Alternatively, the same Helm chart can be deployed with a different feature server configuration.
Feast batch materialization operations (materialize and materialize-incremental) execute through a BatchMaterializationEngine.
Custom batch materialization engines allow Feast users to extend Feast to customize the materialization process. Examples include:
Setting up custom materialization-specific infrastructure during feast apply (e.g. setting up Spark clusters or Lambda functions)
Launching custom batch ingestion (materialization) jobs (Spark, Beam, AWS Lambda)
Tearing down custom materialization-specific infrastructure during feast teardown (e.g. tearing down Spark clusters, or deleting Lambda functions)
Feast comes with built-in materialization engines, e.g., the LocalMaterializationEngine, and an experimental LambdaMaterializationEngine. However, users can develop their own materialization engines by creating a class that implements the contract in the BatchMaterializationEngine class.
The fastest way to add custom logic to Feast is to extend an existing materialization engine. The most generic engine is the LocalMaterializationEngine, which contains no cloud-specific logic. The guide that follows will extend the LocalMaterializationEngine with operations that print text to the console. It is up to you as a developer to add your custom code to the engine methods, but the guide below will provide the necessary scaffolding to get you started.
The first step is to define a custom materialization engine class. We've created the MyCustomEngine below.
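A sketch of such a class, extending the local engine; import paths and method signatures vary across Feast versions, so treat this as scaffolding rather than a drop-in implementation:

```python
from feast.infra.materialization.local_engine import LocalMaterializationEngine


class MyCustomEngine(LocalMaterializationEngine):
    def update(self, project, views_to_delete, views_to_keep, entities_to_delete, entities_to_keep):
        # Set up any custom materialization infrastructure here (e.g. a Spark cluster)
        print("Creating custom infrastructure for materialization")
        super().update(project, views_to_delete, views_to_keep, entities_to_delete, entities_to_keep)

    def materialize(self, registry, tasks):
        # Launch custom batch jobs here; the parent implementation runs locally in-process
        print("Launching custom batch materialization jobs")
        return super().materialize(registry, tasks)
```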
Notice how in the above engine we have only overwritten two of the methods on the LocalMaterializationEngine, namely update and materialize. These two methods are convenient to replace if you are planning to launch custom batch jobs.
Configure your feature_store.yaml file to point to your new engine class:
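A sketch of the configuration, assuming the engine above lives in a hypothetical feast_custom_engine module on your PYTHONPATH:

```yaml
project: repo
registry: registry.db
provider: local
offline_store:
  type: file
online_store:
  type: sqlite
batch_engine: feast_custom_engine.MyCustomEngine
```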
Notice how the batch_engine
field above points to the module and class where your engine can be found.
Now you should be able to use your engine by running a Feast command:
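For example (the materialization window below is only illustrative):

```bash
feast apply
feast materialize 2021-01-01T00:00:00 2022-01-01T00:00:00
```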
It may also be necessary to add the module root path to your PYTHONPATH
as follows:
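For example, assuming your module root is /home/my_user/my_custom_engine:

```bash
PYTHONPATH=$PYTHONPATH:/home/my_user/my_custom_engine feast apply
```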
That's it. You should now have a fully functional custom engine!
This guide will go over:
how Feast tests are setup
how to extend the test suite to test new functionality
how to use the existing test suite to test a new custom offline / online store
Unit tests are contained in sdk/python/tests/unit
. Integration tests are contained in sdk/python/tests/integration
. Let's inspect the structure of sdk/python/tests/integration
:
feature_repos
has setup files for most tests in the test suite.
conftest.py
(in the parent directory) contains the most common fixtures, which are designed as an abstraction on top of specific offline/online stores, so tests do not need to be rewritten for different stores. Individual test files also contain more specific fixtures.
The tests are organized by which Feast component(s) they test.
The universal feature repo refers to a set of fixtures (e.g. environment
and universal_data_sources
) that can be parametrized to cover various combinations of offline stores, online stores, and providers. This allows tests to run against all these various combinations without requiring excess code. The universal feature repo is constructed by fixtures in conftest.py
with help from the various files in feature_repos
.
Tests in Feast are split into integration and unit tests. If a test requires external resources (e.g. cloud resources on GCP or AWS), it is an integration test. If a test can be run purely locally (where locally includes Docker resources), it is a unit test.
Integration tests test non-local Feast behavior. For example, tests that require reading data from BigQuery or materializing data to DynamoDB are integration tests. Integration tests also tend to involve more complex Feast functionality.
Unit tests test local Feast behavior. For example, tests that only require registering feature views are unit tests. Unit tests tend to only involve simple Feast functionality.
E2E tests
E2E tests test end-to-end functionality of Feast over the various codepaths (initialize a feature store, apply, and materialize).
The main codepaths include:
basic e2e tests for offline stores
test_universal_e2e.py
go feature server
test_go_feature_server.py
python http server
test_python_feature_server.py
data quality monitoring feature validation
test_validation.py
Offline and Online Store Tests
Offline and online store tests mainly test for the offline and online retrieval functionality.
The various specific functionalities that are tested include:
push API tests
test_push_features_to_offline_store.py
test_push_features_to_online_store.py
test_offline_write.py
historical retrieval tests
test_universal_historical_retrieval.py
online retrieval tests
test_universal_online.py
data quality monitoring feature logging tests
test_feature_logging.py
online store tests
test_universal_online.py
Registration Tests
The registration folder contains all of the registry tests and some universal cli tests. This includes:
CLI apply and materialize tests run against the universal test suite
Data type inference tests
Registry tests
Miscellaneous Tests
AWS Lambda Materialization Tests (Currently do not work)
test_lambda.py
Registry Diff Tests
These are tests for the infrastructure and registry diff functionality that Feast uses to determine whether changes to the registry or infrastructure are needed.
Local CLI Tests and Local Feast Tests
These tests test all of the cli commands against the local file offline store.
Infrastructure Unit Tests
DynamoDB tests with dynamo mocked out
Repository configuration tests
Schema inference unit tests
Key serialization tests
Basic provider unit tests
Feature Store Validation Tests
These tests mainly cover class-level validation, such as hashing tests, protobuf and class serialization, and error and warning handling.
Data source unit tests
Feature service unit tests
Feature service, feature view, and feature validation tests
Protobuf/json tests for Feast ValueTypes
Serialization tests
Type mapping
Feast types
Serialization tests due to this issue
Docstring tests are primarily smoke tests to make sure imports and setup functions can be executed without errors.
Let's look at a sample test using the universal repo:
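A trimmed-down sketch of what such a test looks like is shown below; the feature reference and the assertions are illustrative rather than the exact contents of the real test file, and the real tests also apply the universal feature views before retrieval.

```python
import pytest


@pytest.mark.integration
@pytest.mark.universal_offline_stores
@pytest.mark.parametrize("full_feature_names", [True, False])
def test_historical_features(environment, universal_data_sources, full_feature_names):
    # `environment` wraps a FeatureStore configured for one offline/online store combination.
    store = environment.feature_store

    # `universal_data_sources` provides the pre-defined entities, dataframes, and data sources.
    entities, datasets, data_sources = universal_data_sources

    # Run a point-in-time correct retrieval against the configured offline store
    # and assert on the resulting dataframe.
    job = store.get_historical_features(
        entity_df=datasets.entity_df,
        features=["driver_stats:conv_rate"],  # illustrative feature reference
        full_feature_names=full_feature_names,
    )
    actual_df = job.to_df()
    assert not actual_df.empty
```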
The key fixtures are the environment
and universal_data_sources
fixtures, which are defined in the feature_repos
directories and the conftest.py
file. This by default pulls in a standard dataset with driver and customer entities (that we have pre-defined), certain feature views, and feature values.
The environment
fixture sets up a feature store, parametrized by the provider and the online/offline store. It allows the test to query against that feature store without needing to worry about the underlying implementation or any setup that may be involved in creating instances of these datastores.
Each fixture creates a different integration test with its own IntegrationTestRepoConfig
which pytest uses to generate a unique test for each of the different environments that require testing.
Feast tests also use a variety of markers:
The @pytest.mark.integration
marker is used to designate integration tests which will cause the test to be run when you call make test-python-integration
.
The @pytest.mark.universal_offline_stores
marker will parametrize the test on all of the universal offline stores including file, redshift, bigquery and snowflake.
The full_feature_names
parametrization defines whether or not the test should reference features as their full feature name (fully qualified path) or just the feature name itself.
Use the same function signatures as an existing test (e.g. use environment
and universal_data_sources
as an argument) to include the relevant test fixtures.
If possible, expand an individual test instead of writing a new test, due to the cost of starting up offline / online stores.
Use the universal_offline_stores
and universal_online_store
markers to parametrize the test against different offline store and online store combinations. You can also designate specific online and offline stores to test by using the only
parameter on the marker.
Install Feast in editable mode with pip install -e
.
The core tests for offline / online store behavior are parametrized by the FULL_REPO_CONFIGS
variable defined in feature_repos/repo_configuration.py
. To overwrite this variable without modifying the Feast repo, create your own file that contains a FULL_REPO_CONFIGS
(which will require adding a new IntegrationTestRepoConfig
or two) and set the environment variable FULL_REPO_CONFIGS_MODULE
to point to that file. Then the core offline / online store tests can be run with make test-python-universal
.
See the custom offline store demo and the custom online store demo for examples.
Many problems arise when implementing your data store's type conversion to interface with Feast datatypes.
You will need to correctly update inference.py so that Feast can infer your datasource schemas.
You also need to update type_map.py so that Feast knows how to convert your datastore's types to Feast-recognized types in feast/types.py.
The most important functionality in Feast is historical and online retrieval. Most of the e2e and universal integration tests test this functionality in some way. Making sure this functionality works also indirectly asserts that reading and writing from your datastore works as intended.
Extend data_source_creator.py
for your offline store.
In repo_configuration.py
add a new IntegrationTestRepoConfig
or two (depending on how many online stores you want to test).
Generally, you should only need to test against sqlite. However, if you need to test against a production online store, then you can also test against Redis or dynamodb.
Run the full test suite with make test-python-integration.
This folder is for plugins that are officially maintained with community owners. Place the APIs in feast/infra/offline_stores/contrib/
.
Extend data_source_creator.py
for your offline store and implement the required APIs.
In contrib_repo_configuration.py
add a new IntegrationTestRepoConfig
(or more, depending on how many online stores you want to test).
Run the test suite on the contrib test suite with make test-python-contrib-universal
.
In repo_configuration.py
add a new config that maps to a serialized version of configuration you need in feature_store.yaml
to set up the online store.
In repo_configuration.py
, add new IntegrationTestRepoConfig
for online stores you want to test.
Run the full test suite with make test-python-integration
Check test_universal_types.py
for an example of how to do this.
Install Redis on your computer. If you are a mac user, you should be able to brew install redis
.
Running redis-server --help
and redis-cli --help
should show corresponding help menus.
Run ./infra/scripts/redis-cluster.sh start
then ./infra/scripts/redis-cluster.sh create
to start the Redis cluster locally. You should see output that looks like this:
You should be able to run the integration tests and have the Redis cluster tests pass.
If you would like to run your own Redis cluster, you can run the above commands with your own specified ports and connect to the newly configured cluster.
To stop the cluster, run ./infra/scripts/redis-cluster.sh stop
and then ./infra/scripts/redis-cluster.sh clean
.
Feast makes adding support for a new online store (database) easy. Developers can simply implement the OnlineStore interface to add support for a new store (other than the existing stores like Redis, DynamoDB, SQLite, and Datastore).
In this guide, we will show you how to integrate with MySQL as an online store. While we will be implementing a specific store, this guide should be representative for adding support for any new online store.
The full working code for this guide can be found at feast-dev/feast-custom-online-store-demo.
The process of using a custom online store consists of 6 steps:
Defining the OnlineStore
class.
Defining the OnlineStoreConfig
class.
Referencing the OnlineStore
in a feature repo's feature_store.yaml
file.
Testing the OnlineStore
class.
Updating dependencies.
Adding documentation.
OnlineStore class names must end with the OnlineStore suffix!
New online stores go in sdk/python/feast/infra/online_stores/contrib/
.
Not guaranteed to implement all interface methods
Not guaranteed to be stable.
Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers.
To move an online store plugin out of contrib, you need:
GitHub Actions (i.e. make test-python-integration) is set up to run all tests against the online store, and they pass.
At least two contributors own the plugin (ideally tracked in our OWNERS
/ CODEOWNERS
file).
The OnlineStore class broadly contains two sets of methods:
One set deals with managing the infrastructure that the online store needs for its operations.
One set deals with writing data into the store and reading data from the store.
There are two methods that deal with managing infrastructure for online stores: update and teardown.
update
is invoked when users run feast apply
as a CLI command, or the FeatureStore.apply()
sdk method.
The update
method should be used to perform any operations necessary before data can be written to or read from the store. The update
method can be used to create MySQL tables in preparation for reads and writes to new feature views.
teardown
is invoked when users run feast teardown
or FeatureStore.teardown()
.
The teardown
method should be used to perform any clean-up operations. teardown
can be used to drop MySQL indices and tables corresponding to the feature views being deleted.
There are two methods that deal with writing data to and reading data from the online store: online_write_batch and online_read.
online_write_batch is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method).
online_read
is invoked when reading values from the online store using the FeatureStore.get_online_features()
method.
Additional configuration may be needed to allow the OnlineStore to talk to the backing store. For example, MySQL may need configuration information like the host at which the MySQL instance is running, credentials for connecting to the database, etc.
To facilitate configuration, all OnlineStore implementations are required to also define a corresponding OnlineStoreConfig class in the same file. This OnlineStoreConfig class should inherit from the FeastConfigBaseModel
class, which is defined here.
The FeastConfigBaseModel
is a pydantic class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.
This config class must contain a type field, which contains the fully qualified class name of its corresponding OnlineStore class.
Additionally, the name of the config class must be the same as the OnlineStore class, with the Config
suffix.
An example of the config class for MySQL:
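A sketch of what this might look like; the field names and the feast_custom_online_store.mysql module path are illustrative, so see the demo repository for the full version.

```python
from typing import Literal, Optional

from pydantic import StrictStr

from feast.repo_config import FeastConfigBaseModel


class MySQLOnlineStoreConfig(FeastConfigBaseModel):
    """Configuration for the example MySQL online store."""

    # Fully qualified class name of the corresponding OnlineStore class.
    type: Literal[
        "feast_custom_online_store.mysql.MySQLOnlineStore"
    ] = "feast_custom_online_store.mysql.MySQLOnlineStore"

    host: Optional[StrictStr] = None
    user: Optional[StrictStr] = None
    password: Optional[StrictStr] = None
    database: Optional[StrictStr] = None
```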
This configuration can be specified in the feature_store.yaml
as follows:
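For example (connection values are placeholders):

```yaml
online_store:
  type: feast_custom_online_store.mysql.MySQLOnlineStore
  host: 127.0.0.1
  user: root
  password: secret
  database: feast
```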
This configuration information is available to the methods of the OnlineStore via the config: RepoConfig parameter, which is passed into all the methods of the OnlineStore interface, specifically at the config.online_store field of the config parameter.
After implementing both these classes, the custom online store can be used by referencing it in a feature repo's feature_store.yaml
file, specifically in the online_store
field. The value specified should be the fully qualified class name of the OnlineStore.
As long as your OnlineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.
To use our MySQL online store, we can use the following feature_store.yaml
:
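A sketch of the full file, again with placeholder connection values and an illustrative module path:

```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: file
online_store:
  type: feast_custom_online_store.mysql.MySQLOnlineStore
  host: 127.0.0.1
  user: root
  password: secret
  database: feast
```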
If additional configuration for the online store is not required, then we can omit the other fields and only specify the type of the online store class as the value for the online_store field.
Even if you have created the OnlineStore
class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo.
In the Feast submodule, we can run all the unit tests and make sure they pass:
The universal tests, which are integration tests specifically intended to test offline and online stores, should be run against Feast to ensure that the Feast APIs work with your online store.
Feast parametrizes integration tests using the FULL_REPO_CONFIGS
variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py
which stores different online store classes for testing.
To overwrite these configurations, you can simply create your own file that contains a FULL_REPO_CONFIGS
variable, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE
to point to that file.
A sample FULL_REPO_CONFIGS_MODULE
looks something like this:
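A sketch of such a module, reusing the MySQL store from earlier in this guide; the import path for IntegrationTestRepoConfig reflects the layout of the Feast test suite and may differ between versions.

```python
from tests.integration.feature_repos.integration_test_repo_config import (
    IntegrationTestRepoConfig,
)

# Mirror the online_store block of your feature_store.yaml as a dictionary.
MYSQL_CONFIG = {
    "type": "feast_custom_online_store.mysql.MySQLOnlineStore",
    "host": "127.0.0.1",
    "user": "root",
    "password": "secret",
    "database": "feast",
}

FULL_REPO_CONFIGS = [
    IntegrationTestRepoConfig(online_store=MYSQL_CONFIG),
]
```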
If you are planning to start the online store locally (e.g. spin up a local Redis instance) for testing, then the dictionary entry should look something like this:
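For example, for a locally running Redis instance the entry might look roughly like the following (the exact parameter names of IntegrationTestRepoConfig may vary between Feast versions):

```python
IntegrationTestRepoConfig(
    online_store={"type": "redis", "connection_string": "localhost:6379,db=0"},
    # Set this to an OnlineStoreCreator subclass if pytest should spin up a
    # Dockerized instance of the store for you (see below).
    online_store_creator=None,
)
```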
If you are planning instead to use a Dockerized container to run your tests against your online store, you can define an OnlineStoreCreator
and replace the None
object above with your OnlineStoreCreator
class. You should make this class available to pytest through the PYTEST_PLUGINS
environment variable.
If you create a containerized docker image for testing, developers who are trying to test with your online store will not have to spin up their own instance of the online store for testing. An example of an OnlineStoreCreator
is shown below:
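A rough sketch, modeled on the Redis creator in the Feast test suite; it uses the testcontainers library, and the exact OnlineStoreCreator interface and import path may differ slightly between Feast versions.

```python
from typing import Dict

from testcontainers.redis import RedisContainer

from tests.integration.feature_repos.universal.online_store_creator import (
    OnlineStoreCreator,
)


class RedisOnlineStoreCreator(OnlineStoreCreator):
    def __init__(self, project_name: str, **kwargs):
        super().__init__(project_name)
        self.container = RedisContainer("redis:6.2.5")

    def create_online_store(self) -> Dict[str, str]:
        # Start the Dockerized store and return the online_store config that
        # the tests should use to connect to it.
        self.container.start()
        exposed_port = self.container.get_exposed_port("6379")
        return {"type": "redis", "connection_string": f"localhost:{exposed_port},db=0"}

    def teardown(self):
        self.container.stop()
```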
Add a Makefile target to the Makefile to run your datastore-specific tests by setting the FULL_REPO_CONFIGS_MODULE
environment variable. Add PYTEST_PLUGINS
if pytest is having trouble loading your DataSourceCreator
. You can remove certain tests that are not relevant or still do not work for your datastore using the -k
option.
If there are some tests that fail, this indicates that there is a mistake in the implementation of this online store!
Add any dependencies for your online store to our sdk/python/setup.py
under a new <ONLINE_STORE>_REQUIRED
list with the packages and add it to the setup script so that if your online store is needed, users can install the necessary python packages. These packages should be defined as extras so that they are not installed by users by default.
You will need to regenerate our requirements files. To do this, create separate pyenv environments for python 3.8, 3.9, and 3.10. In each environment, run the following commands:
Remember to add the documentation for your online store.
Add a new markdown file to docs/reference/online-stores/
.
You should also add a reference in docs/reference/online-stores/README.md
and docs/SUMMARY.md
. Add a new markdown document to document your online store functionality similar to how the other online stores are documented.
NOTE: Be sure to document the following things about your online store:
Be sure to cover how to create the datasource and what configuration is needed in the feature_store.yaml
file in order to create the datasource.
Make sure to flag that the online store is in alpha development.
Add some documentation on what the data model is for the specific online store for more clarity.
Finally, generate the python code docs by running:
All Feast operations execute through a provider. This includes operations like materializing data from the offline to the online store, updating infrastructure like databases, launching streaming ingestion jobs, building training datasets, and reading features from the online store.
Custom providers allow Feast users to extend Feast to execute any custom logic. Examples include:
Launching custom streaming ingestion jobs (Spark, Beam)
Launching custom batch ingestion (materialization) jobs (Spark, Beam)
Adding custom validation to feature repositories during feast apply
Adding custom infrastructure setup logic which runs during feast apply
Extending Feast commands with in-house metrics, logging, or tracing
Feast comes with built-in providers, e.g., LocalProvider
, GcpProvider
, and AwsProvider
. However, users can develop their own providers by creating a class that implements the contract in the Provider class.
This guide also comes with a fully functional custom provider demo repository. Please have a look at the repository for a representative example of what a custom provider looks like, or fork the repository when creating your own provider.
The fastest way to add custom logic to Feast is to extend an existing provider. The most generic provider is the LocalProvider
which contains no cloud-specific logic. The guide that follows will extend the LocalProvider
with operations that print text to the console. It is up to you as a developer to add your custom code to the provider methods, but the guide below will provide the necessary scaffolding to get you started.
The first step is to define a custom provider class. We've created the MyCustomProvider
below.
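A minimal sketch is shown below. It only adds print statements around LocalProvider; the method signatures are passed through with *args/**kwargs because they can change between Feast releases, so consult the Provider interface of your Feast version (or the demo repository) for the exact signatures.

```python
from feast.infra.local import LocalProvider


class MyCustomProvider(LocalProvider):
    """Example provider that wraps the local provider with custom hooks."""

    def update_infra(self, *args, **kwargs):
        super().update_infra(*args, **kwargs)
        # Hypothetical hook: launch idempotent streaming ingestion jobs here.
        print("Launching streaming jobs")

    def materialize_single_feature_view(self, *args, **kwargs):
        super().materialize_single_feature_view(*args, **kwargs)
        # Hypothetical hook: launch a custom batch ingestion job here.
        print("Launching batch ingestion job")
```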
Notice how in the above provider we have only overwritten two of the methods on the LocalProvider
, namely update_infra
and materialize_single_feature_view
. These two methods are convenient to replace if you are planning to launch custom batch or streaming jobs. update_infra
can be used for launching idempotent streaming jobs, and materialize_single_feature_view
can be used for launching batch ingestion jobs.
It is possible to overwrite all the methods on the provider class. In fact, it isn't even necessary to subclass an existing provider like LocalProvider
. The only requirement for the provider class is that it follows the Provider contract.
Configure your feature_store.yaml file to point to your new provider class:
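For example, assuming the provider above lives in a hypothetical feast_custom_provider/custom_provider.py module:

```yaml
project: repo
registry: registry.db
provider: feast_custom_provider.custom_provider.MyCustomProvider
online_store:
  type: sqlite
  path: online_store.db
offline_store:
  type: file
```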
Notice how the provider
field above points to the module and class where your provider can be found.
Now you should be able to use your provider by running a Feast command:
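For example:

```bash
feast apply
```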
It may also be necessary to add the module root path to your PYTHONPATH
as follows:
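For example, assuming the module root is /home/my_user/my_custom_provider:

```bash
PYTHONPATH=$PYTHONPATH:/home/my_user/my_custom_provider feast apply
```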
That's it. You should now have a fully functional custom provider!
Have a look at the custom provider demo repository for a fully functional example of a custom provider. Feel free to fork it when creating your own custom provider!
TLS (Transport Layer Security) and SSL (Secure Sockets Layer) are protocols that encrypt communication between a client and a server to provide enhanced security; the terms TLS and SSL are often used interchangeably. This article shows sample code for starting all of the Feast servers (the online server, offline server, registry server, and UI server) in TLS mode, as well as examples of Feast clients communicating with servers started in TLS mode.
We assume you have a basic understanding of Feast terminology before going through this tutorial; if you are new to Feast, we recommend going through the existing starter tutorials first.
In development mode, we can generate a self-signed certificate for testing. In an actual production environment, it is always recommended to obtain certificates from a trusted TLS certificate provider.
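For example, a self-signed certificate for localhost can be generated with openssl (adjust the subject and validity period to your needs):

```bash
openssl req -x509 -newkey rsa:2048 -sha256 -days 365 -nodes \
  -keyout key.pem -out cert.pem -subj "/CN=localhost"
```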
The above command will generate two files
key.pem
: certificate private key
cert.pem
: certificate public key
You can use the public and private keys generated by the above command in the rest of the sections of this tutorial.
Create a Feast repo and initialize it using the feast init and feast apply commands, and use this repo as a demo for the subsequent sections.
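For example (the repo name matches the directory used in the rest of this tutorial):

```bash
feast init feast_repo_ssl_demo
cd feast_repo_ssl_demo/feature_repo
feast apply
```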
You need to execute the Feast CLI commands from the feast_repo_ssl_demo/feature_repo directory created by the feast init command above.
To start the feature server in TLS mode, you need to provide the private and public keys using the --key
and --cert
arguments with the feast serve
command.
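For example:

```bash
feast serve --key /path/to/key.pem --cert /path/to/cert.pem
```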
You will see output similar to the example below. Note that the server URL starts with https.
Sometimes you may need to pass the self-signed public key to connect to the remote online server started in SSL mode if you have not added the public key to the certificate store.
Feast client example: the registry points to the registry of the remote feature store. If it is not directly accessible, the client should be configured to use the remote registry instead.
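A sketch of a client-side feature_store.yaml, assuming the feature server runs locally on its default port (6566); the project name, registry, and certificate paths are placeholders to adapt to your setup.

```yaml
project: feast_repo_ssl_demo
provider: local
# Point at the remote feature store's registry; if it is not directly
# accessible, configure a remote registry instead (see the registry section).
registry: data/registry.db
online_store:
  type: remote
  path: https://localhost:6566
  cert: /path/to/cert.pem
```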
cert is an optional configuration that points to the public certificate path when the online server starts in TLS (SSL) mode. Typically, this file ends with *.crt, *.cer, or *.pem.
To start the feature server in TLS mode, you need to provide the private and public keys using the --key
and --cert
arguments with the feast serve_registry
command.
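For example:

```bash
feast serve_registry --key /path/to/key.pem --cert /path/to/cert.pem
```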
You will see output similar to the example below. Note that the server URL starts with https.
Sometimes you may need to pass the self-signed public key to connect to the remote registry server started in SSL mode if you have not added the public key to the certificate store.
feast client example:
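A sketch of a client-side feature_store.yaml, assuming the registry server runs locally on its default port (6570); paths and values are placeholders.

```yaml
project: feast_repo_ssl_demo
provider: local
registry:
  registry_type: remote
  path: localhost:6570
  cert: /path/to/cert.pem
online_store:
  type: sqlite
  path: data/online_store.db
```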
cert is an optional configuration that points to the public certificate path when the registry server starts in TLS (SSL) mode. Typically, this file ends with *.crt, *.cer, or *.pem.
To start the offline server in TLS mode, you need to provide the private and public keys using the --key
and --cert
arguments with the feast serve_offline
command.
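For example:

```bash
feast serve_offline --key /path/to/key.pem --cert /path/to/cert.pem
```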
You will see output similar to the example below. Note that the server URL starts with https.
Sometimes you may need to pass the self-signed public key to connect to the remote offline server started in SSL mode if you have not added the public key to the certificate store. You also have to set the scheme to https.
feast client example:
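A sketch of a client-side feature_store.yaml, assuming the offline server runs locally on its default port (8815); note the explicit https scheme and the certificate path, both of which are placeholders.

```yaml
project: feast_repo_ssl_demo
provider: local
registry: data/registry.db
offline_store:
  type: remote
  host: localhost
  port: 8815
  scheme: https
  cert: /path/to/cert.pem
```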
cert is an optional configuration that points to the public certificate path when the offline server starts in TLS (SSL) mode. Typically, this file ends with *.crt, *.cer, or *.pem. scheme should be https; by default it is http, so you have to explicitly configure it to https if you are planning to connect to a remote offline server started in TLS mode.
To start the feast UI server in TLS mode, you need to provide the private and public keys using the --key
and --cert
arguments with the feast ui
command.
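For example:

```bash
feast ui --key /path/to/key.pem --cert /path/to/cert.pem
```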
You will see output similar to the example below. Note that the server URL starts with https.
Let's examine the Feast codebase. This analysis is accurate as of Feast 0.23.
The Python SDK lives in sdk/python/feast
. The majority of Feast logic lives in these Python files:
The core Feast objects (entities, feature views, data sources, etc.) are defined in their respective Python files, such as entity.py
, feature_view.py
, and data_source.py
.
The FeatureStore
class is defined in feature_store.py
and the associated configuration object (the Python representation of the feature_store.yaml
file) are defined in repo_config.py
.
The CLI and other core feature store logic are defined in cli.py
and repo_operations.py
.
The type system that is used to manage conversion between Feast types and external typing systems is managed in type_map.py
.
The Python feature server (the server that is started through the feast serve
command) is defined in feature_server.py
.
There are also several important submodules:
infra/
contains all the infrastructure components, such as the provider, offline store, online store, batch materialization engine, and registry.
dqm/
covers data quality monitoring, such as the dataset profiler.
diff/
covers the logic for determining how to apply infrastructure changes upon feature repo changes (e.g. the output of feast plan
and feast apply
).
embedded_go/
covers the Go feature server.
ui/
contains the embedded Web UI, to be launched on the feast ui
command.
Of these submodules, infra/
is the most important. It contains the interfaces for the provider, offline store, online store, batch materialization engine, and registry, as well as all of their individual implementations.
The tests for the Python SDK are contained in sdk/python/tests
. For more details, see this overview of the test suite.
feast apply
Let's walk through how feast apply
works by tracking its execution across the codebase.
All CLI commands are in cli.py
. Most of these commands are backed by methods in repo_operations.py
. The feast apply
command triggers apply_total_command
, which then calls apply_total
in repo_operations.py
.
With a FeatureStore
object (from feature_store.py
) that is initialized based on the feature_store.yaml
in the current working directory, apply_total
first parses the feature repo with parse_repo
and then calls either FeatureStore.apply
or FeatureStore._apply_diffs
to apply those changes to the feature store.
Let's examine FeatureStore.apply
. It splits the objects based on class (e.g. Entity
, FeatureView
, etc.) and then calls the appropriate registry method to apply or delete the object. For example, it might call self._registry.apply_entity
to apply an entity. If the default file-based registry is used, this logic can be found in infra/registry/registry.py
.
Then the feature store must update its cloud infrastructure (e.g. online store tables) to match the new feature repo, so it calls Provider.update_infra
, which can be found in infra/provider.py
.
Assuming the provider is a built-in provider (e.g. one of the local, GCP, or AWS providers), it will call PassthroughProvider.update_infra
in infra/passthrough_provider.py
.
This delegates to the online store and batch materialization engine. For example, if the feature store is configured to use the Redis online store then the update
method from infra/online_stores/redis.py
will be called. And if the local materialization engine is configured then the update
method from infra/materialization/local_engine.py
will be called.
At this point, the feast apply
command is complete.
feast materialize
Let's walk through how feast materialize
works by tracking its execution across the codebase.
The feast materialize
command triggers materialize_command
in cli.py
, which then calls FeatureStore.materialize
from feature_store.py
.
This then calls Provider.materialize_single_feature_view
, which can be found in infra/provider.py
.
As with feast apply
, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.materialize_single_feature_view
will be called.
This delegates to the underlying batch materialization engine. Assuming that the local engine has been configured, LocalMaterializationEngine.materialize
from infra/materialization/local_engine.py
will be called.
Since materialization involves reading features from the offline store and writing them to the online store, the local engine will delegate to both the offline store and online store. Specifically, it will call OfflineStore.pull_latest_from_table_or_query
and OnlineStore.online_write_batch
. These two calls will be routed to the offline store and online store that have been configured.
get_historical_features
Let's walk through how get_historical_features
works by tracking its execution across the codebase.
We start with FeatureStore.get_historical_features
in feature_store.py
. This method does some internal preparation, and then delegates the actual execution to the underlying provider by calling Provider.get_historical_features
, which can be found in infra/provider.py
.
As with feast apply
, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.get_historical_features
will be called.
That call simply delegates to OfflineStore.get_historical_features
. So if the feature store is configured to use Snowflake as the offline store, SnowflakeOfflineStore.get_historical_features
will be executed.
The java/
directory contains the Java serving component. See here for more details on how the repo is structured.
The go/
directory contains the Go feature server. Most of the files here have logic to help with reading features from the online store. Within go/
, the internal/feast/
directory contains most of the core logic:
onlineserving/
covers the core serving logic.
model/
contains the implementations of the Feast objects (entity, feature view, etc.).
For example, entity.go
is the Go equivalent of entity.py
. It contains a very simple Go implementation of the entity object.
registry/
covers the registry.
Currently only the file-based registry is supported (the SQL-based registry is unsupported). Additionally, the file-based registry only supports a file-based registry store, not the GCS or S3 registry stores.
onlinestore/
covers the online stores (currently only Redis and SQLite are supported).
Feast uses protobuf to store serialized versions of the core Feast objects. The protobuf definitions are stored in protos/feast
.
The registry consists of the serialized representations of the Feast objects.
Typically, changes made to the Feast objects require changes to their corresponding protobuf representations. The usual best practices for making changes to protobufs should be followed to ensure backwards and forwards compatibility.
The ui/
directory contains the Web UI. See here for more details on the structure of the Web UI.
Feast uses an internal type system to provide guarantees on training and serving data. Feast currently supports eight primitive types - INT32
, INT64
, FLOAT32
, FLOAT64
, STRING
, BYTES
, BOOL
, and UNIX_TIMESTAMP
- and the corresponding array types. Null types are not supported, although the UNIX_TIMESTAMP
type is nullable. The type system is controlled by Value.proto
in protobuf and by types.py
in Python. Type conversion logic can be found in type_map.py
.
During feast apply
, Feast runs schema inference on the data sources underlying feature views. For example, if the schema
parameter is not specified for a feature view, Feast will examine the schema of the underlying data source to determine the event timestamp column, feature columns, and entity columns. Each of these columns must be associated with a Feast type, which requires conversion from the data source type system to the Feast type system.
The feature inference logic calls _infer_features_and_entities
.
_infer_features_and_entities
calls source_datatype_to_feast_value_type
.
source_datatype_to_feast_value_type
calls the appropriate method in type_map.py
. For example, if a SnowflakeSource
is being examined, snowflake_python_type_to_feast_value_type
from type_map.py
will be called.
Feast serves feature values as Value
proto objects, which have a type corresponding to Feast types. Thus Feast must materialize feature values into the online store as Value
proto objects.
The local materialization engine first pulls the latest historical features and converts them to a pyarrow table.
Then it calls _convert_arrow_to_proto
to convert the pyarrow table to proto format.
This calls python_values_to_proto_values
in type_map.py
to perform the type conversion.
The Feast type system is typically not necessary when retrieving historical features. A call to get_historical_features
will return a RetrievalJob
object, which allows the user to export the results to one of several possible locations: a Pandas dataframe, a pyarrow table, a data lake (e.g. S3 or GCS), or the offline store (e.g. a Snowflake table). In all of these cases, the type conversion is handled natively by the offline store. For example, a BigQuery query exposes a to_dataframe
method that will automatically convert the result to a dataframe, without requiring any conversions within Feast.
As mentioned above in the section on materialization, Feast persists feature values into the online store as Value
proto objects. A call to get_online_features
will return an OnlineResponse
object, which essentially wraps a bunch of Value
protos with some metadata. The OnlineResponse
object can then be converted into a Python dictionary, which calls feast_value_type_to_python_type
from type_map.py
, a utility that converts the Feast internal types to Python native types.
Feast makes adding support for a new offline store easy. Developers can simply implement the OfflineStore interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and Bigquery).
In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative for adding support for any new offline store.
The full working code for this guide can be found at feast-dev/feast-custom-offline-store-demo.
The process for using a custom offline store consists of 8 steps:
Defining an OfflineStore
class.
Defining an OfflineStoreConfig
class.
Defining a RetrievalJob
class for this offline store.
Defining a DataSource
class for the offline store
Referencing the OfflineStore
in a feature repo's feature_store.yaml
file.
Testing the OfflineStore
class.
Updating dependencies.
Adding documentation.
OfflineStore class names must end with the OfflineStore suffix!
New offline stores go in sdk/python/feast/infra/offline_stores/contrib/
.
Not guaranteed to implement all interface methods
Not guaranteed to be stable.
Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers.
To move an offline store plugin out of contrib, you need:
GitHub Actions (i.e. make test-python-integration) is set up to run all tests against the offline store, and they pass.
At least two contributors own the plugin (ideally tracked in our OWNERS
/ CODEOWNERS
file).
The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.
To fully implement the interface for the offline store, you will need to implement these methods:
pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.
get_historical_features
is invoked when reading values from the offline store using the FeatureStore.get_historical_features()
method. Typically, this method is used to retrieve features when training ML models.
(optional) offline_write_batch
is a method that supports directly pushing a pyarrow table to a feature view. Given a feature view with a specific schema, this function should write the pyarrow table to the batch source defined. More details about the push api can be found here. This method only needs implementation if you want to support the push api in your offline store.
(optional) pull_all_from_table_or_query
is a method that pulls all the data from an offline store from a specified start date to a specified end date. This method is only used for SavedDatasets as part of data quality monitoring validation.
(optional) write_logged_features
is a method that takes a pyarrow table or a path that points to a parquet file and writes the data to a defined source defined by LoggingSource
and LoggingConfig
. This method is only used internally for SavedDatasets.
Most offline stores will have to perform some custom mapping of offline store datatypes to feast value types.
The functions to implement here are source_datatype_to_feast_value_type
and get_column_names_and_types
in your DataSource
class.
source_datatype_to_feast_value_type
is used to convert your DataSource's datatypes to feast value types.
get_column_names_and_types
retrieves the column names and corresponding datasource types.
Add any helper functions for type conversion to sdk/python/feast/type_map.py
.
Be sure to implement correct type mapping so that Feast can process your feature columns without incorrect casting, which could potentially cause loss of information or incorrect data.
Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.
To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel
class, which is defined here.
The FeastConfigBaseModel
is a pydantic class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.
This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.
Additionally, the name of the config class must be the same as the OfflineStore class, with the Config
suffix.
An example of the config class for the custom file offline store:
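A sketch of what this might look like; the feast_custom_offline_store.file module path is illustrative, so see the demo repository for the full version.

```python
from typing import Literal

from feast.repo_config import FeastConfigBaseModel


class CustomFileOfflineStoreConfig(FeastConfigBaseModel):
    """Configuration for the example custom file offline store."""

    # Fully qualified class name of the corresponding OfflineStore class.
    type: Literal[
        "feast_custom_offline_store.file.CustomFileOfflineStore"
    ] = "feast_custom_offline_store.file.CustomFileOfflineStore"
```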
This configuration can be specified in the feature_store.yaml
as follows:
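For example, with the same illustrative module path:

```yaml
offline_store:
  type: feast_custom_offline_store.file.CustomFileOfflineStore
```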
This configuration information is available to the methods of the OfflineStore, via the config: RepoConfig
parameter which is passed into the methods of the OfflineStore interface, specifically at the config.offline_store
field of the config
parameter. This fields in the feature_store.yaml
should map directly to your OfflineStoreConfig
class that is detailed above in Section 2.
The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob
instance, which represents the execution of the actual query against the underlying store.
Custom offline stores may need to implement their own instances of the RetrievalJob
interface.
The RetrievalJob
interface exposes two methods - to_df
and to_arrow
. The expectation is for the retrieval job to be able to return the rows read from the offline store as a pandas DataFrame or as an Arrow table, respectively.
Users who want to have their offline store support scalable batch materialization for online use cases (detailed in this RFC) will also need to implement to_remote_storage
to distribute the reading and writing of offline store records to blob storage (such as S3). This may be used by a custom Materialization Engine to parallelize the materialization of data by processing it in chunks. If this is not implemented, Feast will default to local materialization (pulling all records into memory to materialize).
Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource
base class needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.
The data source class should implement two methods - from_proto
, and to_proto
.
For custom offline stores that are not being implemented in the main feature repo, the custom_options
field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto
method and reading the value back from bytes in the from_proto
method.
After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml
file, specifically in the offline_store
field. The value specified should be the fully qualified class name of the OfflineStore.
As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.
To use our custom file offline store, we can use the following feature_store.yaml
:
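A sketch of the full file, again with an illustrative module path:

```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: feast_custom_offline_store.file.CustomFileOfflineStore
online_store:
  type: sqlite
  path: data/online_store.db
```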
If additional configuration for the offline store is not required, then we can omit the other fields and only specify the type
of the offline store class as the value for the offline_store
.
Finally, the custom data source class can be used in the feature repo to define a data source, and referred to in a feature view definition.
Even if you have created the OfflineStore
class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo.
In order to test against the test suite, you need to create a custom DataSourceCreator
that implements our testing infrastructure methods: create_data_source
and optionally, created_saved_dataset_destination
.
create_data_source
should create a datasource based on the dataframe passed in. It may be implemented by uploading the contents of the dataframe into the offline store and returning a datasource object pointing to that location. See BigQueryDataSourceCreator
for an implementation of a data source creator.
created_saved_dataset_destination
is invoked when users need to save the dataset for use in data validation. This functionality is still in alpha and is optional.
Make sure that your offline store doesn't break any unit tests first by running:
Next, set up your offline store to run the universal integration tests. These are integration tests specifically intended to test offline and online stores against Feast API functionality, to ensure that the Feast APIs work with your offline store.
Feast parametrizes integration tests using the FULL_REPO_CONFIGS
variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py
which stores different offline store classes for testing.
To overwrite the default configurations to use your own offline store, you can simply create your own file that contains a FULL_REPO_CONFIGS
dictionary, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE
to point to that file. The module should add new IntegrationTestRepoConfig
classes to the AVAILABLE_OFFLINE_STORES
by defining an offline store that you would like Feast to test with.
A sample FULL_REPO_CONFIGS_MODULE
looks something like this:
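A sketch of such a module, assuming a hypothetical CustomFileDataSourceCreator for the custom file offline store; the import path for IntegrationTestRepoConfig reflects the layout of the Feast test suite and may differ between versions.

```python
from feast_custom_offline_store.file import CustomFileDataSourceCreator
from tests.integration.feature_repos.integration_test_repo_config import (
    IntegrationTestRepoConfig,
)

FULL_REPO_CONFIGS = [
    IntegrationTestRepoConfig(
        provider="local",
        offline_store_creator=CustomFileDataSourceCreator,
        online_store="sqlite",
    ),
]
```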
You should set the FULL_REPO_CONFIGS_MODULE environment variable to this module and run the integration tests against your offline store. In the example repo, the file that overwrites FULL_REPO_CONFIGS
is feast_custom_offline_store/feast_tests.py
, so you would run:
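Something along these lines (the exact make targets may vary with your Feast version):

```bash
export FULL_REPO_CONFIGS_MODULE='feast_custom_offline_store.feast_tests'
make test-python-universal
```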
If the integration tests fail, this indicates that there is a mistake in the implementation of this offline store!
Remember to add your datasource to repo_config.py
similar to how we added spark
, trino
, etc, to the dictionary OFFLINE_STORE_CLASS_FOR_TYPE
. This will allow Feast to load your class from the feature_store.yaml
.
Finally, add a Makefile target to the Makefile to run your datastore specific tests by setting the FULL_REPO_CONFIGS_MODULE
and PYTEST_PLUGINS
environment variable. The PYTEST_PLUGINS
environment variable allows pytest to load in the DataSourceCreator
for your datasource. You can remove certain tests that are not relevant or still do not work for your datastore using the -k
option.
Add any dependencies for your offline store to our sdk/python/setup.py
under a new <OFFLINE_STORE>_REQUIRED
list with the packages and add it to the setup script so that if your offline store is needed, users can install the necessary python packages. These packages should be defined as extras so that they are not installed by users by default. You will need to regenerate our requirements files. To do this, create separate pyenv environments for python 3.8, 3.9, and 3.10. In each environment, run the following commands:
Remember to add documentation for your offline store.
Add a new markdown file to docs/reference/offline-stores/
and docs/reference/data-sources/
. Use these files to document your offline store functionality similar to how the other offline stores are documented.
You should also add a reference in docs/reference/data-sources/README.md
and docs/SUMMARY.md
to these markdown files.
NOTE: Be sure to document the following things about your offline store:
How to create the datasource and what configuration is needed in the feature_store.yaml
file in order to create the datasource.
Make sure to flag that the datasource is in alpha development.
Add some documentation on what the data model is for the specific offline store for more clarity.
Finally, generate the python code docs by running:
Feast is highly pluggable and configurable:
One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built-in options. See the reference documentation for details.
The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them.
Below are some guides on how to add new custom components:
Please see Data Source for a conceptual explanation of data sources.
BigQuery data sources are BigQuery tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.
Using a table reference:
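For example (the project, dataset, and table names are placeholders, and parameter names follow recent Feast versions):

```python
from feast import BigQuerySource

driver_stats_source = BigQuerySource(
    name="driver_hourly_stats",
    table="my_project.my_dataset.driver_hourly_stats",
    timestamp_field="event_timestamp",
)
```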
Using a query:
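And similarly with a SQL query (again with placeholder names):

```python
from feast import BigQuerySource

driver_stats_source = BigQuerySource(
    name="driver_hourly_stats",
    query="SELECT * FROM my_project.my_dataset.driver_hourly_stats",
    timestamp_field="event_timestamp",
)
```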
The full set of configuration options is available in the reference documentation.
In Feast, each batch data source is associated with corresponding offline stores. For example, a SnowflakeSource
can only be processed by the Snowflake offline store, while a FileSource
can be processed by both File and DuckDB offline stores. Otherwise, the primary difference between batch data sources is the set of supported types. Feast has an internal type system, and aims to support eight primitive types (bytes
, string
, int32
, int64
, float32
, float64
, bool
, and timestamp
) along with the corresponding array types. However, not every batch data source supports all of these types.
For more details on the Feast type system, see the section on the Feast type system.
There are currently four core batch data source implementations: FileSource
, BigQuerySource
, SnowflakeSource
, and RedshiftSource
. There are several additional implementations contributed by the Feast community (PostgreSQLSource
, SparkSource
, and TrinoSource
), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific data source can be found in its reference documentation.
Below is a matrix indicating which data sources support which types.
| | File | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bytes | yes | yes | yes | yes | yes | yes | yes |
| string | yes | yes | yes | yes | yes | yes | yes |
| int32 | yes | yes | yes | yes | yes | yes | yes |
| int64 | yes | yes | yes | yes | yes | yes | yes |
| float32 | yes | yes | yes | yes | yes | yes | yes |
| float64 | yes | yes | yes | yes | yes | yes | yes |
| bool | yes | yes | yes | yes | yes | yes | yes |
| timestamp | yes | yes | yes | yes | yes | yes | yes |
| array types | yes | yes | yes | no | yes | yes | no |
File data sources are files on disk or on S3. Currently only Parquet and Delta formats are supported.
The full set of configuration options is available in the reference documentation.
File data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see the type support matrix above.
Redshift data sources are Redshift tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.
Using a table name:
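For example (the table and column names are placeholders, and parameter names follow recent Feast versions; the database and schema come from the Redshift offline store configuration unless specified):

```python
from feast import RedshiftSource

driver_stats_source = RedshiftSource(
    name="driver_hourly_stats",
    table="driver_hourly_stats",
    timestamp_field="event_timestamp",
)
```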
Using a query:
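And similarly with a SQL query (again with placeholder names):

```python
from feast import RedshiftSource

driver_stats_source = RedshiftSource(
    name="driver_hourly_stats",
    query="SELECT * FROM my_schema.driver_hourly_stats",
    timestamp_field="event_timestamp",
)
```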
The full set of configuration options is available in the reference documentation.
Push sources allow feature values to be pushed to the online store and offline store in real time. This allows fresh feature values to be made available to applications. Push sources supersede writing feature values directly to the online store.
Push sources can be used by multiple feature views. When data is pushed to a push source, Feast propagates the feature values to all the consuming feature views.
Push sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for pushing data to a batch data source such as a data warehouse table. When using a push source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.
Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:
Raw events come in (stream 1)
Streaming transformations applied (e.g. generating features like last_N_purchased_categories
) (stream 2)
Write stream 2 values to an offline store as a historical log for training (optional)
Write stream 2 values to an online store for low latency feature serving
Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance
Feast allows users to push features previously registered in a feature view to the online store for fresher features. It also allows users to push batches of stream data to the offline store by specifying that the push be directed to the offline store. This will push the data to the offline store declared in the repository configuration used to initialize the feature store.
Note that the push schema needs to also include the entity.
Note that the to
parameter is optional and defaults to online but we can specify these options: PushMode.ONLINE
, PushMode.OFFLINE
, or PushMode.ONLINE_AND_OFFLINE
.
The default option to write features from a stream is to add the Python SDK into your existing PySpark pipeline.
BigQuery data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see the type support matrix above.
Be careful about how Snowflake handles table and column name conventions. In particular, you can read more about quoted identifiers in the Snowflake documentation.
The full set of configuration options is available in the reference documentation.
Snowflake data sources support all eight primitive types. Array types are also supported, but not with type inference. For a comparison against other batch data sources, please see the type support matrix above.
Redshift data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see the type support matrix above.
See the feature server documentation for instructions on how to push data to a deployed feature server.
This can also be used under the hood by a contrib stream processor.
PostgreSQL data sources are PostgreSQL tables or views. These can be specified either by a table reference or a SQL query.
The PostgreSQL data source does not achieve full test coverage. Please do not assume complete stability.
Defining a Postgres source:
The full set of configuration options is available here.
PostgreSQL data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.
Spark data sources are tables or files that can be loaded from some Spark store (e.g. Hive or in-memory). They can also be specified by a SQL query.
The Spark data source does not achieve full test coverage. Please do not assume complete stability.
Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):
Using a query:
Using a file reference:
The full set of configuration options is available here.
Spark data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.
Trino data sources are Trino tables or views. These can be specified either by a table reference or a SQL query.
The Trino data source does not achieve full test coverage. Please do not assume complete stability.
Defining a Trino source:
The full set of configuration options is available here.
Trino data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.
Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.
Kafka sources allow users to register Kafka streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kafka. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store through FeatureStore.write_to_online_store. An example of how to launch such a job with Spark can be found here. Feast also provides functionality to write to the offline store using the write_to_offline_store
method.
Kafka sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kafka streams to a batch data source such as a data warehouse table. When using a Kafka source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.
Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:
Raw events come in (stream 1)
Streaming transformations applied (e.g. generating features like last_N_purchased_categories
) (stream 2)
Write stream 2 values to an offline store as a historical log for training (optional)
Write stream 2 values to an online store for low latency feature serving
Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance
Note that the Kafka source has a batch source.
The Kafka source can be used in a stream feature view.
See here for an example of how to ingest data from a Kafka source into Feast.
Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.
Kinesis sources allow users to register Kinesis streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kinesis. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store through FeatureStore.write_to_online_store. An example of how to launch such a job with Spark to ingest from Kafka can be found here; by using a different plugin, the example can be adapted to Kinesis. Feast also provides functionality to write to the offline store using the write_to_offline_store
method.
Kinesis sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kinesis streams to a batch data source such as a data warehouse table. When using a Kinesis source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.
Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:
Raw events come in (stream 1)
Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)
Write stream 2 values to an offline store as a historical log for training (optional)
Write stream 2 values to an online store for low latency feature serving
Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance
Note that the Kinesis source has a batch source.
The Kinesis source can be used in a stream feature view.
See here for an example of how to ingest data from a Kafka source into Feast. The approach used in the tutorial can be easily adapted to work for Kinesis as well.
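For illustration only, a Kinesis registration might look roughly like the sketch below; the region, stream name, and schema are hypothetical, and the exact parameter names (e.g. record_format, stream_name) are assumptions that should be verified against the KinesisSource reference.

```python
# Hedged sketch: a Kinesis stream source backed by a batch source.
from feast import FileSource, KinesisSource
from feast.data_format import JsonFormat

driver_stats_batch = FileSource(
    name="driver_stats_batch",
    path="data/driver_stats.parquet",       # hypothetical path
    timestamp_field="event_timestamp",
)

driver_stats_stream = KinesisSource(
    name="driver_stats_stream",
    region="us-west-2",                     # hypothetical region
    stream_name="driver_stats",             # hypothetical stream name
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch,
    record_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double"
    ),
)
```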
Here are the methods exposed by the OfflineStore interface, along with the core functionality supported by each method:
get_historical_features: point-in-time correct join to retrieve historical features
pull_latest_from_table_or_query: retrieve latest feature values for materialization into the online store
pull_all_from_table_or_query: retrieve a saved dataset
offline_write_batch: persist dataframes to the offline store, primarily for push sources
write_logged_features: persist logged features to the offline store, for feature logging
The first three of these methods all return a RetrievalJob specific to an offline store, such as a SnowflakeRetrievalJob. Here is a list of the functionality supported by RetrievalJobs (a short usage sketch follows the list):
export to dataframe
export to arrow table
export to arrow batches (to handle large datasets in memory)
export to SQL
export to data lake (S3, GCS, etc.)
export to data warehouse
export as Spark dataframe
local execution of Python-based on-demand transforms
remote execution of Python-based on-demand transforms
persist results in the offline store
preview the query plan before execution (RetrievalJobs are lazily executed)
read partitioned data
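As a minimal usage sketch (assuming a local feature repository and a hypothetical driver_hourly_stats feature view), a RetrievalJob is obtained from get_historical_features and only executes when one of its export methods is called:

```python
# Illustrative only: obtain a RetrievalJob and export it in a couple of ways.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    }
)

job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],  # hypothetical feature reference
)

training_df = job.to_df()     # export to a Pandas dataframe
arrow_table = job.to_arrow()  # export to an Arrow table
# Depending on the offline store, a job may also support to_sql(), to_spark_df(),
# or persist(...) to write results back to the offline store.
```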
There are currently four core offline store implementations: DaskOfflineStore, BigQueryOfflineStore, SnowflakeOfflineStore, and RedshiftOfflineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOfflineStore, SparkOfflineStore, and TrinoOfflineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific offline store, such as how to configure it in a feature_store.yaml, can be found here.
Below is a matrix indicating which offline stores support which methods.
Below is a matrix indicating which RetrievalJobs support which functionality.
Please see Offline Store for a conceptual explanation of offline stores.
The Dask offline store provides support for reading FileSources.
All data is downloaded and joined using Python and therefore may not scale to production workloads.
The full set of configuration options is available in DaskOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Dask offline store.
Dask | |
---|---|
Below is a matrix indicating which functionality is supported by DaskRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The BigQuery offline store provides support for reading BigQuerySources.
All joins happen within BigQuery.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to BigQuery as a table (marked for expiration) in order to complete join operations.
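For illustration (project, dataset, and feature names are hypothetical), an entity dataframe expressed as a SQL query keeps the join entirely inside BigQuery:

```python
# Illustrative sketch: passing a SQL query as the entity dataframe.
from feast import FeatureStore

store = FeatureStore(repo_path=".")

job = store.get_historical_features(
    entity_df="""
        SELECT driver_id, event_timestamp
        FROM `my_project.my_dataset.entity_rows`
    """,
    features=["driver_hourly_stats:conv_rate"],  # hypothetical feature reference
)
training_df = job.to_df()  # the point-in-time join executes inside BigQuery
```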
In order to use this offline store, you'll need to run pip install 'feast[gcp]'. You can get started by then running feast init -t gcp.
The full set of configuration options is available in BigQueryOfflineStoreConfig.
Below is a matrix indicating which functionality is supported by BigQueryRetrievalJob.
The PostgreSQL offline store provides support for reading PostgreSQLSources.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Postgres as a table in order to complete join operations.
The PostgreSQL offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[postgres]'. You can get started by then running feast init -t postgres.
Below is a matrix indicating which functionality is supported by PostgreSQLRetrievalJob.
The DuckDB offline store provides support for reading FileSources. It can read both Parquet and Delta formats. The DuckDB offline store uses Ibis under the hood to convert offline store operations to DuckDB queries.
Entity dataframes can be provided as a Pandas dataframe.
In order to use this offline store, you'll need to run pip install 'feast[duckdb]'.
Below is a matrix indicating which functionality is supported by IbisRetrievalJob.
The Snowflake offline store provides support for reading SnowflakeSources.
All joins happen within Snowflake.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Snowflake as a temporary table in order to complete join operations.
In order to use this offline store, you'll need to run pip install 'feast[snowflake]'.
If you're using a file-based registry, then you'll also need to install the relevant cloud extra (pip install 'feast[snowflake, CLOUD]' where CLOUD is one of aws, gcp, or azure).
You can get started by then running feast init -t snowflake.
Please be aware that there is a restriction/limitation when using SQL query strings in Feast with Snowflake: avoid using single quotes in the SQL query string. For example, a query string like the following will fail:
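As an illustration (the table and literal below are hypothetical), a single-quoted literal in an entity SQL string is the kind of query that can fail, while Snowflake's dollar-quoted literals avoid the problem:

```python
# Illustrative only: single quotes in a SQL entity dataframe may break on Snowflake.
failing_entity_sql = """
    SELECT driver_id, event_timestamp
    FROM my_db.my_schema.entity_rows
    WHERE category = 'value'        -- single-quoted literal: may fail
"""

# Preferred: use pairs of dollar signs for string literals instead.
working_entity_sql = """
    SELECT driver_id, event_timestamp
    FROM my_db.my_schema.entity_rows
    WHERE category = $$value$$      -- dollar-quoted literal
"""
```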
Below is a matrix indicating which functionality is supported by SnowflakeRetrievalJob.
The Redshift offline store provides support for reading RedshiftSources.
All joins happen within Redshift.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Redshift temporarily in order to complete join operations.
In order to use this offline store, you'll need to run pip install 'feast[aws]'. You can get started by then running feast init -t aws.
Below is a matrix indicating which functionality is supported by RedshiftRetrievalJob.
Feast requires the following permissions in order to execute commands for the Redshift offline store:
The following inline policy can be used to grant Feast the necessary permissions:
The following inline policy can be used to grant Redshift necessary permissions to access S3:
The following trust relationship is necessary to make sure that Redshift, and only Redshift, can assume this role:
The Spark offline store provides support for reading SparkSources.
Entity dataframes can be provided as a SQL query, a Pandas dataframe, or a PySpark dataframe. A Pandas dataframe will be converted to a Spark dataframe and processed as a temporary view.
The Spark offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[spark]'. You can get started by then running feast init -t spark.
Below is a matrix indicating which functionality is supported by SparkRetrievalJob.
Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | |
---|---|---|---|---|---|---|---|
Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | DuckDB | |
---|---|---|---|---|---|---|---|---|
Dask | |
---|---|
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the BigQuery offline store.
BigQuery |
---|
BigQuery |
---|
*See here for details on proposed solutions for enabling the BigQuery offline store to understand tables that use _PARTITIONTIME as the partition column.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
Note that sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional parameters. The full set of configuration options is available in PostgreSQLOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the PostgreSQL offline store.
Postgres |
---|
Postgres |
---|
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the DuckDB offline store.
DuckDB |
---|
DuckDB |
---|
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The full set of configuration options is available in DuckDBOfflineStoreConfig.
That 'value' will fail in Snowflake. Instead, please use pairs of dollar signs like $$value$$ as described here.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Snowflake offline store.
Snowflake |
---|
Snowflake |
---|
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The full set of configuration options is available in SnowflakeOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redshift offline store.
Redshift |
---|
Redshift |
---|
To compare this set of functionality against other offline stores, please see the full functionality matrix.
In addition to this, the Redshift offline store requires an IAM role that will be used by Redshift itself to interact with S3. More concretely, Redshift has to use this IAM role to run UNLOAD and COPY commands. Once created, this IAM role needs to be configured in the feature_store.yaml file as offline_store: iam_role.
In order to use AWS Redshift Serverless, specify a workgroup instead of a cluster_id and user.
Please note that the IAM policies above will need the redshift-serverless version of the actions and resources, rather than the standard redshift ones.
The full set of configuration options is available in SparkOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Spark offline store.
Spark |
---|
Spark |
---|
To compare this set of functionality against other offline stores, please see the full functionality matrix.
Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | |
---|---|---|---|---|---|---|---|
get_historical_features | yes | yes | yes | yes | yes | yes | yes |
pull_latest_from_table_or_query | yes | yes | yes | yes | yes | yes | yes |
pull_all_from_table_or_query | yes | yes | yes | yes | yes | yes | yes |
offline_write_batch | yes | yes | yes | yes | no | no | no |
write_logged_features | yes | yes | yes | yes | no | no | no |
Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | DuckDB | |
---|---|---|---|---|---|---|---|---|
export to dataframe | yes | yes | yes | yes | yes | yes | yes | yes |
export to arrow table | yes | yes | yes | yes | yes | yes | yes | yes |
export to arrow batches | no | no | no | yes | no | no | no | no |
export to SQL | no | yes | yes | yes | yes | no | yes | no |
export to data lake (S3, GCS, etc.) | no | no | yes | no | yes | no | no | no |
export to data warehouse | no | yes | yes | yes | yes | no | no | no |
export as Spark dataframe | no | no | yes | no | no | yes | no | no |
local execution of Python-based on-demand transforms | yes | yes | yes | yes | yes | no | yes | yes |
remote execution of Python-based on-demand transforms | no | no | no | no | no | no | no | no |
persist results in the offline store | yes | yes | yes | yes | yes | yes | no | yes |
preview the query plan before execution | yes | yes | yes | yes | yes | yes | yes | no |
read partitioned data | yes | yes | yes | yes | yes | yes | yes | yes |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | yes |
write_logged_features (persist logged features to offline store) | yes |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | no |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | no |
export as Spark dataframe | no |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | yes |
read partitioned data | yes |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | yes |
write_logged_features (persist logged features to offline store) | yes |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | yes |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | yes |
export as Spark dataframe | no |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | yes |
read partitioned data* | partial |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | no |
write_logged_features (persist logged features to offline store) | no |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | yes |
export to data lake (S3, GCS, etc.) | yes |
export to data warehouse | yes |
export as Spark dataframe | no |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | yes |
read partitioned data | yes |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | yes |
write_logged_features (persist logged features to offline store) | yes |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | no |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | no |
export as Spark dataframe | no |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | no |
read partitioned data | yes |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | yes |
write_logged_features (persist logged features to offline store) | yes |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | yes |
export to SQL | yes |
export to data lake (S3, GCS, etc.) | yes |
export to data warehouse | yes |
export as Spark dataframe | yes |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | yes |
read partitioned data | yes |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | yes |
write_logged_features (persist logged features to offline store) | yes |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | yes |
export to SQL | yes |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | yes |
export as Spark dataframe | no |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | yes |
read partitioned data | yes |
Command | Permissions | Resources |
Apply | redshift-data:DescribeTable redshift:GetClusterCredentials | arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username> arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name> arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
Materialize | redshift-data:ExecuteStatement | arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
Materialize | redshift-data:DescribeStatement | * |
Materialize | s3:ListBucket s3:GetObject s3:DeleteObject | arn:aws:s3:::<bucket_name> arn:aws:s3:::<bucket_name>/* |
Get Historical Features | redshift-data:ExecuteStatement redshift:GetClusterCredentials | arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username> arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name> arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
Get Historical Features | redshift-data:DescribeStatement | * |
Get Historical Features | s3:ListBucket s3:GetObject s3:PutObject s3:DeleteObject | arn:aws:s3:::<bucket_name> arn:aws:s3:::<bucket_name>/* |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | no |
write_logged_features (persist logged features to offline store) | no |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | no |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | no |
export as Spark dataframe | yes |
local execution of Python-based on-demand transforms | no |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
preview the query plan before execution | yes |
read partitioned data | yes |
The Remote Offline Store is an Arrow Flight client for the offline store that implements the RemoteOfflineStore class using the existing OfflineStore interface. The client implements various methods, including get_historical_features, pull_latest_from_table_or_query, write_logged_features, and offline_write_batch.
Users need to create a client-side feature_store.yaml file, set the offline_store type to remote, and provide the server connection configuration, including the host and the port (default 8815) required by the Arrow Flight client to connect to the Arrow Flight server.
The complete example can be found under remote-offline-store-example.
Please see offline-feature-server.md for details on how to configure the offline feature server.
Please refer to this page for more details on how to configure authentication and authorization.
The list below contains the functionality that contributors are planning to develop for Feast.
We welcome contributions to all items in the roadmap!
Data Sources
Offline Stores
Online Stores
Feature Engineering
Streaming
Deployments
Feature Serving
Data Quality Management (See RFC)
Feature Discovery and Governance
Natural Language Processing
The Trino offline store provides support for reading TrinoSources.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Trino as a table in order to complete join operations.
The Trino offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[trino]'. You can then run feast init, then swap out feature_store.yaml with the below example to connect to Trino.
The full set of configuration options is available in TrinoOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Trino offline store.
Below is a matrix indicating which functionality is supported by TrinoRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The MsSQL offline store provides support for reading MsSQL Sources. Specifically, it is developed to read from Synapse SQL on Microsoft Azure.
Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe.
In order to use this offline store, you'll need to run pip install 'feast[azure]'. You can get started by then following this tutorial.
The MsSQL offline store does not achieve full test coverage. Please do not assume complete stability.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the MsSQL offline store.
Below is a matrix indicating which functionality is supported by MsSqlServerRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
Here are the methods exposed by the OnlineStore interface, along with the core functionality supported by each method (a short SDK-level sketch follows this list):
online_write_batch: write feature values to the online store
online_read: read feature values from the online store
update: update infrastructure (e.g. tables) in the online store
teardown: teardown infrastructure (e.g. tables) in the online store
plan: generate a plan of infrastructure changes based on feature repo changes
There is also additional functionality not properly captured by these interface methods:
support for on-demand transforms
readable by Python SDK
readable by Java
readable by Go
support for entityless feature views
support for concurrent writing to the same key
support for ttl (time to live) at retrieval
support for deleting expired data
Finally, there are multiple data models for storing the features in the online store. For example, features could be:
collocated by feature view
collocated by feature service
collocated by entity key
See this issue for a discussion around the tradeoffs of each of these data models.
There are currently five core online store implementations: SqliteOnlineStore, RedisOnlineStore, DynamoDBOnlineStore, SnowflakeOnlineStore, and DatastoreOnlineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOnlineStore, HbaseOnlineStore, CassandraOnlineStore, and IKVOnlineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific online store, such as how to configure it in a feature_store.yaml, can be found here.
Below is a matrix indicating which online stores support what functionality.
Dragonfly is a modern in-memory datastore that implements novel algorithms and data structures on top of a multi-threaded, share-nothing architecture. Thanks to its API compatibility, Dragonfly can act as a drop-in replacement for Redis. Due to Dragonfly's hardware efficiency, you can run a single node instance on a small 8GB instance or scale vertically to large 768GB machines with 64 cores. This greatly reduces infrastructure costs as well as architectural complexity.
Similar to Redis, Dragonfly can be used as an online feature store for Feast.
Make sure you have Python and pip installed.
Install the Feast SDK and CLI
pip install feast
In order to use Dragonfly as the online store, you'll need to install the redis extra:
pip install 'feast[redis]'
Bootstrap a new feature repository:
Update feature_repo/feature_store.yaml with the below contents:
There are several options available to get Dragonfly up and running quickly. We will be using Docker for this tutorial.
docker run --network=host --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly
feast apply
The apply command scans Python files in the current directory (example_repo.py in this case) for feature view/entity definitions, registers the objects, and deploys infrastructure. You should see the following output:
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redis online store.
To compare this set of functionality against other online stores, please see the full functionality matrix.
The Redis online store provides support for materializing feature values into Redis.
Both Redis and Redis Cluster are supported.
The data model used to store feature values in Redis is described in more detail here.
In order to use this online store, you'll need to install the redis extra (along with the dependency needed for the offline store of choice). E.g.
pip install 'feast[gcp, redis]'
pip install 'feast[snowflake, redis]'
pip install 'feast[aws, redis]'
pip install 'feast[azure, redis]'
You can get started by using any of the other templates (e.g. feast init -t gcp, feast init -t snowflake, or feast init -t aws), and then swapping in Redis as the online store as seen below in the examples.
Connecting to a single Redis instance:
Connecting to a Redis Cluster with SSL enabled and password authentication:
Connecting to a Redis Sentinel with SSL enabled and password authentication:
Additionally, the Redis online store supports automatically deleting data via a TTL mechanism. The TTL is applied at the entity level, so feature values from any associated feature views for an entity are removed together. This TTL can be set in the feature_store.yaml using the key_ttl_seconds field of the online store configuration.
The full set of configuration options is available in RedisOnlineStoreConfig.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redis online store.
To compare this set of functionality against other online stores, please see the full functionality matrix.
The Snowflake online store provides support for materializing feature values into a Snowflake Transient Table for serving online features.
Only the latest feature values are persisted
The data model for using a Snowflake Transient Table as an online store follows a tall format (one row per feature):
"entity_feature_key" (BINARY) -- unique key used when reading specific feature_view x entity combination
"entity_key" (BINARY) -- repeated key currently unused for reading entity_combination
"feature_name" (VARCHAR)
"value" (BINARY)
"event_ts" (TIMESTAMP)
"created_ts" (TIMESTAMP)
(This model may be subject to change when Snowflake Hybrid Tables are released)
In order to use this online store, you'll need to run pip install 'feast[snowflake]'. You can then get started with the command feast init REPO_NAME -t snowflake.
"snowflake-online-store/online_path": Adding the "snowflake-online-store/online_path" key to a FeatureView tags parameter allows you to choose the online table path for the online serving table (ex. "{database}"."{schema}").
The full set of configuration options is available in SnowflakeOnlineStoreConfig.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Snowflake online store.
To compare this set of functionality against other online stores, please see the full functionality matrix.
The Datastore online store provides support for materializing feature values into Cloud Datastore. The data model used to store feature values in Datastore is described in more detail here.
In order to use this online store, you'll need to run pip install 'feast[gcp]'. You can then get started with the command feast init REPO_NAME -t gcp.
The full set of configuration options is available in DatastoreOnlineStoreConfig.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Datastore online store.
To compare this set of functionality against other online stores, please see the full functionality matrix.
IKV is a fully-managed embedded key-value store, primarily designed for storing ML features. Most key-value stores (think Redis or Cassandra) need a remote database cluster, whereas IKV allows you to utilize your existing application infrastructure to store data (cost efficient) and access it without any network calls (better performance). See detailed performance benchmarks and cost comparison with Redis on https://inlined.io. IKV can be used as an online-store in Feast, the rest of this guide goes over the setup.
Make sure you have Python and pip installed.
Install the Feast SDK and CLI: pip install feast
In order to use this online store, you'll need to install the IKV extra (along with the dependency needed for the offline store of choice). E.g.
pip install 'feast[gcp, ikv]'
pip install 'feast[snowflake, ikv]'
pip install 'feast[aws, ikv]'
pip install 'feast[azure, ikv]'
You can get started by using any of the other templates (e.g. feast init -t gcp, feast init -t snowflake, or feast init -t aws), and then swapping in IKV as the online store as seen below in the examples.
Go to https://inlined.io or email onboarding[at]inlined.io
Update my_feature_repo/feature_store.yaml with the below contents:
After provisioning an IKV account/store, you should have an account id, passkey and store-name. Additionally you must specify a mount-directory - where IKV will pull/update (maintain) a copy of the index for online reads (IKV is an embedded database). It can be skipped only if you don't plan to read any data from this container. The mount directory path usually points to a location on local/remote disk.
The full set of configuration options is available in IKVOnlineStoreConfig at sdk/python/feast/infra/online_stores/contrib/ikv_online_store/ikv.py
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the IKV online store.
To compare this set of functionality against other online stores, please see the full functionality matrix.
The SQLite online store provides support for materializing feature values into an SQLite database for serving online features.
All feature values are stored in an on-disk SQLite database
Only the latest feature values are persisted
The full set of configuration options is available in SqliteOnlineStoreConfig.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Sqlite online store.
Sqlite | |
---|---|
To compare this set of functionality against other online stores, please see the full functionality matrix.
Please see Online Store for an explanation of online stores.
The remote online store lets you interact with a remote feature server. At the moment it only supports read operations; you can use this online store to retrieve online features via store.get_online_features from a remote feature server.
The registry points to the registry of the remote feature store. If it is not directly accessible, the client should be configured to use the remote registry.
cert is an optional configuration for the path to the public certificate when the online server starts in TLS (SSL) mode. This may be needed if the online server is started with a self-signed certificate; typically this file ends with *.crt, *.cer, or *.pem.
The Bigtable online store provides support for materializing feature values into Cloud Bigtable. The data model used to store feature values in Bigtable is described in more detail here.
In order to use this online store, you'll need to run pip install 'feast[gcp]'. You can then get started with the command feast init REPO_NAME -t gcp.
The full set of configuration options is available in BigtableOnlineStoreConfig.
The PostgreSQL online store provides support for materializing feature values into a PostgreSQL database for serving online features.
Only the latest feature values are persisted
sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional
In order to use this online store, you'll need to run pip install 'feast[postgres]'. You can get started by then running feast init -t postgres.
The vector_len parameter can be used to specify the length of the vector. The default value is 512.
Please make sure to follow the instructions in the pgvector repository, which, at the time of this writing, require you to run CREATE EXTENSION vector; in the database.
Then you can use retrieve_online_documents to retrieve the top k closest vectors to a query vector. For the Retrieval Augmented Generation (RAG) use-case, you have to embed the query prior to passing the query vector.
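A hedged sketch follows; the feature reference, embedding, and top_k are hypothetical, and the exact signature of retrieve_online_documents may vary between Feast versions.

```python
# Illustrative only: top-k vector retrieval for a RAG-style lookup.
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Embed the user query first; vector_len defaults to 512 in this configuration.
query_embedding = [0.1] * 512  # placeholder embedding

response = store.retrieve_online_documents(
    feature="document_embeddings:embedding",  # hypothetical feature reference
    query=query_embedding,
    top_k=5,
)
print(response.to_dict())
```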
The DynamoDB online store provides support for materializing feature values into AWS DynamoDB.
In order to use this online store, you'll need to run pip install 'feast[aws]'. You can then get started with the command feast init REPO_NAME -t aws.
The full set of configuration options is available in DynamoDBOnlineStoreConfig.
Feast requires the following permissions in order to execute commands for the DynamoDB online store:
The following inline policy can be used to grant Feast the necessary permissions:
The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store. You can read more about Feast projects in the Projects concept page.
For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.
For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline and online. We explore this more in the next concept page.
Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).
Feast supports several patterns of feature retrieval.
Use case | Example | API |
---|---|---|
Trino | |
---|---|
Trino | |
---|---|
MsSql | |
---|---|
MsSql | |
---|---|
Redis | |
---|---|
Redis | |
---|---|
Snowflake | |
---|---|
Datastore | |
---|---|
IKV | |
---|---|
Please refer to this page for more details on how to configure authentication and authorization.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Bigtable online store.
Bigtable |
---|
To compare this set of functionality against other online stores, please see the full functionality matrix.
The full set of configuration options is available in PostgreSQLOnlineStoreConfig.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the Postgres online store.
Postgres |
---|
To compare this set of functionality against other online stores, please see the full functionality matrix.
The Postgres online store supports the use of pgvector for storing feature values. To enable PGVector, set vector_enabled: true in the online store configuration.
Lastly, this IAM role needs to be associated with the desired Redshift cluster. Please follow the official AWS guide for the necessary steps.
The set of functionality supported by online stores is described in detail here. Below is a matrix indicating which functionality is supported by the DynamoDB online store.
DynamoDB |
---|
To compare this set of functionality against other online stores, please see the full functionality matrix.
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | no |
write_logged_features (persist logged features to offline store) | no |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | yes |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | no |
export as Spark dataframe | no |
local execution of Python-based on-demand transforms | yes |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | no |
preview the query plan before execution | yes |
read partitioned data | yes |
get_historical_features (point-in-time correct join) | yes |
pull_latest_from_table_or_query (retrieve latest feature values) | yes |
pull_all_from_table_or_query (retrieve a saved dataset) | yes |
offline_write_batch (persist dataframes to offline store) | no |
write_logged_features (persist logged features to offline store) | no |
export to dataframe | yes |
export to arrow table | yes |
export to arrow batches | no |
export to SQL | no |
export to data lake (S3, GCS, etc.) | no |
export to data warehouse | no |
local execution of Python-based on-demand transforms | no |
remote execution of Python-based on-demand transforms | no |
persist results in the offline store | yes |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | yes |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | yes |
support for entityless feature views | yes |
support for concurrent writing to the same key | no |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | yes |
collocated by feature service | no |
collocated by entity key | no |
Sqlite | Redis | DynamoDB | Snowflake | Datastore | Postgres | Hbase | Cassandra | IKV | |
---|---|---|---|---|---|---|---|---|---|
write feature values to the online store | yes | yes | yes | yes | yes | yes | yes | yes | yes |
read feature values from the online store | yes | yes | yes | yes | yes | yes | yes | yes | yes |
update infrastructure (e.g. tables) in the online store | yes | yes | yes | yes | yes | yes | yes | yes | yes |
teardown infrastructure (e.g. tables) in the online store | yes | yes | yes | yes | yes | yes | yes | yes | yes |
generate a plan of infrastructure changes | yes | no | no | no | no | no | no | yes | no |
support for on-demand transforms | yes | yes | yes | yes | yes | yes | yes | yes | yes |
readable by Python SDK | yes | yes | yes | yes | yes | yes | yes | yes | yes |
readable by Java | no | yes | no | no | no | no | no | no | no |
readable by Go | yes | yes | no | no | no | no | no | no | no |
support for entityless feature views | yes | yes | yes | yes | yes | yes | yes | yes | yes |
support for concurrent writing to the same key | no | yes | no | no | no | no | no | no | yes |
support for ttl (time to live) at retrieval | no | yes | no | no | no | no | no | no | no |
support for deleting expired data | no | yes | no | no | no | no | no | no | no |
collocated by feature view | yes | no | yes | yes | yes | yes | yes | yes | no |
collocated by feature service | no | no | no | no | no | no | no | no | no |
collocated by entity key | no | yes | no | no | no | no | no | no | yes |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | yes |
readable by Go | yes |
support for entityless feature views | yes |
support for concurrent writing to the same key | yes |
support for ttl (time to live) at retrieval | yes |
support for deleting expired data | yes |
collocated by feature view | no |
collocated by feature service | no |
collocated by entity key | yes |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | yes |
readable by Go | yes |
support for entityless feature views | yes |
support for concurrent writing to the same key | yes |
support for ttl (time to live) at retrieval | yes |
support for deleting expired data | yes |
collocated by feature view | no |
collocated by feature service | no |
collocated by entity key | yes |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | no |
support for entityless feature views | yes |
support for concurrent writing to the same key | no |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | yes |
collocated by feature service | no |
collocated by entity key | no |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | no |
support for entityless feature views | yes |
support for concurrent writing to the same key | no |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | yes |
collocated by feature service | no |
collocated by entity key | no |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | no |
support for entityless feature views | yes |
support for concurrent writing to the same key | yes |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | no |
collocated by feature service | no |
collocated by entity key | yes |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | no |
support for entityless feature views | yes |
support for concurrent writing to the same key | yes |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | yes |
collocated by feature service | no |
collocated by entity key | yes |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | no |
support for entityless feature views | yes |
support for concurrent writing to the same key | no |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | yes |
collocated by feature service | no |
collocated by entity key | no |
Command | Permissions | Resources |
Apply | dynamodb:CreateTable dynamodb:DescribeTable dynamodb:DeleteTable | arn:aws:dynamodb:<region>:<account_id>:table/* |
Materialize | dynamodb.BatchWriteItem | arn:aws:dynamodb:<region>:<account_id>:table/* |
Get Online Features | dynamodb.BatchGetItem | arn:aws:dynamodb:<region>:<account_id>:table/* |
write feature values to the online store | yes |
read feature values from the online store | yes |
update infrastructure (e.g. tables) in the online store | yes |
teardown infrastructure (e.g. tables) in the online store | yes |
generate a plan of infrastructure changes | no |
support for on-demand transforms | yes |
readable by Python SDK | yes |
readable by Java | no |
readable by Go | no |
support for entityless feature views | yes |
support for concurrent writing to the same key | no |
support for ttl (time to live) at retrieval | no |
support for deleting expired data | no |
collocated by feature view | yes |
collocated by feature service | no |
collocated by entity key | no |
Training data generation | Fetching user and item features for (user, item) pairs when training a production recommendation model | get_historical_features |
Offline feature retrieval for batch predictions | Predicting user churn for all users on a daily basis | get_historical_features |
Online feature retrieval for real-time model predictions | Fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent | get_online_features |
Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.
In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.
This tutorial guides you on how to use Feast with scikit-learn. You will learn how to:
Train a model locally (on your laptop) using data from BigQuery
Test the model for online inference using SQLite (for fast iteration)
Test the model for online inference using Redis (for production use)
Try it and let us know what you think!
An Authorization Manager is an instance of the AuthManager class that is plugged into one of the Feast servers to extract user details from the current request and inject them into the permission framework.
Note: Feast does not provide authentication capabilities; it is the client's responsibility to manage the authentication token and pass it to the Feast server, which then validates the token and extracts user details from the configured authentication server.
Two authorization managers are supported out-of-the-box:
One using a configurable OIDC server to extract the user details.
One using the Kubernetes RBAC resources to extract the user details.
These instances are created when the Feast servers are initialized, according to the authorization configuration defined in their own feature_store.yaml.
Feast servers and clients must have consistent authorization configuration, so that the client proxies can automatically inject the authorization tokens that the server can properly identify and use to enforce permission validations.
The server-side implementation of the authorization functionality is defined here. A few of the key models and classes for understanding the authorization implementation on the client side can be found here.
The authorization is configured using a dedicated auth section in the feature_store.yaml configuration.
Note: As a consequence, when deploying the Feast servers with the Helm charts, the feature_store_yaml_base64 value must include the auth section to specify the authorization configuration.
This configuration applies the default no_auth authorization:
With OIDC authorization, the Feast client proxies retrieve the JWT token from an OIDC server (or Identity Provider) and append it in every request to a Feast server, using an Authorization Bearer Token.
The server, in turn, uses the same OIDC server to validate the token and extract the user roles from the token itself.
Some assumptions are made in the OIDC server configuration:
The OIDC token refers to a client with roles matching the RBAC roles of the configured Permissions (*)
The roles are exposed in the access token that is passed to the server
The JWT token is expected to have a verified signature and not be expired. The Feast OIDC token parser logic validates verify_signature and verify_exp, so make sure that the given OIDC provider is configured to meet these requirements.
The preferred_username should be part of the JWT token claim.
(*) Please note that the role match is case-sensitive, e.g. the name of the role in the OIDC server and in the Permission configuration must be exactly the same.
For example, the access token for a client app of a user with the reader role should have the following resource_access section:
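For illustration, a decoded access-token payload (Keycloak-style) with such a section could look like the sketch below, expressed here as a Python dictionary; the client name app and role reader mirror the example above.

```python
# Illustrative decoded access-token payload showing the expected resource_access section.
decoded_access_token = {
    "resource_access": {
        "app": {
            "roles": ["reader"],
        },
    },
}
```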
An example of feast OIDC authorization configuration on the server side is the following:
For the client configuration, the following settings must be added to specify the current user: username, password, and client_secret.
Below is an example of a full Feast OIDC client auth configuration:
With Kubernetes RBAC Authorization, the client uses the service account token as the authorization bearer token, and the server fetches the associated roles from the Kubernetes RBAC resources.
An example of Kubernetes RBAC authorization configuration is the following:
NOTE: This configuration will only work if you deploy feast on Openshift or a Kubernetes platform.
```yaml
project: my-project
auth:
  type: kubernetes
...
```
In case the client cannot run on the same cluster as the servers, the client token can be injected using the LOCAL_K8S_TOKEN environment variable on the client side. The value must refer to the token of a service account created on the server cluster and linked to the desired RBAC roles.
To ensure the Kubernetes RBAC environment aligns with Feast's RBAC configuration, follow these guidelines:
The roles defined in Feast Permission instances must have corresponding Kubernetes RBAC Role names.
The Kubernetes RBAC Role must reside in the same namespace as the Feast service.
The client application can run in a different namespace, using its own dedicated ServiceAccount.
Finally, the RoleBinding that links the client ServiceAccount to the RBAC Role must be defined in the namespace of the Feast service.
If the above rules are satisfied, the Feast service must be granted permissions to fetch RoleBinding instances from the local namespace.