Introduction

What is Feast?

Feast (Feature Store) is an open source feature store that helps teams operate production ML systems at scale by allowing them to define, manage, validate, and serve features for production AI/ML.

Feast's feature store is composed of two foundational components: (1) an offline store for historical feature extraction used in model training and (2) an online store for serving features at low latency in production systems and applications.

Feast is a configurable operational data system that re-uses existing infrastructure to manage and serve machine learning features to real-time models. For more details, please review our architecture.

Concretely, Feast provides:

A Python SDK for programmatically defining features, entities, sources, and (optionally) transformations
A Python SDK for reading and writing features to configured offline and online data stores
An optional feature server for reading and writing features (useful for non-Python languages)
A UI for viewing and exploring information about features defined in the project
A CLI tool for viewing and updating feature information

Feast allows ML platform teams to:

Make features consistently available for training and low-latency serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to real-time models, and from one data infra system to another.
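
The point-in-time correctness described above can be sketched in plain Python. This is a minimal illustration, not Feast's implementation: the rows, the `point_in_time_value` helper, and all values are hypothetical. The idea is that a training example observed at time `as_of` may only see feature values recorded at or before `as_of`.

```python
from datetime import datetime

# Hypothetical feature rows: (entity_id, event_timestamp, feature_value).
feature_rows = [
    ("driver_1", datetime(2024, 1, 1), 4.5),
    ("driver_1", datetime(2024, 1, 3), 4.8),
    ("driver_2", datetime(2024, 1, 2), 3.9),
]


def point_in_time_value(rows, entity_id, as_of):
    """Return the latest feature value observed at or before `as_of`."""
    candidates = [
        (ts, value)
        for (eid, ts, value) in rows
        if eid == entity_id and ts <= as_of
    ]
    if not candidates:
        return None  # no value was known yet at that time
    return max(candidates, key=lambda pair: pair[0])[1]


# A training label observed on Jan 2 must not see the Jan 3 rating.
rating_on_jan_2 = point_in_time_value(feature_rows, "driver_1", datetime(2024, 1, 2))
rating_on_jan_4 = point_in_time_value(feature_rows, "driver_1", datetime(2024, 1, 4))
```

Joining on the latest value regardless of time would hand the Jan 3 rating to the Jan 2 label, which is exactly the leakage the point-in-time rule prevents.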

Note: Feast today primarily addresses timestamped structured data.

Note: Feast uses a push model for online serving. This means that the feature store pushes feature values to the online store, which reduces the latency of feature retrieval. This is more efficient than a pull model, where the model serving system must make a request to the feature store to retrieve feature values. See the Push vs Pull Model section for a more detailed discussion.

Who is Feast for?

Feast helps ML platform/MLOps teams with DevOps experience productionize real-time models. Feast also helps these teams build a feature platform that improves collaboration between data engineers, software engineers, machine learning engineers, and data scientists.

For Data Scientists: Feast is a tool where you can easily define, store, and retrieve your features for both model development and model deployment. By using Feast, you can focus on what you do best: build features that power your AI/ML models and maximize the value of your data.
For MLOps Engineers: Feast is a library that allows you to connect your existing infrastructure (e.g., online database, application server, microservice, analytical database, and orchestration tooling) so that your Data Scientists can ship features for their models to production using a friendly SDK, without having to be concerned with the software engineering challenges of serving real-time production systems. By using Feast, you can focus on maintaining a resilient system, instead of implementing features for Data Scientists.
For Data Engineers: Feast provides a centralized catalog for storing feature definitions, allowing one to maintain a single source of truth for feature data. It provides the abstraction for reading and writing to many different types of offline and online data stores. Using either the provided Python SDK or the feature server service, users can write data to the online and/or offline stores and then read that data out again in either low-latency online scenarios for model inference, or in batch scenarios for model training.
For AI Engineers: Feast provides a platform designed to scale your AI applications by enabling seamless integration of richer data and facilitating fine-tuning. With Feast, you can optimize the performance of your AI models while ensuring a scalable and efficient data pipeline.

What is Feast not?

Feast is not

An ETL / ELT system: Feast is not a general purpose data pipelining system. Users often leverage separate tools to manage upstream data transformations. Feast does support some transformations.
A data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and on integrations with tools like Airflow to make features consistently available.
A data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a lightweight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.
A database: Feast is not a database, but helps manage data stored in other systems (e.g., BigQuery, Snowflake, DynamoDB, Redis) to make features consistently available at training / serving time.

Feast does not fully solve

reproducible model training / model backtesting / experiment management: Feast captures feature and model metadata, but does not version-control datasets / labels or manage train / test splits. Dedicated tools are better suited for this.
batch feature engineering: Feast supports on-demand and streaming transformations. Feast is also investing in supporting batch transformations.
native streaming feature integration: Feast enables users to push streaming features, but does not pull from streaming sources or manage streaming pipelines.
lineage: Feast helps tie feature values to model versions, but is not a complete solution for capturing end-to-end lineage from raw data sources to model versions. Feast also has community-contributed plugins in this area.
data quality / drift detection: Feast has experimental integrations in this area, but is not purpose-built to solve data drift / data quality issues. This requires more sophisticated monitoring across data pipelines, served feature values, labels, and model versions.

Example use cases

Many companies have used Feast to power real-world ML use cases such as:

Personalizing online recommendations by leveraging pre-computed historical user or item features.
Online fraud detection, using features that compare against (pre-computed) historical transaction patterns.
Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch.
Credit scoring, using pre-computed historical features to compute the probability of default.

How can I get started?

The best way to learn Feast is to use it. Head over to our quickstart and try it out!

Explore the following resources to get started with Feast:

The quickstart is the fastest way to get started with Feast.
The concepts section describes all important Feast API concepts.
The architecture section describes Feast's overall architecture.
The tutorials show full examples of using Feast in machine learning applications.
The how-to guides provide a more in-depth guide to using Feast.
The reference section contains detailed API and design documents.
The contributing guide contains resources for anyone who wants to contribute to Feast.

Blog

Welcome to the Feast blog! Here you'll find articles about feature store development, new features, and community updates.

Community & getting help

Links & Resources

Come say hi on Slack!
- As a part of the Linux Foundation, we ask community members to adhere to the Linux Foundation Code of Conduct.
GitHub Repository: Find the complete Feast codebase on GitHub.
- Community Governance Doc: See the governance model of Feast, including who the maintainers are and how decisions are made.
Google Folder: This folder is used as a central repository for all Feast resources. For example:
- Design proposals in the form of Request for Comments (RFC).
- User surveys and meeting minutes.
- Slide decks of conferences our contributors have spoken at.
Feast Linux Foundation Wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.

How can I get help?

GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.

Getting started

Architecture

Overview

Feast's architecture is designed to be flexible and scalable. It is composed of several components that work together to provide a feature store that can be used to serve features for training and inference.

Feast uses a Push Model to ingest data from different sources and store feature values in the online store. This allows Feast to serve features in real-time with low latency.
Feast supports feature transformation for On Demand and Streaming data sources and will support Batch transformations in the future. For Streaming and Batch data sources, Feast requires a separate Feature Transformation Engine (in the batch case, this is typically your Offline Store). We are exploring adding a default streaming engine to Feast.
Domain expertise is recommended when integrating a data source with Feast, in order to understand the tradeoffs of different write patterns for your application.
We recommend using Python for your Feature Store microservice. As mentioned in the Language section, precomputing features is the recommended optimal path to ensure low latency performance. Reducing feature serving to a lightweight database lookup is the ideal pattern, which means the marginal overhead of Python should be tolerable. Because of this we believe the pros of Python outweigh the costs, as reimplementing feature logic is undesirable. Java and Go clients are also available for online feature retrieval.
Role-Based Access Control (RBAC) is a security mechanism that restricts access to resources based on the roles/groups/namespaces of individual users within an organization. In the context of Feast, RBAC ensures that only authorized users or groups can access or modify specific resources, thereby maintaining data security and operational integrity.

Language

Use Python to serve your features.

Why should you use Python to serve features for Machine Learning?

Python has emerged as the primary language for machine learning, and this extends to feature serving. There are five main reasons Feast recommends using a microservice written in Python.

1. Python is the language of Machine Learning

You should meet your users where they are. Python’s popularity in the machine learning community is undeniable. Its simplicity and readability make it an ideal language for writing and understanding complex algorithms. Python boasts a rich ecosystem of libraries such as TensorFlow, PyTorch, XGBoost, and scikit-learn, which provide robust support for developing and deploying machine learning models, and we want Feast to be part of this ecosystem.

2. Precomputation is The Way

Precomputing features is the recommended optimal path to ensure low latency performance. Reducing feature serving to a lightweight database lookup is the ideal pattern, which means the marginal overhead of Python should be tolerable. Precomputation ensures product experiences for downstream services are also fast. Slow user experiences are bad user experiences. Precompute and persist data as much as you can.

3. Serving features in another language can lead to skew

It is critical that the features used during model training (offline serving) match the features available in production for real-time predictions (online serving). When features are initially developed, they are typically written in Python. This is due to the convenience and efficiency provided by Python's data manipulation libraries. However, in a production environment, there is often interest or pressure to rewrite these features in a different language, like Java, Go, or C++, for performance reasons. This reimplementation introduces a significant risk: training and serving skew. Note that there will always be some minor exceptions (e.g., any Time Since Last Event types of features) but this should not be the rule.

Training and serving skew occurs when there are discrepancies between the features used during model training and those used during prediction. This can lead to degraded model performance, unreliable predictions, and reduced velocity in releasing new features and new models. The process of rewriting features in another language is prone to errors and inconsistencies, which exacerbate this issue.

4. Reimplementation is Excessive

Rewriting features in another language is not only risky but also resource-intensive. It requires significant time and effort from engineers to ensure that the features are correctly translated. This process can introduce bugs and inconsistencies, further increasing the risk of training and serving skew. Additionally, maintaining two versions of the same feature codebase adds unnecessary complexity and overhead. More importantly, the opportunity cost of this work is high and requires twice the amount of resourcing. Reimplementing code should only be done when the performance gains are worth the investment. Features should largely be precomputed so the latency performance gains should not be the highest impact work that your team can accomplish.

5. Use existing Python Optimizations

Rather than switching languages, it is more efficient to optimize the performance of your feature store while keeping Python as the primary language. Optimization is a two-step process.

Step 1: Quantify latency bottlenecks in your feature calculations

Use a profiling tool to understand latency bottlenecks in your code. This will help you prioritize the biggest inefficiencies first. When we initially launched Python native transformations, profiling helped us identify that Pandas resulted in a 10x overhead due to type conversion.
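
As one way to carry out this step, the sketch below profiles a feature calculation with Python's standard-library `cProfile` and `pstats` modules. The choice of `cProfile` is an assumption made for illustration, and `slow_feature` is a hypothetical stand-in for your own feature code.

```python
import cProfile
import io
import pstats


def slow_feature(values):
    # Deliberately naive: repeated string round-trips stand in for costly type conversion.
    return sum(float(str(v)) for v in values)


profiler = cProfile.Profile()
profiler.enable()
result = slow_feature(range(10_000))
profiler.disable()

# Collect the ten most expensive calls by cumulative time into a string report.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
```

Printing `report` lists each function with its call count and cumulative time, which makes it easier to see where a feature calculation spends its time before deciding what to optimize.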

Step 2: Optimize your feature calculations

As mentioned, precomputation is the recommended path. In some cases, you may want fully synchronous writes from your data producer to your online feature store, in which case you will want your feature computations and writes to be very fast. In this case, we recommend optimizing the feature calculation code first.

You should optimize your code using libraries, tools, and caching. For example, identify whether your feature calculations can be optimized through vectorized calculations in NumPy; explore tools like Numba for faster execution; and cache frequently accessed data using tools like an lru_cache.
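
The caching suggestion can be sketched with `functools.lru_cache` from the standard library. The `country_risk_score` function and its scoring rule are hypothetical; the counter only exists to show how often the function body actually runs.

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=1024)
def country_risk_score(country_code: str) -> float:
    """Hypothetical expensive lookup; counts how often the body really runs."""
    global call_count
    call_count += 1
    return (sum(ord(c) for c in country_code) % 100) / 100


# Five lookups, but only two distinct inputs reach the function body.
scores = [country_risk_score(code) for code in ["US", "DE", "US", "US", "DE"]]
cache_stats = country_risk_score.cache_info()
```

Caching like this only suits inputs that are hashable and results that stay valid between calls; values that change frequently need an explicit expiry strategy instead.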

Lastly, Feast will continue to optimize serving in Python and make the overall infrastructure more performant. This will better serve the community.

So we recommend focusing on optimizing your feature-specific code, reporting latency bottlenecks to the maintainers, and contributing to help the infrastructure be more performant.

By keeping features in Python and optimizing performance, you can ensure consistency between training and serving, reduce the risk of errors, and focus on launching more product experiences for your customers.

Embrace Python for feature serving, and leverage its strengths to build robust and reliable machine learning systems.

Push vs Pull Model

Feast uses a Push Model, i.e., Data Producers push data to the feature store and Feast stores the feature values in the online store, to serve features in real-time.

In a Pull Model, Feast would pull data from the data producers at request time and store the feature values in the online store before serving them (storing them would actually be unnecessary). This approach would incur additional network latency as Feast would need to orchestrate a request to each data producer, which would mean the latency would be at least as long as your slowest call. So, in order to serve features as fast as possible, we push data to Feast and store the feature values in the online store.

The trade-off with the Push Model is that strong consistency is not guaranteed out of the box. Instead, strong consistency has to be explicitly designed for in orchestrating the updates to Feast and the client usage.

The significant advantage with this approach is that Feast is read-optimized for low-latency feature retrieval.
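
The difference between the two models can be illustrated with a toy in-memory sketch. None of the names below are Feast APIs, and the latencies are invented numbers; the point is only that a push-based read is a single lookup, while a pull-based read waits for the slowest producer.

```python
# Toy in-memory stand-ins; none of these names are Feast APIs.
online_store = {}


def push(entity_id, features):
    """Push model: the data producer writes values ahead of request time."""
    online_store.setdefault(entity_id, {}).update(features)


def read_pushed(entity_id):
    """Serving is a single key lookup in the online store."""
    return online_store.get(entity_id, {})


def read_pulled(entity_id, producers):
    """Pull model: every producer is called at request time.

    Each producer returns (features, latency_ms); even if the calls run in
    parallel, the request waits for the slowest one.
    """
    features, latencies = {}, []
    for producer in producers:
        values, latency_ms = producer(entity_id)
        features.update(values)
        latencies.append(latency_ms)
    return features, max(latencies, default=0)


# Hypothetical producers with made-up latencies in milliseconds.
producers = [
    lambda entity_id: ({"clicks": 12}, 5),
    lambda entity_id: ({"avg_page_duration": 31.5}, 40),
]

# Push path: producers write first, the read is a lookup.
for producer in producers:
    push("user_1", producer("user_1")[0])
pushed = read_pushed("user_1")

# Pull path: same values, but latency is bounded below by the slowest producer.
pulled, pull_latency_ms = read_pulled("user_1", producers)
```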

How to Push

Implicit in the Push model are decisions about how and when to push feature values to the online store.

From a developer's perspective, there are three ways to push feature values to the online store with different tradeoffs.

They are discussed further in the Write Patterns section.

Feature Transformation

A feature transformation is a function that takes some set of input data and returns some set of output data. Feature transformations can happen on either raw data or derived data.

Feature Transformation Engines

Feature transformations can be executed by three types of "transformation engines":

The Feast Feature Server
An Offline Store (e.g., Snowflake, BigQuery, DuckDB, Spark, etc.)

The three transformation engines are coupled with the communication pattern used for writes.

Importantly, this implies that different feature transformation code may be used under different transformation engines, so understanding the tradeoffs of when to use which transformation engine/communication pattern is extremely critical to the success of your implementation.

In general, we recommend choosing transformation engines and network calls by aligning them with what is most appropriate for the data producer, feature/model usage, and overall product.

API

feature_transformation

feature_transformation or udf are the core APIs for defining feature transformations in Feast. They allow you to specify custom logic that can be applied to the data during materialization or retrieval.

Aggregation

Aggregation is a built-in API for defining batch or streamable aggregations on data. It allows you to specify how to aggregate data over a time window, such as calculating the average or sum of a feature over a specified period.
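
To make the idea of a time-windowed aggregation concrete, here is a minimal plain-Python sketch. It does not use Feast's Aggregation API; the `windowed_sum` helper, the event rows, and the choice of a half-open window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical events: (entity_id, event_timestamp, amount).
events = [
    ("user_1", datetime(2024, 1, 1, 10), 20.0),
    ("user_1", datetime(2024, 1, 1, 11), 30.0),
    ("user_1", datetime(2024, 1, 1, 14), 10.0),
]


def windowed_sum(rows, entity_id, as_of, window):
    """Sum amounts for one entity in the half-open window (as_of - window, as_of]."""
    start = as_of - window
    return sum(
        amount
        for (eid, ts, amount) in rows
        if eid == entity_id and start < ts <= as_of
    )


# Sum of the last two hours as seen at 11:30; the 14:00 event is not yet visible.
two_hour_sum = windowed_sum(
    events, "user_1", datetime(2024, 1, 1, 11, 30), timedelta(hours=2)
)
```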

Filter

ttl: The amount of time that features remain available for materialization or retrieval. Only entity rows whose timestamp is later than the current time minus the ttl are used. This is useful for ensuring that only recent data is used in feature calculations.
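
The ttl rule stated above can be sketched as a simple filter. This is an illustration of the rule as described in this section, not Feast's internal logic; the `within_ttl` helper and the rows are hypothetical.

```python
from datetime import datetime, timedelta


def within_ttl(rows, now, ttl):
    """Keep rows whose timestamp is later than `now - ttl`."""
    cutoff = now - ttl
    return [row for row in rows if row["event_timestamp"] > cutoff]


# Hypothetical rows for one driver, nine days apart.
rows = [
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 1), "rating": 4.2},
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 9), "rating": 4.7},
]

# With a two-day ttl evaluated on Jan 10, only the Jan 9 row survives.
fresh = within_ttl(rows, now=datetime(2024, 1, 10), ttl=timedelta(days=2))
```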

Join

Feast can join multiple feature views together to create a composite feature view. This allows you to combine features from different sources or views into a single view.

The underlying implementation of the join is an inner join by default, and the join key is the entity ID.
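
The inner-join behavior can be illustrated with a small plain-Python sketch. The `inner_join` helper and the rows are hypothetical, and the sketch assumes at most one row per entity on the right-hand side.

```python
def inner_join(left_rows, right_rows, join_key):
    """Inner join two lists of feature rows on a shared entity key."""
    right_by_key = {row[join_key]: row for row in right_rows}
    return [
        {**left, **right_by_key[left[join_key]]}
        for left in left_rows
        if left[join_key] in right_by_key
    ]


# Hypothetical rows from two feature views that share the driver_id entity.
driver_stats = [{"driver_id": 1, "rating": 4.7}, {"driver_id": 2, "rating": 3.9}]
driver_trips = [{"driver_id": 1, "trips_today": 12}]

# Driver 2 has no trips row, so an inner join drops it from the result.
joined = inner_join(driver_stats, driver_trips, "driver_id")
```

Because the join is an inner join, entities missing from either side disappear from the composite view, which is worth keeping in mind when one source is sparser than the other.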

Feature Serving and Model Inference

Note: this ML Infrastructure diagram highlights an orchestration pattern that is driven by a client application. This is not the only approach that can be taken and different patterns will result in different trade-offs.

Production machine learning systems can choose from four approaches to serving machine learning predictions (the output of model inference):

Online model inference with online features
Offline model inference without online features
Online model inference with online features and cached predictions
Online model inference without features

Note: online features can be sourced from batch, streaming, or request data sources.

These four approaches have different tradeoffs and, in general, significant implementation differences.

1. Online Model Inference with Online Features

Online model inference with online features is a powerful approach to serving data-driven machine learning applications. This requires a feature store to serve online features and a model server to serve model predictions (e.g., KServe). This approach is particularly useful for applications where request-time data is required to run inference.

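from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast repo in the current directory

# Fetch precomputed features for one user from the online store.
features = store.get_online_features(
    features=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
    ],
    entity_rows=[{"user_id": 1}],
)
# `model_server` is a placeholder for your model serving client (e.g., KServe).
model_predictions = model_server.predict(features)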

2. Offline Model Inference without Online Features

Typically, machine learning teams find serving precomputed model predictions to be the most straightforward to implement. This approach simply treats the model predictions as a feature and serves them from the feature store using the standard Feast SDK. These model predictions are typically generated through some batch process where the model scores are precomputed. As a concrete example, the batch process can be as simple as a script that runs model inference locally for a set of users and outputs a CSV. This output file could be used for materialization so that the model could be served online as shown in the code below.

# Predictions were materialized like any other feature, so this is a plain lookup.
model_predictions = store.get_online_features(
    features=[
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)

Notice that the model server is not involved in this approach. Instead, the model predictions are precomputed and materialized to the online store.

While this approach can lead to quick impact for different business use cases, it suffers from stale data as well as only serving users/entities that were available at the time of the batch computation. In some cases, this tradeoff may be tolerable.

3. Online Model Inference with Online Features and Cached Predictions

This approach is the most sophisticated: inference is optimized for low latency by caching predictions and running model inference when data producers write features to the online store. This approach is particularly useful for applications where features are coming from multiple data sources, the model is computationally expensive to run, or latency is a significant constraint.

import pandas as pd

# Client Reads
features = store.get_online_features(
    features=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
# to_dict() maps each feature name to a list with one value per entity row.
cached_prediction = features.to_dict().get("model_predictions", [None])[0]
if cached_prediction is None:
    # Cache miss: run inference and write the prediction back for later reads.
    model_predictions = model_server.predict(features)
    store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))

Note that in this case a separate call to write_to_online_store is required when the underlying data changes and predictions change along with it.

# Client Writes from the Data Producer
user_data = request.POST.get('user_data')  # `request` comes from your web framework
model_predictions = model_server.predict(user_data)  # assumed to return `user_data` plus the prediction
store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))

While this requires additional writes for every data producer, this approach will result in the lowest latency for model inference.

4. Online Model Inference without Features

This approach does not require Feast. The model server can directly serve predictions without any features. This approach is common in Large Language Models (LLMs) and other models that do not require features to make predictions.

Note that generative models using Retrieval Augmented Generation (RAG) do require features where the document embeddings are treated as features, which Feast supports (this would fall under "Online Model Inference with Online Features").

Client Orchestration

Implicit in the code examples above is a design choice about how clients orchestrate calls to get features and run model inference. The examples follow a Feast-centric pattern: features are inputs to the model, so the sequencing is fairly obvious. An alternative is an inference-centric pattern, where a client calls an inference endpoint and the inference service is responsible for orchestration.

Role-Based Access Control (RBAC)

Introduction

Role-Based Access Control (RBAC) is a security mechanism that restricts access to resources based on the roles/groups/namespaces of individual users within an organization. In the context of Feast, RBAC ensures that only authorized users or groups/namespaces can access or modify specific resources, thereby maintaining data security and operational integrity.

Functional Requirements

The RBAC implementation in Feast is designed to:

Assign Permissions: Allow administrators to assign permissions for various operations and resources to users or groups/namespaces.
Seamless Integration: Integrate smoothly with existing business code without requiring significant modifications.
Backward Compatibility: Maintain support for non-authorized models as the default to ensure backward compatibility.

Business Goals

The primary business goals of implementing RBAC in Feast are:

Feature Sharing: Enable multiple teams to share the feature store while ensuring controlled access. This allows for collaborative work without compromising data security.
Access Control Management: Prevent unauthorized access to team-specific resources and spaces, governing the operations that each user or group can perform.

Reference Architecture

Feast operates as a collection of connected services, each enforcing authorization permissions. The architecture is designed as a distributed microservices system with the following key components:

Service Endpoints: These enforce authorization permissions, ensuring that only authorized requests are processed.
Client Integration: Clients authenticate with feature servers by attaching an authorization token to each request.
Service-to-Service Communication: This is always granted.

Permission Model

The RBAC system in Feast uses a permission model that defines the following concepts:

Resource: An object within Feast that needs to be secured against unauthorized access.
Action: A logical operation performed on a resource, such as Create, Describe, Update, Delete, Read, or Write.
Policy: A set of rules that enforce authorization decisions on resources. Policies are based on user roles, groups, namespaces, or a combination of these.

Authorization Architecture

The authorization architecture in Feast is built with the following components:

Token Extractor: Extracts the authorization token from the request header.
Token Parser: Parses the token to retrieve user details.
Policy Enforcer: Validates the secured endpoint against the retrieved user details.
Token Injector: Adds the authorization token to each secured request header.
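
The four components above can be sketched as a small pipeline. This is a toy illustration only: the `user:role1,role2` token format, the `POLICIES` table, and every function name are hypothetical, and real deployments rely on signed tokens issued by an identity provider rather than plain strings.

```python
# Hypothetical policy table: (resource, action) -> roles allowed to perform it.
POLICIES = {
    ("driver_stats", "read"): {"reader", "admin"},
    ("driver_stats", "update"): {"admin"},
}


def extract_token(headers):
    """Token Extractor: read the bearer token from the request header."""
    value = headers.get("Authorization", "")
    return value[len("Bearer "):] if value.startswith("Bearer ") else None


def parse_token(token):
    """Token Parser: recover the user name and roles from the toy token."""
    user, _, roles = token.partition(":")
    return {"user": user, "roles": set(filter(None, roles.split(",")))}


def is_allowed(user_details, resource, action):
    """Policy Enforcer: allow if the user holds any role the policy permits."""
    allowed_roles = POLICIES.get((resource, action), set())
    return bool(user_details["roles"] & allowed_roles)


def inject_token(headers, token):
    """Token Injector: attach the token to an outgoing request header."""
    return {**headers, "Authorization": f"Bearer {token}"}


# Client side injects the token; server side extracts, parses, and enforces.
request_headers = inject_token({}, "alice:reader")
details = parse_token(extract_token(request_headers))
can_read = is_allowed(details, "driver_stats", "read")
can_update = is_allowed(details, "driver_stats", "update")
```

Actions with no matching policy entry are denied in this sketch, which is a design choice made here for safety and not a statement about Feast's defaults.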

Concepts

Overview

Feast project structure

The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store. You can read more about Feast projects in the project page.

Data ingestion

For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.

For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline / online. We explore this more in the next concept page (Data ingestion).

Feature registration and retrieval

Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).

Feast supports several patterns of feature retrieval.

| Use case | Example | API |
| --- | --- | --- |
| Training data generation | Fetching user and item features for (user, item) pairs when training a production recommendation model | get_historical_features |
| Offline feature retrieval for batch predictions | Predicting user churn for all users on a daily basis | get_historical_features |
| Online feature retrieval for real-time model predictions | Fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent | get_online_features |

Project

Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).

Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.

The concept of a "project" provides the following benefits:

Logical Grouping: Projects group related features together, making it easier to manage and track them.

Feature Definitions: Within a project, you can define features, including their metadata, types, and sources. This helps standardize how features are created and consumed.

Isolation: Projects provide a way to isolate different environments, such as development, testing, and production, ensuring that changes in one project do not affect others.

Collaboration: By organizing features within projects, teams can collaborate more effectively, with clear boundaries around the features they are responsible for.

Access Control: Projects can implement permissions, allowing different users or teams to access only the features relevant to their work.

Data ingestion

Data source

A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features.

Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.

Below is an example data source with a single entity column (driver) and two feature columns (trips_today and rating).
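
As a rough picture of what such a source looks like under the time-series data model, the sketch below represents it as a list of rows. The values are invented for illustration, and the `event_timestamp` column name is an assumption; only the `driver`, `trips_today`, and `rating` columns come from the description above.

```python
from datetime import datetime

# Hypothetical rows; the values are invented for illustration.
driver_source = [
    {"driver": 1001, "event_timestamp": datetime(2024, 1, 1, 8), "trips_today": 3, "rating": 4.6},
    {"driver": 1001, "event_timestamp": datetime(2024, 1, 1, 18), "trips_today": 9, "rating": 4.7},
    {"driver": 1002, "event_timestamp": datetime(2024, 1, 1, 12), "trips_today": 5, "rating": 4.1},
]

ENTITY_COLUMN = "driver"
TIMESTAMP_COLUMN = "event_timestamp"  # assumed name for the time column

# Every column that is neither the entity nor the timestamp is a feature.
feature_columns = [
    column
    for column in driver_source[0]
    if column not in (ENTITY_COLUMN, TIMESTAMP_COLUMN)
]
```

Each row records what was known about one driver at one point in time, so the same driver can appear many times with different timestamps.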

Feast supports primarily time-stamped tabular data as data sources. There are many kinds of possible data sources:

Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
- Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
- [Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
(Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on demand feature views, which allow light-weight feature engineering and combining features across sources.

Batch data ingestion

Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).

A key command to use in Feast is the materialize_incremental command, which fetches the latest values for all entities in the batch source and ingests these values into the online store.
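
As a rough illustration of what "fetching the latest values for all entities" means, the pure-Python sketch below keeps only the newest row per entity inside a time window and upserts it into a dictionary that stands in for the online store. The function name `materialize_window`, the row layout, and the dictionary store are hypothetical and chosen for illustration; this is not Feast's implementation.

```python
from datetime import datetime


def materialize_window(rows, online_store, start, end):
    """Upsert the newest row per driver_id with start < event_timestamp <= end.

    `rows` is an iterable of dicts with `driver_id`, `event_timestamp`, and
    feature columns. `online_store` maps driver_id -> latest row (hypothetical).
    """
    for row in rows:
        ts = row["event_timestamp"]
        if not (start < ts <= end):
            continue  # outside the window being materialized
        current = online_store.get(row["driver_id"])
        # Only overwrite when this row is newer than what is already stored.
        if current is None or ts > current["event_timestamp"]:
            online_store[row["driver_id"]] = row
    return online_store


# Illustrative rows, not taken from a real data source.
batch_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "trips_today": 3},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "trips_today": 5},
    {"driver_id": 1002, "event_timestamp": datetime(2021, 4, 12, 9), "trips_today": 2},
]

online_store = {}
# The first run covers everything up to 09:00; the next run starts where it ended.
materialize_window(batch_rows, online_store, datetime.min, datetime(2021, 4, 12, 9))
materialize_window(batch_rows, online_store, datetime(2021, 4, 12, 9), datetime(2021, 4, 12, 12))
```

After both runs the store holds one row per driver, and driver 1001 carries the 10:00 value because the second window picked up the newer row.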

When working with On Demand Feature Views with write_to_online_store=True, you can also control whether transformations are applied during ingestion by using the transform_on_write parameter. Setting transform_on_write=False allows you to materialize pre-transformed features without reapplying transformations, which is particularly useful for large batch datasets that have already been processed.

Materialization can be called programmatically or through the CLI:

Code example: programmatic scheduled materialization

This snippet creates a feature store object which points to the registry (which knows of all defined features) and the online store (DynamoDB in this case).

Code example: CLI based materialization

How to run this in the CLI

With timestamps:

Simple materialization (for data without event timestamps):

How to run this on Airflow

Batch data schema inference

If the schema parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply. The way it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box, this inference is performed by inspecting the schema of the table in the cloud data warehouse or, if a query is provided to the source, by running the query with a LIMIT clause and inspecting the result.
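
The "run the query with a LIMIT clause and inspect the result" idea can be sketched with the standard library. In this example SQLite stands in for a data warehouse, and `infer_schema` is a hypothetical helper; the real inference logic lives in each offline store implementation and maps to Feast types rather than Python type names.

```python
import sqlite3


def infer_schema(conn, query):
    """Run `query` with a LIMIT and map each result column to a Python type name."""
    cursor = conn.execute(f"SELECT * FROM ({query}) LIMIT 1")
    column_names = [col[0] for col in cursor.description]
    first_row = cursor.fetchone()
    if first_row is None:
        # No rows to inspect, so only the column names are known.
        return {name: None for name in column_names}
    return {name: type(value).__name__ for name, value in zip(column_names, first_row)}


# In-memory table with illustrative data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver_stats (driver_id INTEGER, trips_today INTEGER, rating REAL)")
conn.execute("INSERT INTO driver_stats VALUES (1001, 5, 4.7)")

schema = infer_schema(conn, "SELECT driver_id, trips_today, rating FROM driver_stats")
```

Inspecting a single row keeps the probe cheap, which is the reason for the LIMIT clause.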

Stream data ingestion

Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.

To push data into the offline or online stores: see push sources for details.
(experimental) To use a contrib Spark processor to ingest from a topic, see

Entity

An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.

from feast import Entity
driver = Entity(name='driver', join_keys=['driver_id'])

The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.

Entities are used by Feast in many contexts, as we explore below:

Use case #1: Defining and storing features

Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:

zero entities (e.g. a global feature like num_daily_global_transactions)
one entity (e.g. a user feature like user_age or last_5_bought_items)
multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)

Feast refers to this collection of entities for a feature view as an entity key.
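
To make the three cases concrete, here is a minimal sketch of building a lookup key from zero, one, or several join keys. `entity_key` is a hypothetical helper written for this illustration; Feast's own entity key representation and serialization are different.

```python
def entity_key(join_keys, row):
    """Build a hashable key from the join-key values of `row`.

    `join_keys` may be empty (a global feature), hold one key, or hold several
    keys (a composite key). Keys are sorted so the result does not depend on
    the order in which the entities were listed.
    """
    return tuple((name, row[name]) for name in sorted(join_keys))


# Illustrative row combining join keys and one feature value.
row = {
    "user_id": 42,
    "merchant_category": "grocery",
    "num_user_purchases_in_merchant_category": 7,
}

global_key = entity_key([], row)
user_key = entity_key(["user_id"], row)
composite_key = entity_key(["user_id", "merchant_category"], row)
```

A global feature ends up with an empty key, so every lookup resolves to the same single record.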

Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists to understand how other teams build features for the entity they are most interested in.

Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.

Use case #2: Retrieving features

At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch point-in-time correct features for to generate a training dataset.

At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).

Q: Can I retrieve features for all entities?

Kind of.

In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an open GitHub issue that welcomes contribution to make this a more intuitive API.

For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.

Point-in-time joins

Feature values in Feast are modeled as time-series records. Below is an example of a driver feature view with two feature columns (trips_today and earnings_today):

The above table can be registered with Feast through the following feature view:

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="earnings_today", dtype=Float32),
    ],
    ttl=timedelta(hours=2),
    source=FileSource(
        path="driver_hourly_stats.parquet"
    )
)

Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.

Given the following entity dataframe, imagine a user would like to join the above driver_hourly_stats feature view onto it, while preserving the trip_success column:

The timestamps within the entity dataframe above are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:

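import pandas as pd
from feast import FeatureStore

# Load the feature repository from the current directory
store = FeatureStore(repo_path=".")

# Read in the entity dataframe; assumes its timestamp column is named event_timestamp
entity_df = pd.read_csv("entity_df.csv", parse_dates=["event_timestamp"])

# Run historical retrieval and convert the returned retrieval job to a dataframe
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:trips_today",
        "driver_hourly_stats:earnings_today",
    ],
).to_df()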

For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time specified.

Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).

Below is the resulting joined training dataframe. It contains both the original entity rows and joined feature values:

Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.
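
The join rule described above (scan backward from each entity timestamp, but no further than the TTL) can be sketched in plain Python. The rows below are illustrative and are not the tables from this page; `point_in_time_join` is a hypothetical helper, and treating the TTL boundary as inclusive is an assumption of this sketch rather than a statement about Feast's behavior.

```python
from datetime import datetime, timedelta


def point_in_time_join(entity_rows, feature_rows, ttl):
    """For each entity row, attach the newest feature row that is not in the
    future and not older than `ttl` relative to the entity row's timestamp."""
    joined = []
    for entity in entity_rows:
        ts = entity["event_timestamp"]
        candidates = [
            f for f in feature_rows
            if f["driver_id"] == entity["driver_id"]
            and ts - ttl <= f["event_timestamp"] <= ts
        ]
        best = max(candidates, key=lambda f: f["event_timestamp"], default=None)
        joined.append({
            **entity,
            "trips_today": best["trips_today"] if best else None,
            "earnings_today": best["earnings_today"] if best else None,
        })
    return joined


feature_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 7), "trips_today": 2, "earnings_today": 30.0},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 9), "trips_today": 5, "earnings_today": 80.0},
]
entity_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 6), "trip_success": 1},   # older than any feature row
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "trip_success": 0},   # joins the 07:00 row
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "trip_success": 1},  # joins the 09:00 row
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 20), "trip_success": 1},  # 11 hours later: outside TTL
]

training_rows = point_in_time_join(entity_rows, feature_rows, ttl=timedelta(hours=2))
```

Because the window is computed from each entity row's own timestamp, rerunning the join at a later date yields the same result.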

[Alpha] Saved dataset

Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. was the primary motivation for creating dataset concept.

The dataset's metadata is stored in the Feast registry, and the raw data (features, entities, additional input keys and timestamp) is stored in the offline store.

A dataset can be created from:

Results of historical retrieval
[planned] Logging request (including input for ) and response during feature serving
[planned] Logging features during writing to online store (from batch source or stream)

Creating a saved dataset from historical retrieval

To create a saved dataset from historical features for later retrieval or analysis, a user first calls the get_historical_features method and then passes the returned retrieval job to the create_saved_dataset method. create_saved_dataset triggers the provided retrieval job (by calling .persist() on it) to store the data using the specified storage behind the scenes. The storage type must be the same as the globally configured offline store (e.g. it's impossible to persist data to a different offline source). create_saved_dataset also creates a SavedDataset object with all of the related metadata and writes this object to the registry.

A saved dataset can be retrieved later using the get_saved_dataset method in the feature store:

Check out our to see how this concept can be applied in a real-world use case.

Components

Overview

Functionality

Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
Create Stream Features: Stream features are created from streaming services such as Kafka or Kinesis, and can be pushed directly into Feast via the Push API.
Feast Apply: The user (or CI) publishes version-controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.
Feast Materialize: The user (or scheduler) executes feast materialize (with timestamps or --disable-event-timestamp to materialize all data with current timestamps) which loads features from the offline store into the online store.
Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset that can be used for training models.
Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
Prediction: A backend system makes a request for a prediction from the model serving service.
Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
Feature Retrieval: The online serving service retrieves the latest feature values from the online store and returns them to the model serving service.

Components

A complete Feast deployment contains the following components:

Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
Feast Python SDK/CLI: The primary user facing SDK. Used to:
- Manage version controlled feature definitions.
- Materialize (load) feature values into the online store.
- Build and retrieve training datasets from the offline store.
- Retrieve online features.
Feature Server: The Feature Server is a REST API server that serves feature values for a given entity key and feature reference. The Feature Server is designed to be horizontally scalable and can be deployed in a distributed manner.
Stream Processor: The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
Compute Engine: The component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through push sources.
Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
Authorization Manager: The authorization manager detects authentication tokens from client requests to Feast servers and uses this information to enforce permission policies on the requested services.

Registry

The Feast feature registry is a central catalog of all feature definitions and their related metadata. Feast uses the registry to store all applied Feast objects (e.g. feature views, entities, etc.). It allows data scientists to search, discover, and collaborate on new features. The registry exposes methods to apply, list, retrieve and delete these objects, and is an abstraction with multiple implementations.

Feast comes with built-in file-based and sql-based registry implementations. By default, Feast uses a file-based registry, which stores the protobuf representation of the registry as a serialized file in the local file system. For more details on which registries are supported, please see Registries.

Updating the registry

We recommend users store their Feast feature definitions in a version-controlled repository, which then stays synced with the registry automatically via CI/CD. Users will often also want multiple registries that correspond to different environments (e.g. dev vs staging vs prod), with write access to the staging and production registries locked down since they can impact real user traffic. See Running Feast in Production for details on how to set this up.

Accessing the registry from clients

Users can specify the registry through a feature_store.yaml config file, or programmatically. We often see teams preferring the programmatic approach because it makes notebook driven development very easy:

Option 1: programmatically specifying the registry

from feast import FeatureStore, RepoConfig
from feast.repo_config import RegistryConfig

repo_config = RepoConfig(
    registry=RegistryConfig(path="gs://feast-test-gcs-bucket/registry.pb"),
    project="feast_demo_gcp",
    provider="gcp",
    offline_store="file",  # Could also be the OfflineStoreConfig e.g. FileOfflineStoreConfig
    online_store="null",  # Could also be the OnlineStoreConfig e.g. RedisOnlineStoreConfig
)
store = FeatureStore(config=repo_config)

Option 2: specifying the registry in the project's `feature_store.yaml` file

project: feast_demo_aws
provider: aws
registry: s3://feast-test-s3-bucket/registry.pb
online_store: null
offline_store:
  type: file

Instantiating a FeatureStore object can then point to this:

store = FeatureStore(repo_path=".")

The file-based feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.

Offline store

An offline store is an interface for working with historical time-series feature values that are stored in . The OfflineStore interface has several different implementations, such as the BigQueryOfflineStore, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see .

Offline stores are primarily used for two reasons:

Building training datasets from time-series features.
Materializing (loading) features into an online store to serve those features at low-latency in a production setting.

Offline stores are configured through the feature_store.yaml file. When building training datasets or materializing features into an online store, Feast will use the configured offline store with your configured data sources to execute the necessary data operations.

Only a single offline store can be used at a time. Moreover, offline stores are not compatible with all data sources; for example, the BigQuery offline store cannot be used to query a file-based data source.

Please see push sources for more details on how to push features directly to the offline store in your feature store.

Online store

Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize command (either with specific timestamps or using --disable-event-timestamp to materialize all data with current timestamps).

The storage schema of features within the online store mirrors that of the original data source. One key difference is that for each entity key, only the latest feature values are stored. No historical values are stored.

Here is an example batch data source:

Once the above data source is materialized into Feast (using feast materialize with timestamps or feast materialize --disable-event-timestamp), the feature values will be stored as follows:

Features can also be written directly to the online store via push sources.

Feature server

The Feature Server is a core architectural component in Feast, designed to provide low-latency feature retrieval and updates for machine learning applications.

It is a REST API server built using FastAPI and exposes a limited set of endpoints to serve features, push data, and support materialization operations. The server is scalable, flexible, and designed to work seamlessly with various deployment environments, including local setups and cloud-based systems.

Motivation

In machine learning workflows, real-time access to feature values is critical for enabling low-latency predictions. The Feature Server simplifies this requirement by:

Serving Features: Allowing clients to retrieve feature values for specific entities in real-time, reducing the complexity of direct interactions with the online store.
Data Integration: Providing endpoints to push feature data directly into the online or offline store, ensuring data freshness and consistency.
Scalability: Supporting horizontal scaling to handle high request volumes efficiently.
Standardized API: Exposing HTTP/JSON endpoints that integrate seamlessly with various programming languages and ML pipelines.
Secure Communication: Supporting TLS (SSL) for secure data transmission in production environments.

Architecture

The Feature Server operates as a stateless service backed by two key components:

Online Store: The primary data store used for low-latency feature retrieval.
Registry: The metadata store that defines feature sets, feature views, and their relationships to entities.

Key Features

RESTful API: Provides standardized endpoints for feature retrieval and data pushing.
CLI Integration: Easily managed through the Feast CLI with commands like feast serve.
Flexible Deployment: Can be deployed locally, via Docker, or on Kubernetes using Helm charts.
Scalability: Designed for distributed deployments to handle large-scale workloads.
TLS Support: Ensures secure communication in production setups.

Endpoints Overview

| Endpoint | Description |
| --- | --- |
| /get-online-features | Retrieves feature values for specified entities and feature references. |
| /push | Pushes feature data to the online and/or offline store. |
| /materialize | Materializes features within a specific time range to the online store. |
| /materialize-incremental | Incrementally materializes features up to the current timestamp. |
| /retrieve-online-documents | Supports Vector Similarity Search for RAG (Alpha endpoint). |
| /docs | API contract for available endpoints. |
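
As a hedged sketch of calling the feature retrieval endpoint, the snippet below builds (but does not send) an HTTP/JSON request using only the standard library. The body layout (a `features` list plus an `entities` mapping of join key to values) and the `localhost:6566` address are assumptions about a typical local `feast serve` setup; check the /docs endpoint of your own server for the authoritative contract. `build_online_features_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_online_features_request(base_url, features, entities):
    """Build (but do not send) a POST request for the /get-online-features endpoint."""
    body = json.dumps({"features": features, "entities": entities}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/get-online-features",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_online_features_request(
    base_url="http://localhost:6566",  # assumed address of a locally running feature server
    features=["driver_hourly_stats:trips_today", "driver_hourly_stats:earnings_today"],
    entities={"driver_id": [1001, 1002]},
)
payload = json.loads(request.data)
# To send it against a running server: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the server.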

Provider

A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).

Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.

If the built-in providers are not sufficient, you can create your own custom provider. Please see for more details.

Please see for configuring providers.

Authorization Manager

An Authorization Manager is an instance of the AuthManager class that is plugged into one of the Feast servers to extract user details from the current request and inject them into the permission framework.

Note: Feast does not provide authentication capabilities; it is the client's responsibility to manage the authentication token and pass it to the Feast server, which then validates the token and extracts user details from the configured authentication server.

Two authorization managers are supported out-of-the-box:

One using a configurable OIDC server to extract the user details.
One using the Kubernetes RBAC resources to extract the user details.

These instances are created when the Feast servers are initialized, according to the authorization configuration defined in their own feature_store.yaml.

Feast servers and clients must have consistent authorization configuration, so that the client proxies can automatically inject the authorization tokens that the server can properly identify and use to enforce permission validations.

Design notes

The server-side implementation of the authorization functionality is defined here. A few of the key models and classes needed to understand the authorization implementation on the client side can be found here.

Configuring Authorization

The authorization is configured using a dedicated auth section in the feature_store.yaml configuration.

Note: As a consequence, when deploying the Feast servers with the Helm charts, the feature_store_yaml_base64 value must include the auth section to specify the authorization configuration.

No Authorization

This configuration applies the default no_auth authorization:

project: my-project
auth:
  type: no_auth
...

OIDC Authorization

With OIDC authorization, the Feast client proxies retrieve the JWT token from an OIDC server (or Identity Provider) and attach it to every request to a Feast server as an Authorization Bearer Token.

The server, in turn, uses the same OIDC server to validate the token and extract the user roles from the token itself.

Some assumptions are made in the OIDC server configuration:

The OIDC token refers to a client with roles matching the RBAC roles of the configured Permissions (*)
The roles are exposed in the access token that is passed to the server
The JWT token is expected to have a verified signature and not be expired. The Feast OIDC token parser logic validates both verify_signature and verify_exp, so make sure that the given OIDC provider is configured to meet these requirements.
The preferred_username should be part of the JWT token claim.

(*) Please note that the role match is case-sensitive, e.g. the name of the role in the OIDC server and in the Permission configuration must be exactly the same.

For example, the access token for a client app of a user with reader role should have the following resource_access section:

{
  "resource_access": {
    "app": {
      "roles": [
        "reader"
      ]
    }
  }
}
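
The snippet below sketches how roles could be read from already-decoded token claims shaped like the example above and compared against required roles. Both helper functions are hypothetical, the "at least one matching role" rule is an assumption of this sketch, and signature and expiry validation are deliberately out of scope here.

```python
def client_roles(token_claims, client_id):
    """Return the roles listed for `client_id` under `resource_access`."""
    return token_claims.get("resource_access", {}).get(client_id, {}).get("roles", [])


def has_required_role(token_claims, client_id, required_roles):
    """True if the token carries at least one required role (exact, case-sensitive match)."""
    return any(role in required_roles for role in client_roles(token_claims, client_id))


# Claims matching the resource_access example in this section.
claims = {"resource_access": {"app": {"roles": ["reader"]}}}

can_read = has_required_role(claims, "app", {"reader"})
wrong_case = has_required_role(claims, "app", {"Reader"})
```

The second check fails because `Reader` and `reader` are different strings, which mirrors the case-sensitivity note above.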

An example of a Feast OIDC authorization configuration on the server side is the following:

project: my-project
auth:
  type: oidc
  client_id: _CLIENT_ID__
  auth_discovery_url: _OIDC_SERVER_URL_/realms/master/.well-known/openid-configuration
...

For the client configuration, the username, password and client_secret settings must be added to specify the current user:

auth:
  type: oidc
  ...
  username: _USERNAME_
  password: _PASSWORD_
  client_secret: _CLIENT_SECRET__

Below is an example of a full Feast OIDC client auth configuration:

project: my-project
auth:
  type: oidc
  client_id: test_client_id
  client_secret: test_client_secret
  username: test_user_name
  password: test_password
  auth_discovery_url: http://localhost:8080/realms/master/.well-known/openid-configuration

Kubernetes RBAC Authorization

With Kubernetes RBAC Authorization, the client uses the service account token as the authorization bearer token, and the server fetches the associated roles from the Kubernetes RBAC resources. Feast supports advanced authorization by extracting user groups and namespaces from Kubernetes tokens, enabling fine-grained access control beyond simple role matching. This is achieved by leveraging Kubernetes Token Access Review, which allows Feast to determine the groups and namespaces associated with a user or service account.

An example of Kubernetes RBAC authorization configuration is the following:

NOTE: This configuration will only work if you deploy Feast on OpenShift or a Kubernetes platform.

```yaml
project: my-project
auth:
  type: kubernetes
  user_token: # Optional; otherwise the service account token or the env var is used to get the token
...
```

In case the client cannot run on the same cluster as the servers, the client token can be injected using the LOCAL_K8S_TOKEN environment variable on the client side. The value must refer to the token of a service account created on the servers cluster and linked to the desired RBAC roles/groups/namespaces.
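
A client-side token lookup along these lines can be sketched as follows. `resolve_client_token` is a hypothetical helper, and the precedence shown (explicit `user_token`, then the mounted service account token, then `LOCAL_K8S_TOKEN`) is an assumption made for illustration, not a description of Feast's actual lookup order. The token file path is the standard in-cluster location that Kubernetes mounts into pods.

```python
import os

# Standard in-cluster location of a pod's service account token.
SERVICE_ACCOUNT_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def resolve_client_token(user_token=None, env=None, token_path=SERVICE_ACCOUNT_TOKEN_PATH):
    """Pick a bearer token: explicit config value, then the mounted service
    account token, then the LOCAL_K8S_TOKEN environment variable."""
    env = os.environ if env is None else env
    if user_token:
        return user_token
    if os.path.isfile(token_path):
        with open(token_path, encoding="utf-8") as token_file:
            return token_file.read().strip()
    token = env.get("LOCAL_K8S_TOKEN")
    if not token:
        raise RuntimeError("No Kubernetes token found for the Feast client")
    return token


# Outside the cluster there is no mounted token, so the env var is used.
token = resolve_client_token(
    env={"LOCAL_K8S_TOKEN": "example-token"},
    token_path="/nonexistent/serviceaccount/token",
)
authorization_header = {"Authorization": f"Bearer {token}"}
```

Passing `env` and `token_path` as parameters keeps the lookup easy to exercise without touching the real process environment.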

More details can be found in the Setting up Kubernetes doc.

OpenTelemetry Integration

The OpenTelemetry integration in Feast provides comprehensive monitoring and observability capabilities for your feature serving infrastructure. This component enables you to track key metrics, traces, and logs from your Feast deployment.

Motivation

Monitoring and observability are critical for production machine learning systems. The OpenTelemetry integration addresses these needs by:

Performance Monitoring: Track CPU and memory usage of feature servers
Operational Insights: Collect metrics to understand system behavior and performance
Troubleshooting: Enable effective debugging through distributed tracing
Resource Optimization: Monitor resource utilization to optimize deployments
Production Readiness: Provide enterprise-grade observability capabilities

Architecture

The OpenTelemetry integration in Feast consists of several components working together:

OpenTelemetry Collector: Receives, processes, and exports telemetry data
Prometheus Integration: Enables metrics collection and monitoring
Instrumentation: Automatic Python instrumentation for tracking metrics
Exporters: Components that send telemetry data to monitoring systems

Key Features

Automated Instrumentation: Python auto-instrumentation for comprehensive metric collection
Metric Collection: Track key performance indicators including:
- Memory usage
- CPU utilization
- Request latencies
- Feature retrieval statistics
Flexible Configuration: Customizable metric collection and export settings
Kubernetes Integration: Native support for Kubernetes deployments
Prometheus Compatibility: Integration with Prometheus for metrics visualization

Setup and Configuration

To add monitoring to the Feast Feature Server, follow these steps:

1. Deploy Prometheus Operator

Follow the to install the operator.

2. Deploy OpenTelemetry Operator

Before installing the OpenTelemetry Operator:

Install cert-manager
Validate that the pods are running
Apply the OpenTelemetry operator:

For additional installation steps, refer to the .

3. Configure OpenTelemetry Collector

Add the OpenTelemetry Collector configuration under the metrics section in your values.yaml file:

4. Add Instrumentation Configuration

Add the following annotations and environment variables to your deployment.yaml:

5. Add Metric Checks

Add metric checks to all manifests and deployment files:

6. Add Required Manifests

Add the following components to your chart:

Instrumentation
OpenTelemetryCollector
ServiceMonitors
Prometheus Instance
RBAC rules

7. Deploy Feast

Deploy Feast with metrics enabled:

Usage

To enable OpenTelemetry monitoring in your Feast deployment:

Set metrics.enabled=true in your Helm values
Configure the OpenTelemetry Collector endpoint
Deploy with proper annotations and environment variables

Example configuration:

Monitoring

Once configured, you can monitor various metrics including:

feast_feature_server_memory_usage: Memory utilization of the feature server
feast_feature_server_cpu_usage: CPU usage statistics
Additional custom metrics based on your configuration

These metrics can be visualized using Prometheus and other compatible monitoring tools.

Third party integrations

We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.

Don't see your offline store or online store of choice here? Check out our guides to make a custom one!

Adding a new offline store
Adding a new online store

Integrations

See Functionality and Roadmap

Standards

In order for a plugin integration to be highlighted, it must meet the following requirements:

The plugin must have tests. Ideally it would use the Feast universal tests (see this guide for an example), but custom tests are fine.
The plugin must have some basic documentation on how it should be used.
The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).

In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:

The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.
There is documentation and a tutorial on how to use the integration.
The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.
If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.

Tutorials

Sample use-case tutorials

These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.

Driver ranking

Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.

In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.

This tutorial guides you on how to use Feast with . You will learn how to:

Train a model locally (on your laptop) using data from
Test the model for online inference using (for fast iteration)
Test the model for online inference using (for production use)

Try it and let us know what you think!

Fraud detection on GCP

A common use case in machine learning, this tutorial is an end-to-end, production-ready fraud prediction system. It predicts in real-time whether a transaction made by a user is fraudulent.

Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our end-to-end example will perform the following workflows:

Computing and backfilling feature data from raw data
Building point-in-time correct training datasets from feature data and training a model
Making online predictions from feature data

Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):

Real-time credit scoring on AWS

Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.

In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.

This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.

Real-time Credit Scoring Example

This end-to-end tutorial will take you through the following steps:

Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features
Deploying Redshift as the interface Feast uses to build training datasets
Registering your features with Feast and configuring DynamoDB for online serving
Building a training dataset with Feast to train your credit scoring model
Loading feature values from S3 into DynamoDB
Making online predictions with your credit scoring model using features from DynamoDB

Building streaming features

Feast supports registering streaming feature views and Kafka and Kinesis streaming sources. It also provides an interface for stream processing called the Stream Processor. An example Kafka/Spark StreamProcessor is implemented in the contrib folder. For more details, please see the RFC.

Please see here for a tutorial on how to build a versioned streaming pipeline that registers your transformations, features, and data sources in Feast.

RAG Fine Tuning with Feast and Milvus

Introduction

This example notebook provides a step-by-step demonstration of building and using a RAG system with Feast and the custom FeastRagRetriever. The notebook walks through:

Data Preparation
- Loads a subset of the Wikipedia DPR dataset (1% of training data)
- Implements text chunking with configurable chunk size and overlap
- Processes text into manageable passages with unique IDs
Embedding Generation
- Uses all-MiniLM-L6-v2 sentence transformer model
- Generates 384-dimensional embeddings for text passages
- Demonstrates batch processing with GPU support
Feature Store Setup
- Creates a Parquet file as the historical data source
- Configures Feast with the feature repository
- Demonstrates writing embeddings from data source to Milvus online store which can be used for model training later
RAG System Implementation
- Embedding Model: all-MiniLM-L6-v2 (configurable)
- Generator Model: granite-3.2-2b-instruct (configurable)
- Vector Store: Custom implementation with Feast integration
- Retriever: Custom implementation extending HuggingFace's RagRetriever
Query Demonstration
- Perform inference with retrieved context

Requirements

A Kubernetes cluster with:
- GPU nodes available (for model inference)
- At least 200GB of storage
- A standalone Milvus deployment. See example here.

Running the example

Clone this repository (https://github.com/feast-dev/feast.git) and navigate to the examples/rag-retriever directory. Here you will find the following files:

feature_repo/feature_store.yaml This is the core configuration file for the RAG project's feature store, configuring a Milvus online store on a local provider.
- In order to configure Milvus you should:
  - Update feature_store.yaml with your Milvus connection details:
    host
    port (default: 19530)
    credentials (if required)
feature_repo/ragproject_repo.py This is the Feast feature repository configuration that defines the schema and data source for Wikipedia passage embeddings.
rag_feast.ipynb This is a notebook demonstrating the implementation of a RAG system using Feast. The notebook provides:
- A complete end-to-end example of building a RAG system with:
  - Data preparation using the Wiki DPR dataset
  - Text chunking and preprocessing
  - Vector embedding generation using sentence-transformers
  - Integration with Milvus vector store
  - Inference utilising a custom RagRetriever: FeastRagRetriever
- Uses all-MiniLM-L6-v2 for generating embeddings
- Implements granite-3.2-2b-instruct as the generator model

Open rag_feast.ipynb and follow the steps in the notebook to run the example.

FeastRagRetriever Low Level Design

Helpful Information

Ensure your Milvus instance is properly configured and running
Vector dimensions and similarity metrics can be adjusted in the feature store configuration
The example uses Wikipedia data, but the system can be adapted for other datasets

MCP - AI Agent Example

This example demonstrates how to enable MCP (Model Context Protocol) support in Feast, allowing AI agents and applications to interact with your features through standardized MCP interfaces.

Prerequisites

Python 3.8+
Feast installed
FastAPI MCP library

Installation

Install Feast with MCP support:

pip install feast[mcp]

Alternatively, you can install the dependencies separately:

pip install feast
pip install fastapi_mcp

Setup

Navigate to this example directory within your cloned Feast repository:

cd examples/mcp_feature_store

Initialize a Feast repository in this directory. We'll use the existing feature_store.yaml that's already configured for MCP:

feast init .

This will create a data subdirectory and a feature_repo subdirectory if they don't exist, and will use the feature_store.yaml present in the current directory (examples/mcp_feature_store).

Apply the feature store configuration:

cd feature_repo 
feast apply
cd .. # Go back to examples/mcp_feature_store for the next steps

Starting the MCP-Enabled Feature Server

Start the Feast feature server with MCP support:

feast serve --host 0.0.0.0 --port 6566

If MCP is properly configured, you should see a log message indicating that MCP support has been enabled:

INFO:feast.feature_server:MCP support has been enabled for the Feast feature server

Available MCP Tools

The fastapi_mcp integration automatically exposes your Feast feature server's FastAPI endpoints as MCP tools. This means AI assistants can:

Call /get-online-features to retrieve features from the feature store
Use /health to check server status

Configuration Details

The key configuration that enables MCP support:

feature_server:
    type: mcp                    # Use MCP feature server type
    enabled: true               # Enable feature server
    mcp_enabled: true           # Enable MCP protocol support
    mcp_server_name: "feast-feature-store"
    mcp_server_version: "1.0.0"

How-to Guides

Running Feast with Snowflake/GCP/AWS

Install Feast

Install Feast using :

Install Feast with Snowflake dependencies (required when using Snowflake):

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):

Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See for a detailed explanation of feature repositories.

The easiest way to create a new feature repository is to use the feast init command:

The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:

Enter the directory:

You can now use this feature repository for development. You can try the following:

Run feast apply to apply these definitions to Feast.
Edit the example feature definitions in example.py and run feast apply again to change feature definitions.
Initialize a git repository in the same directory and check the feature repository into version control.

Deploy a feature store

The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.

Here we'll be using the example repository we created in the previous guide, Create a feature repository. You can re-create it by running feast init in a new directory.

Deploying

To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:

feast apply

# Processing example.py as example
# Done!

Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.

At this point, no data has been materialized to your online store. Feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See Load data into the online store for more details.

Cleaning up

If you need to clean up the infrastructure created by feast apply, use the teardown command.

Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!

feast teardown


Build a training dataset

Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
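The point-in-time join described above can be illustrated with a minimal, standard-library sketch. This is not how Feast implements the join; the function name point_in_time_join and the sample rows are hypothetical and only show the semantics: for each entity row, take the latest feature value whose timestamp is not later than the row's event_timestamp.

```python
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each entity row the latest feature row at or before its timestamp."""
    joined = []
    for entity in entity_rows:
        # Candidate feature rows: same driver, and not from the future.
        candidates = [
            f
            for f in feature_rows
            if f["driver_id"] == entity["driver_id"]
            and f["event_timestamp"] <= entity["event_timestamp"]
        ]
        latest = max(candidates, key=lambda f: f["event_timestamp"], default=None)
        joined.append({**entity, "conv_rate": latest["conv_rate"] if latest else None})
    return joined


# Hypothetical feature history for one driver.
feature_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 8), "conv_rate": 0.5},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 10), "conv_rate": 0.7},
    {"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 12), "conv_rate": 0.9},
]
entity_rows = [{"driver_id": 1001, "event_timestamp": datetime(2021, 4, 12, 11)}]

training_rows = point_in_time_join(entity_rows, feature_rows)
print(training_rows[0]["conv_rate"])  # 0.7: the 12:00 value lies in the future and is ignored
```

Because the 12:00 value is never visible to an 11:00 entity row, future feature values cannot leak into the training data.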

Retrieving historical features

1. Register your feature views

Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.

2. Define feature references

Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature views. The only requirements are that the feature views behind the feature references share the same entity (or composite entity) and that they are located in the same offline store.

3. Create an entity dataframe

An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature views onto. All entities found in feature views that are being joined onto the entity dataframe must be present as columns on the entity dataframe.

It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.

Pandas:

In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.
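A minimal sketch of such a single-row entity dataframe follows. The driver ID and timestamp are illustrative values, and the helper build_entity_df is a hypothetical name; Pandas is imported inside the helper so that the rows themselves can be inspected without it.

```python
from datetime import datetime, timezone

# One row per (entity, timestamp) pair for which features should be retrieved.
entity_rows = [
    {
        "driver_id": 1001,
        "event_timestamp": datetime(2021, 4, 12, 10, 59, 42, tzinfo=timezone.utc),
    },
]


def build_entity_df(rows):
    """Convert plain rows into the Pandas dataframe that features are joined onto."""
    import pandas as pd  # requires Pandas to be installed

    return pd.DataFrame(rows)
```

Calling build_entity_df(entity_rows) yields a dataframe with the driver_id entity column and the event_timestamp column.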

SQL (Alternative):

Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
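As a sketch, a SQL-based entity dataframe is simply a query string. The table name below is a placeholder, not a real dataset; the query only needs to return the entity column(s) and an event_timestamp column.

```python
# Hypothetical BigQuery table holding the entities and timestamps to join onto.
ENTITY_TABLE = "my_gcp_project.my_dataset.driver_events"

# The query must return the entity column(s) and an event_timestamp column.
entity_df = f"""
SELECT
    driver_id,
    event_timestamp
FROM `{ENTITY_TABLE}`
"""
```

The string is passed as the entity_df argument of get_historical_features() in place of a Pandas dataframe.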

4. Launch historical retrieval

Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().
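The retrieval step can be sketched as follows. The feature references reuse the driver_hourly_stats names that appear elsewhere in this documentation, the repository path is whatever your project uses, and build_training_df is a hypothetical helper name.

```python
FEATURE_REFS = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]


def build_training_df(repo_path, entity_df, features):
    """Run a point-in-time join and return the result as a Pandas dataframe."""
    from feast import FeatureStore  # requires Feast to be installed

    store = FeatureStore(repo_path=repo_path)
    # get_historical_features returns a retrieval job, not the data itself.
    job = store.get_historical_features(entity_df=entity_df, features=features)
    # to_df() converts the job's result into a Pandas dataframe.
    return job.to_df()
```

A call such as build_training_df("path/to/feature/repo", entity_df, FEATURE_REFS) returns the entity dataframe with one extra column per requested feature.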

Load data into the online store

Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.

Materializing features

1. Register feature views

Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.

2.a Materialize

The materialize command allows users to materialize features over a specific historical time range into the online store.

The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.

It is also possible to materialize for specific feature views by using the -v / --views argument.

The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
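The same operation is available from the Python SDK through FeatureStore.materialize. The sketch below assumes a scheduler supplies the time range on each run; materialize_window and the example dates are illustrative.

```python
from datetime import datetime


def materialize_window(repo_path, start_date, end_date, feature_views=None):
    """Load the latest values within [start_date, end_date] into the online store."""
    if start_date >= end_date:
        raise ValueError("start_date must be earlier than end_date")

    from feast import FeatureStore  # requires Feast to be installed

    store = FeatureStore(repo_path=repo_path)
    # feature_views=None materializes every feature view;
    # pass a list of feature view names to restrict the run.
    store.materialize(
        start_date=start_date,
        end_date=end_date,
        feature_views=feature_views,
    )


# Example window; a scheduler is expected to supply these values on each run.
START = datetime(2021, 4, 7)
END = datetime(2021, 4, 8)
```

Because the call is stateless, running it twice with the same window reloads the same range.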

2.b Materialize Incremental (Alternative)

For simplicity, Feast also provides a materialize-incremental command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.

The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).

The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.

Unlike materialize, materialize-incremental automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.

Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e., different feature views may have different periods that have been materialized into the online store.
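The bookkeeping described above can be sketched in a few lines of plain Python. This is an illustration of the described behaviour, not Feast's implementation; next_window and both dictionaries are hypothetical names.

```python
from datetime import datetime


def next_window(last_end_by_view, oldest_by_view, view, end_time):
    """Return the (start, end) range an incremental run would load for one feature view."""
    # First run: start from the oldest timestamp in the view's data source.
    # Later runs: start from the end time recorded by the previous run.
    start_time = last_end_by_view.get(view, oldest_by_view[view])
    last_end_by_view[view] = end_time  # tracked per feature view, not globally
    return start_time, end_time


oldest = {"driver_hourly_stats": datetime(2021, 1, 1)}
state = {}

first = next_window(state, oldest, "driver_hourly_stats", datetime(2021, 4, 8))
second = next_window(state, oldest, "driver_hourly_stats", datetime(2021, 4, 9))
```

Here first covers 2021-01-01 to 2021-04-08 and second covers only 2021-04-08 to 2021-04-09. In the Python SDK, the corresponding call is FeatureStore.materialize_incremental(end_date=...).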

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Online stores only maintain the current state of features, i.e., the latest feature values. No historical data is stored or served.

Retrieving online features

1. Ensure that feature values have been loaded into the online store

Please ensure that you have materialized (loaded) your feature values into the online store before starting.

2. Define feature references

Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.

features = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate"
]

3. Read online features

Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/feature/repo")
online_features = fs.get_online_features(
    features=features,
    entity_rows=[
        # {join_key: entity_value, ...}
        {"driver_id": 1001},
        {"driver_id": 1002}]
).to_dict()

{
   "driver_hourly_stats__acc_rate":[
      0.2897740304470062,
      0.6447265148162842
   ],
   "driver_hourly_stats__conv_rate":[
      0.6508077383041382,
      0.14802511036396027
   ],
   "driver_id":[
      1001,
      1002
   ]
}

Scaling Feast

Overview

Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap such components with more performant components, at the cost of Feast depending on additional infrastructure.

Scaling Feast Registry

The default Feast registry is a file-based registry. Any changes to the feature repo, or materializing data into the online store, results in a mutation to the registry.

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or it bottlenecks writes to the registry, since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).

The recommended solution in this case is to use the SQL based registry, which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL, Postgres, etc).

Scaling Materialization

The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale for large data sets, since it is executed in a single process.

Feast supports pluggable Compute Engines that allow the materialization process to be scaled up. Aside from the local process, Feast supports a Lambda-based materialization engine and a Bytewax-based materialization engine.

Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.

Customizing Feast

Feast is highly pluggable and configurable:

One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built in options. See reference documentation for details.
The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them.

Below are some guides on how to add new custom components:

Adding a custom batch materialization engine

Reference

Type System

Motivation

Feast uses an internal type system to provide guarantees on training and serving data. Feast currently supports eight primitive types - INT32, INT64, FLOAT32, FLOAT64, STRING, BYTES, BOOL, and UNIX_TIMESTAMP - and the corresponding array types. Null types are not supported, although the UNIX_TIMESTAMP type is nullable. The type system is controlled by Value.proto in protobuf and by types.py in Python. Type conversion logic can be found in type_map.py.

Examples

Feature inference

During feast apply, Feast runs schema inference on the data sources underlying feature views. For example, if the schema parameter is not specified for a feature view, Feast will examine the schema of the underlying data source to determine the event timestamp column, feature columns, and entity columns. Each of these columns must be associated with a Feast type, which requires conversion from the data source type system to the Feast type system.

The feature inference logic calls _infer_features_and_entities.
_infer_features_and_entities calls source_datatype_to_feast_value_type.
source_datatype_to_feast_value_type calls the appropriate method in type_map.py. For example, if a SnowflakeSource is being examined, snowflake_python_type_to_feast_value_type from type_map.py will be called.
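The dispatch pattern in the steps above can be sketched as a lookup from source-specific type names to Feast type names. The tables and the function infer_feast_types below are hypothetical and deliberately incomplete; the real conversion logic lives in type_map.py.

```python
# Illustrative conversion tables, one per kind of data source (not Feast's actual mappings).
CONVERSION_TABLES = {
    "file": {
        "int32": "INT32",
        "int64": "INT64",
        "float": "FLOAT32",
        "double": "FLOAT64",
        "string": "STRING",
        "binary": "BYTES",
        "bool": "BOOL",
        "timestamp": "UNIX_TIMESTAMP",
    },
    "sql": {
        "INTEGER": "INT32",
        "BIGINT": "INT64",
        "VARCHAR": "STRING",
        "BOOLEAN": "BOOL",
        "TIMESTAMP": "UNIX_TIMESTAMP",
    },
}


def infer_feast_types(source_kind, source_schema):
    """Map each column's source type to a Feast type, failing loudly on unknown types."""
    table = CONVERSION_TABLES[source_kind]
    inferred = {}
    for column, source_type in source_schema.items():
        if source_type not in table:
            raise ValueError(
                f"No Feast type for {source_kind} type {source_type!r} (column {column!r})"
            )
        inferred[column] = table[source_type]
    return inferred


schema = infer_feast_types(
    "file",
    {"driver_id": "int64", "conv_rate": "float", "event_timestamp": "timestamp"},
)
```

Keeping one table per source kind mirrors the idea that each data source has its own conversion function into the shared Feast type system.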

Materialization

Feast serves feature values as Value proto objects, which have a type corresponding to Feast types. Thus Feast must materialize feature values into the online store as Value proto objects.

The local materialization engine first pulls the latest historical features and converts them to pyarrow.
Then it calls _convert_arrow_to_proto to convert the pyarrow table to proto format.
This calls python_values_to_proto_values in type_map.py to perform the type conversion.

Historical feature retrieval

The Feast type system is typically not necessary when retrieving historical features. A call to get_historical_features will return a RetrievalJob object, which allows the user to export the results to one of several possible locations: a Pandas dataframe, a pyarrow table, a data lake (e.g. S3 or GCS), or the offline store (e.g. a Snowflake table). In all of these cases, the type conversion is handled natively by the offline store. For example, a BigQuery query exposes a to_dataframe method that will automatically convert the result to a dataframe, without requiring any conversions within Feast.

Feature serving

As mentioned above in the section on materialization, Feast persists feature values into the online store as Value proto objects. A call to get_online_features will return an OnlineResponse object, which essentially wraps a bunch of Value protos with some metadata. The OnlineResponse object can then be converted into a Python dictionary, which calls feast_value_type_to_python_type from type_map.py, a utility that converts the Feast internal types to Python native types.

Data sources

Please see for a conceptual explanation of data sources.

Overview

Functionality

In Feast, each batch data source is associated with corresponding offline stores. For example, a SnowflakeSource can only be processed by the Snowflake offline store, while a FileSource can be processed by both File and DuckDB offline stores. Otherwise, the primary difference between batch data sources is the set of supported types. Feast has an internal type system, and aims to support eight primitive types (bytes, string, int32, int64, float32, float64, bool, and timestamp) along with the corresponding array types. However, not every batch data source supports all of these types.

For more details on the Feast type system, see here.

Functionality Matrix

There are currently four core batch data source implementations: FileSource, BigQuerySource, SnowflakeSource, and RedshiftSource. There are several additional implementations contributed by the Feast community (PostgreSQLSource, SparkSource, and TrinoSource), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific data source can be found here.

Below is a matrix indicating which data sources support which types.

Data sources compared: File, BigQuery, Snowflake, Redshift, Postgres, Spark, Trino, Couchbase

Types compared: bytes, string, int32, int64, float32, float64, bool, timestamp, array types

File

Description

File data sources are files on disk or on S3. Currently only Parquet and Delta formats are supported.

Example
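A minimal sketch of a Parquet-backed file source follows. The name, path, and timestamp column are placeholders, and make_file_source is a hypothetical helper; Feast is imported inside it so that the arguments can be inspected without Feast installed.

```python
# Arguments for a Parquet-backed file source; the name and path are placeholders.
FILE_SOURCE_ARGS = {
    "name": "driver_stats_source",
    "path": "data/driver_stats.parquet",
    "timestamp_field": "event_timestamp",
}


def make_file_source(**overrides):
    """Build a FileSource from the default arguments plus any overrides."""
    from feast import FileSource  # requires Feast to be installed

    return FileSource(**{**FILE_SOURCE_ARGS, **overrides})
```

For instance, make_file_source(path="s3://my-bucket/driver_stats.parquet") would point the same source at a file on S3; the bucket name is hypothetical.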

The full set of configuration options is available here.

Supported Types

File data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Snowflake

Description

Snowflake data sources are Snowflake tables or views. These can be specified either by a table reference or a SQL query.

Examples

Using a table reference:

from feast import SnowflakeSource

my_snowflake_source = SnowflakeSource(
    database="FEAST",
    schema="PUBLIC",
    table="FEATURE_TABLE",
)

Using a query:

from feast import SnowflakeSource

my_snowflake_source = SnowflakeSource(
    query="""
    SELECT
        timestamp_column AS "ts",
        "created",
        "f1",
        "f2"
    FROM
        "FEAST"."PUBLIC"."FEATURE_TABLE"
    """,
)

Be careful about how Snowflake handles table and column name conventions. In particular, you can read more about quote identifiers here.

The full set of configuration options is available here.

Supported Types

Snowflake data sources support all eight primitive types. Array types are also supported but not with type inference. For a comparison against other batch data sources, please see here.

BigQuery

Description

BigQuery data sources are BigQuery tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Examples

Using a table reference:

from feast import BigQuerySource

my_bigquery_source = BigQuerySource(
    table="gcp_project:bq_dataset.bq_table",
)

Using a query:

from feast import BigQuerySource

my_bigquery_source = BigQuerySource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM `my_project.my_dataset.my_features`",
)

The full set of configuration options is available here.

Supported Types

BigQuery data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Redshift

Description

Redshift data sources are Redshift tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Examples

Using a table name:

from feast import RedshiftSource

my_redshift_source = RedshiftSource(
    table="redshift_table",
)

Using a query:

from feast import RedshiftSource

my_redshift_source = RedshiftSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM redshift_table",
)

The full set of configuration options is available here.

Supported Types

Redshift data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.

Kafka

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Description

Kafka sources allow users to register Kafka streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kafka. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store through FeatureStore.write_to_online_store. An example of how to launch such a job with Spark can be found here. Feast can also write to the offline store through write_to_offline_store.

Kafka sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kafka streams to a batch data source such as a data warehouse table. When using a Kafka source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.
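A user-managed ingestion job typically ends in a call like the one sketched below. The helper push_stream_batch and its required_columns check are hypothetical; only FeatureStore.write_to_online_store and FeatureStore.write_to_offline_store are Feast calls, and batch_df is assumed to be a Pandas dataframe produced by your stream processor.

```python
def push_stream_batch(
    repo_path,
    feature_view_name,
    batch_df,
    also_offline=False,
    required_columns=("driver_id", "event_timestamp"),
):
    """Write one micro-batch of transformed stream records into Feast."""
    # Fail early if the batch lacks the join key or the event timestamp.
    missing = [c for c in required_columns if c not in batch_df.columns]
    if missing:
        raise ValueError(f"batch is missing required columns: {missing}")

    from feast import FeatureStore  # requires Feast to be installed

    store = FeatureStore(repo_path=repo_path)
    # Make the freshest values available for online serving.
    store.write_to_online_store(feature_view_name, batch_df)
    if also_offline:
        # Optionally keep a historical log for training.
        store.write_to_offline_store(feature_view_name, batch_df)
```

The job that consumes the Kafka topic would call this once per micro-batch, passing the name of the stream feature view that the records belong to.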

Stream sources

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

Raw events come in (stream 1)
Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)
Write stream 2 values to an offline store as a historical log for training (optional)
Write stream 2 values to an online store for low latency feature serving
Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance

Example

Defining a Kafka source

Note that the Kafka source has a batch source.

from datetime import timedelta

from feast import Field, FileSource, KafkaSource, stream_feature_view
from feast.data_format import JsonFormat
from feast.types import Float32

driver_stats_batch_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_stream_source = KafkaSource(
    name="driver_stats_stream",
    kafka_bootstrap_servers="localhost:9092",
    topic="drivers",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch_source,
    message_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double, acc_rate double, created timestamp"
    ),
    watermark_delay_threshold=timedelta(minutes=5),
)

Using the Kafka source in a stream feature view

The Kafka source can be used in a stream feature view.

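from pyspark.sql import DataFrame  # needed for the type hint below

# `driver` is assumed to be an Entity defined elsewhere in the feature repository.
@stream_feature_view(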
    entities=[driver],
    ttl=timedelta(seconds=8640000000),
    mode="spark",
    schema=[
        Field(name="conv_percentage", dtype=Float32),
        Field(name="acc_percentage", dtype=Float32),
    ],
    timestamp_field="event_timestamp",
    online=True,
    source=driver_stats_stream_source,
)
def driver_hourly_stats_stream(df: DataFrame):
    from pyspark.sql.functions import col

    return (
        df.withColumn("conv_percentage", col("conv_rate") * 100.0)
        .withColumn("acc_percentage", col("acc_rate") * 100.0)
        .drop("conv_rate", "acc_rate")
    )

Ingesting data

See here for an example of how to ingest data from a Kafka source into Feast.

Kinesis

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Description

Kinesis sources allow users to register Kinesis streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kinesis. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store through FeatureStore.write_to_online_store. An example of how to launch such a job with Spark to ingest from Kafka can be found here; by using a different plugin, the example can be adapted to Kinesis. Feast can also write to the offline store through write_to_offline_store.

Kinesis sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kinesis streams to a batch data source such as a data warehouse table. When using a Kinesis source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.

Stream sources

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

Raw events come in (stream 1)
Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)
Write stream 2 values to an offline store as a historical log for training (optional)
Write stream 2 values to an online store for low latency feature serving
Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance

Example

Defining a Kinesis source

Note that the Kinesis source has a batch source.

from datetime import timedelta

from feast import Field, FileSource, KinesisSource, stream_feature_view
from feast.data_format import JsonFormat
from feast.types import Float32

driver_stats_batch_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_stream_source = KinesisSource(
    name="driver_stats_stream",
    stream_name="drivers",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch_source,
    record_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double, acc_rate double, created timestamp"
    ),
    watermark_delay_threshold=timedelta(minutes=5),
)

Using the Kinesis source in a stream feature view

The Kinesis source can be used in a stream feature view.

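from pyspark.sql import DataFrame  # needed for the type hint below

# `driver` is assumed to be an Entity defined elsewhere in the feature repository.
@stream_feature_view(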
    entities=[driver],
    ttl=timedelta(seconds=8640000000),
    mode="spark",
    schema=[
        Field(name="conv_percentage", dtype=Float32),
        Field(name="acc_percentage", dtype=Float32),
    ],
    timestamp_field="event_timestamp",
    online=True,
    source=driver_stats_stream_source,
)
def driver_hourly_stats_stream(df: DataFrame):
    from pyspark.sql.functions import col

    return (
        df.withColumn("conv_percentage", col("conv_rate") * 100.0)
        .withColumn("acc_percentage", col("acc_rate") * 100.0)
        .drop("conv_rate", "acc_rate")
    )

Ingesting data

See here for an example of how to ingest data from a Kafka source into Feast. The approach used in the tutorial can be easily adapted to work for Kinesis as well.

Spark (contrib)

Description

Spark data sources are tables or files that can be loaded from some Spark store (e.g. Hive or in-memory). They can also be specified by a SQL query.

New in Feast: SparkSource now supports advanced table formats including Apache Iceberg, Delta Lake, and Apache Hudi, enabling ACID transactions, time travel, and schema evolution capabilities. See the Table Formats guide for detailed documentation.

Disclaimer

The Spark data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Basic Examples

Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    table="FEATURE_TABLE",
)

Using a query:

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM spark_table",
)

Using a file reference:

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

Table Format Examples

SparkSource supports advanced table formats for modern data lakehouse architectures. For detailed documentation, configuration options, and best practices, see the Table Formats guide.

Apache Iceberg

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.table_format import IcebergFormat

iceberg_format = IcebergFormat(
    catalog="my_catalog",
    namespace="my_database"
)

my_spark_source = SparkSource(
    name="user_features",
    path="my_catalog.my_database.user_table",
    table_format=iceberg_format,
    timestamp_field="event_timestamp"
)

Delta Lake

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.table_format import DeltaFormat

delta_format = DeltaFormat()

my_spark_source = SparkSource(
    name="transaction_features",
    path="s3://my-bucket/delta-tables/transactions",
    table_format=delta_format,
    timestamp_field="transaction_timestamp"
)

Apache Hudi

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.table_format import HudiFormat

hudi_format = HudiFormat(
    table_type="COPY_ON_WRITE",
    record_key="user_id",
    precombine_field="updated_at"
)

my_spark_source = SparkSource(
    name="user_profiles",
    path="s3://my-bucket/hudi-tables/user_profiles",
    table_format=hudi_format,
    timestamp_field="event_timestamp"
)

For advanced configuration including time travel, incremental queries, and performance tuning, see the Table Formats guide.

Configuration Options

The full set of configuration options is available here.

Table Format Options

IcebergFormat: See Table Formats - Iceberg
DeltaFormat: See Table Formats - Delta Lake
HudiFormat: See Table Formats - Hudi

Supported Types

Spark data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

PostgreSQL (contrib)

Description

PostgreSQL data sources are PostgreSQL tables or views. These can be specified either by a table reference or a SQL query.

Disclaimer

The PostgreSQL data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a Postgres source:

from feast.infra.offline_stores.contrib.postgres_offline_store.postgres_source import (
    PostgreSQLSource,
)

driver_stats_source = PostgreSQLSource(
    name="feast_driver_hourly_stats",
    query="SELECT * FROM feast_driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

The full set of configuration options is available here.

Supported Types

PostgreSQL data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Trino (contrib)

Description

Trino data sources are Trino tables or views. These can be specified either by a table reference or a SQL query.

Disclaimer

The Trino data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a Trino source:

from feast.infra.offline_stores.contrib.trino_offline_store.trino_source import (
    TrinoSource,
)

driver_hourly_stats = TrinoSource(
    timestamp_field="event_timestamp",
    table="feast.driver_stats",
    created_timestamp_column="created",
)

The full set of configuration options is available here.

Supported Types

Trino data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Azure Synapse + Azure SQL (contrib)

Description

MsSQL data sources are Microsoft SQL tables. These can be specified either by a table reference or a SQL query.

Disclaimer

The MsSQL data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a MsSQL source:

from feast.infra.offline_stores.contrib.mssql_offline_store.mssqlserver_source import (
    MsSqlServerSource,
)

driver_hourly_table = "driver_hourly"

driver_source = MsSqlServerSource(
    table_ref=driver_hourly_table,
    event_timestamp_column="datetime",
    created_timestamp_column="created",
)

Couchbase (contrib)

Description

Couchbase Columnar data sources are Couchbase Capella Columnar collections that can be used as a source for feature data. Note that Couchbase Columnar is available through Couchbase Capella.

Disclaimer

The Couchbase Columnar data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a Couchbase Columnar source:

from feast.infra.offline_stores.contrib.couchbase_offline_store.couchbase_source import (
    CouchbaseColumnarSource,
)

driver_stats_source = CouchbaseColumnarSource(
    name="driver_hourly_stats_source",
    query="SELECT * FROM Default.Default.`feast_driver_hourly_stats`",
    database="Default",
    scope="Default",
    collection="feast_driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

The full set of configuration options is available here.

Supported Types

Couchbase Capella Columnar data sources support BOOLEAN, STRING, BIGINT, and DOUBLE primitive types. For a comparison against other batch data sources, please see here.

Offline stores

Please see Offline Store for a conceptual explanation of offline stores.

Overview

Functionality

Here are the methods exposed by the OfflineStore interface, along with the core functionality supported by the method:

get_historical_features: point-in-time correct join to retrieve historical features
pull_latest_from_table_or_query: retrieve latest feature values for materialization into the online store
pull_all_from_table_or_query: retrieve a saved dataset
offline_write_batch: persist dataframes to the offline store, primarily for push sources
write_logged_features: persist logged features to the offline store, for feature logging

The first three of these methods all return a RetrievalJob specific to an offline store, such as a SnowflakeRetrievalJob. Here is a list of functionality supported by RetrievalJobs:

export to dataframe
export to arrow table
export to arrow batches (to handle large datasets in memory)
export to SQL
export to data lake (S3, GCS, etc.)
export to data warehouse
export as Spark dataframe
local execution of Python-based on-demand transforms
remote execution of Python-based on-demand transforms
persist results in the offline store
preview the query plan before execution (RetrievalJobs are lazily executed)
read partitioned data
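As a rough illustration of the point-in-time correct join listed for get_historical_features earlier in this section, the sketch below uses plain Python dictionaries in place of dataframes. The function name point_in_time_join, the row layout, and the ttl handling are assumptions made for this example and are not Feast APIs; real offline stores typically push this join down to their query engine.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def point_in_time_join(
    entity_rows: List[Dict],
    feature_rows: List[Dict],
    join_key: str,
    feature_names: List[str],
    ttl: Optional[timedelta] = None,
) -> List[Dict]:
    """Attach to each entity row the newest feature values at or before its timestamp.

    Hypothetical helper for illustration only (not part of Feast).
    """
    joined = []
    for entity in entity_rows:
        entity_ts = entity["event_timestamp"]
        candidates = [
            row
            for row in feature_rows
            if row[join_key] == entity[join_key]
            # Never look into the future: this is what prevents leakage.
            and row["event_timestamp"] <= entity_ts
            # Assumed ttl semantics: ignore feature rows that are too old.
            and (ttl is None or entity_ts - row["event_timestamp"] <= ttl)
        ]
        latest = max(candidates, key=lambda r: r["event_timestamp"], default=None)
        features = {name: (latest[name] if latest else None) for name in feature_names}
        joined.append({**entity, **features})
    return joined


entity_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 1, 12)},
    {"driver_id": 1002, "event_timestamp": datetime(2024, 1, 1, 12)},
]
feature_rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 1, 10), "conv_rate": 0.5},
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 1, 11), "conv_rate": 0.6},
    # Later than the entity timestamp, so it must not be used.
    {"driver_id": 1001, "event_timestamp": datetime(2024, 1, 1, 13), "conv_rate": 0.9},
    # Older than the one-day ttl used below, so it is ignored.
    {"driver_id": 1002, "event_timestamp": datetime(2023, 12, 1), "conv_rate": 0.1},
]

training_rows = point_in_time_join(
    entity_rows, feature_rows, "driver_id", ["conv_rate"], ttl=timedelta(days=1)
)
```

In this sketch, driver 1001 receives the 11:00 value and driver 1002 receives no value, because its only feature row falls outside the assumed ttl window.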

Functionality Matrix

There are currently four core offline store implementations: DaskOfflineStore, BigQueryOfflineStore, SnowflakeOfflineStore, and RedshiftOfflineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOfflineStore, SparkOfflineStore, TrinoOfflineStore, and RayOfflineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific offline store, such as how to configure it in a feature_store.yaml, can be found here.

Below is a matrix indicating which offline stores support which methods.

| | Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | Couchbase | Ray |
| :-------------------------------- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| get_historical_features | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| pull_latest_from_table_or_query | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| pull_all_from_table_or_query | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| offline_write_batch | yes | yes | yes | yes | no | no | no | no | yes |
| write_logged_features | yes | yes | yes | yes | no | no | no | no | yes |

Below is a matrix indicating which RetrievalJobs support what functionality.

| | Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | DuckDB | Couchbase | Ray |
| --------------------------------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| export to dataframe | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| export to arrow table | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| export to arrow batches | no | no | no | yes | no | no | no | no | no | no |
| export to SQL | no | yes | yes | yes | yes | no | yes | no | yes | no |
| export to data lake (S3, GCS, etc.) | no | no | yes | no | yes | no | no | no | yes | yes |
| export to data warehouse | no | yes | yes | yes | yes | no | no | no | yes | no |
| export as Spark dataframe | no | no | yes | no | no | yes | no | no | no | no |
| local execution of Python-based on-demand transforms | yes | yes | yes | yes | yes | no | yes | yes | yes | yes |
| remote execution of Python-based on-demand transforms | no | no | no | no | no | no | no | no | no | no |
| persist results in the offline store | yes | yes | yes | yes | yes | yes | no | yes | yes | yes |
| preview the query plan before execution | yes | yes | yes | yes | yes | yes | yes | no | yes | yes |
| read partitioned data | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |

Dask

Description

The Dask offline store provides support for reading FileSources.

All data is downloaded and joined using Python and therefore may not scale to production workloads.

Example

feature_store.yaml

project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: dask

The full set of configuration options is available in DaskOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the dask offline store.

| | Dask |
| :-- | :-- |
| get_historical_features (point-in-time correct join) | yes |
| pull_latest_from_table_or_query (retrieve latest feature values) | yes |
| pull_all_from_table_or_query (retrieve a saved dataset) | yes |
| offline_write_batch (persist dataframes to offline store) | yes |
| write_logged_features (persist logged features to offline store) | yes |

Below is a matrix indicating which functionality is supported by DaskRetrievalJob.

| | Dask |
| :-- | :-- |
| export to dataframe | yes |
| export to arrow table | yes |
| export to arrow batches | no |
| export to SQL | no |
| export to data lake (S3, GCS, etc.) | no |
| export to data warehouse | no |
| export as Spark dataframe | no |
| local execution of Python-based on-demand transforms | yes |
| remote execution of Python-based on-demand transforms | no |
| persist results in the offline store | yes |
| preview the query plan before execution | yes |
| read partitioned data | yes |

To compare this set of functionality against other offline stores, please see the full functionality matrix.

BigQuery

Description

The BigQuery offline store provides support for reading BigQuerySources.

All joins happen within BigQuery.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. Pandas dataframes will be uploaded to BigQuery as a table (marked for expiration) in order to complete join operations.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[gcp]'. You can get started by then running feast init -t gcp.

Example

The full set of configuration options is available in BigQueryOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the BigQuery offline store.

BigQuery

Below is a matrix indicating which functionality is supported by BigQueryRetrievalJob.

BigQuery

*See for details on proposed solutions for enabling the BigQuery offline store to understand tables that use _PARTITIONTIME as the partition column.

To compare this set of functionality against other offline stores, please see the full functionality matrix.

Spark (contrib)

Description

The Spark offline store provides support for reading SparkSources.

Entity dataframes can be provided as a SQL query, a Pandas dataframe, or a PySpark dataframe. Pandas dataframes will be converted to a Spark dataframe and processed as a temporary view.

Disclaimer

The Spark offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[spark]'. You can get started by then running feast init -t spark.

Example

The full set of configuration options is available in SparkOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Spark offline store.

Spark

Below is a matrix indicating which functionality is supported by SparkRetrievalJob.

Spark

To compare this set of functionality against other offline stores, please see the full functionality matrix.

Clickhouse (contrib)

Description

The Clickhouse offline store provides support for reading ClickhouseSources.

Entity dataframes can be provided as a SQL query or as a Pandas dataframe. Pandas dataframes will be uploaded to Clickhouse as a table (a temporary table by default) in order to complete join operations.

Disclaimer

The Clickhouse offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[clickhouse]'.

Example

Note that use_temporary_tables_for_entity_df is an optional parameter. The full set of configuration options is available in ClickhouseOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Clickhouse offline store.

Clickhouse

Below is a matrix indicating which functionality is supported by ClickhouseRetrievalJob.

Clickhouse

To compare this set of functionality against other offline stores, please see the full functionality matrix.

Adding a new offline store

Overview

Feast makes adding support for a new offline store easy. Developers can simply implement the OfflineStore interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and BigQuery).

In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative of adding support for any new offline store.

The full working code for this guide can be found at feast-dev/feast-custom-offline-store-demo.

The process for using a custom offline store consists of 8 steps:

Defining an OfflineStore class.
Defining an OfflineStoreConfig class.
Defining a RetrievalJob class for this offline store.
Defining a DataSource class for the offline store
Referencing the OfflineStore in a feature repo's feature_store.yaml file.
Testing the OfflineStore class.
Updating dependencies.
Adding documentation.

1. Defining an OfflineStore class

OfflineStore class names must end with the OfflineStore suffix!

Contrib offline stores

New offline stores go in sdk/python/feast/infra/offline_stores/contrib/.

What is a contrib plugin?

Not guaranteed to implement all interface methods
Not guaranteed to be stable.
Should have warnings for users to indicate this is a contrib plugin that is not maintained by the maintainers.

How do I make a contrib plugin an "official" plugin?

To move an offline store plugin out of contrib, you need:

GitHub Actions (i.e., make test-python-integration) is set up to run all tests against the offline store, and the tests pass.
At least two contributors own the plugin (ideally tracked in our OWNERS / CODEOWNERS file).

Define the offline store class

The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.

To fully implement the interface for the offline store, you will need to implement these methods:

pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.
get_historical_features is invoked when reading values from the offline store using the FeatureStore.get_historical_features() method. Typically, this method is used to retrieve features when training ML models.
(optional) offline_write_batch is a method that supports directly pushing a pyarrow table to a feature view. Given a feature view with a specific schema, this function should write the pyarrow table to the batch source defined. More details about the push api can be found here. This method only needs implementation if you want to support the push api in your offline store.
(optional) pull_all_from_table_or_query is a method that pulls all the data from an offline store from a specified start date to a specified end date. This method is only used for SavedDatasets as part of data quality monitoring validation.
(optional) write_logged_features is a method that takes a pyarrow table or a path that points to a parquet file and writes the data to a source defined by LoggingSource and LoggingConfig. This method is only used internally for SavedDatasets.
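To make the materialization path above more concrete, here is a minimal sketch of what pull_latest_from_table_or_query is expected to compute: the most recent row per join key within a time window. It works on plain dictionaries rather than a real store. The helper name pull_latest and the inclusive window bounds are assumptions for illustration, and tie-breaking on a created timestamp column is left out.

```python
from datetime import datetime
from typing import Dict, List


def pull_latest(
    rows: List[Dict],
    join_key_columns: List[str],
    timestamp_field: str,
    start_date: datetime,
    end_date: datetime,
) -> List[Dict]:
    """Return the newest row per join key among rows inside [start_date, end_date].

    Hypothetical helper for illustration only (not part of Feast).
    """
    latest: Dict[tuple, Dict] = {}
    for row in rows:
        ts = row[timestamp_field]
        if not (start_date <= ts <= end_date):
            continue  # outside the materialization window
        key = tuple(row[column] for column in join_key_columns)
        if key not in latest or ts > latest[key][timestamp_field]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 1), "conv_rate": 0.1},
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 3), "conv_rate": 0.3},
    {"driver_id": 2, "event_timestamp": datetime(2024, 1, 2), "conv_rate": 0.2},
    # Falls after end_date, so it is not materialized in this run.
    {"driver_id": 2, "event_timestamp": datetime(2024, 2, 1), "conv_rate": 0.9},
]

latest_rows = pull_latest(
    rows,
    join_key_columns=["driver_id"],
    timestamp_field="event_timestamp",
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 1, 31),
)
```

A real implementation would return a RetrievalJob that runs an equivalent query lazily instead of returning rows directly.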

feast_custom_offline_store/file.py

class CustomFileOfflineStore(OfflineStore):
    # Only prints out runtime warnings once.
    warnings.simplefilter("once", RuntimeWarning)

    def get_historical_features(self,
                                config: RepoConfig,
                                feature_views: List[FeatureView],
                                feature_refs: List[str],
                                entity_df: Union[pd.DataFrame, str],
                                registry: Registry, project: str,
                                full_feature_names: bool = False) -> RetrievalJob:
        """Perform a point-in-time correct join of features onto an entity dataframe (entity key and timestamp). More details about how this should work at https://docs.feast.dev/v/v0.6-branch/user-guide/feature-retrieval#3.-historical-feature-retrieval."""
        print("Getting historical features from my offline store")
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    def pull_latest_from_table_or_query(self,
                                        config: RepoConfig,
                                        data_source: DataSource,
                                        join_key_columns: List[str],
                                        feature_name_columns: List[str],
                                        timestamp_field: str,
                                        created_timestamp_column: Optional[str],
                                        start_date: datetime,
                                        end_date: datetime) -> RetrievalJob:
        """ Pulls data from the offline store for use in materialization."""
        print("Pulling latest features from my offline store")
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    def pull_all_from_table_or_query(
        self,
        config: RepoConfig,
        data_source: DataSource,
        join_key_columns: List[str],
        feature_name_columns: List[str],
        timestamp_field: str,
        start_date: datetime,
        end_date: datetime,
    ) -> RetrievalJob:
        """ Optional method that returns a Retrieval Job for all join key columns, feature name columns, and the event timestamp columns that occur between the start_date and end_date."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    def write_logged_features(
        self,
        config: RepoConfig,
        data: Union[pyarrow.Table, Path],
        source: LoggingSource,
        logging_config: LoggingConfig,
        registry: BaseRegistry,
    ):
        """ Optional method to have Feast support logging your online features."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    def offline_write_batch(
        self,
        config: RepoConfig,
        feature_view: FeatureView,
        table: pyarrow.Table,
        progress: Optional[Callable[[int], Any]],
    ):
        """ Optional method to have Feast support the offline push api for your offline store."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

1.1 Type Mapping

Most offline stores will have to perform some custom mapping of offline store datatypes to feast value types.

The functions to implement here are source_datatype_to_feast_value_type and get_column_names_and_types in your DataSource class.
source_datatype_to_feast_value_type is used to convert your DataSource's datatypes to feast value types.
get_column_names_and_types retrieves the column names and corresponding datasource types.

Add any helper functions for type conversion to sdk/python/feast/type_map.py.

Be sure to implement the type mapping correctly so that Feast can process your feature columns without incorrect casts, which can cause loss of information or incorrect data.
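A type mapping of this kind can be sketched as a lookup table that fails loudly on unknown types instead of guessing a cast. So that the sketch runs without Feast installed, the ValueType enum below is a stand-in that mirrors only a few members of feast.ValueType. The source type names ("bigint", "varchar", and so on) and the function name my_store_type_to_feast_value_type are hypothetical.

```python
from enum import Enum


class ValueType(Enum):
    """Stand-in for a few members of feast.ValueType (assumption for this sketch)."""

    INT64 = "INT64"
    DOUBLE = "DOUBLE"
    STRING = "STRING"
    BOOL = "BOOL"
    UNIX_TIMESTAMP = "UNIX_TIMESTAMP"


# Hypothetical type names reported by an imaginary backing store.
_SOURCE_TO_VALUE_TYPE = {
    "bigint": ValueType.INT64,
    "double": ValueType.DOUBLE,
    "varchar": ValueType.STRING,
    "boolean": ValueType.BOOL,
    "timestamp": ValueType.UNIX_TIMESTAMP,
}


def my_store_type_to_feast_value_type(source_type: str) -> ValueType:
    """Map a source column type to a value type, rejecting unknown types."""
    normalized = source_type.strip().lower()
    try:
        return _SOURCE_TO_VALUE_TYPE[normalized]
    except KeyError:
        # Raising is safer than silently casting to a lossy default.
        raise ValueError(f"Unsupported source type: {source_type!r}") from None


conv_rate_type = my_store_type_to_feast_value_type("DOUBLE")
```

In a real data source, a function like this would be returned from source_datatype_to_feast_value_type and would operate on the actual feast.ValueType enum.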

2. Defining an OfflineStoreConfig class

Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.

To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined here.

The FeastConfigBaseModel is a pydantic class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.

Additionally, the name of the config class must be the same as the OfflineStore class, with the Config suffix.

An example of the config class for the custom file offline store:

feast_custom_offline_store/file.py

class CustomFileOfflineStoreConfig(FeastConfigBaseModel):
    """ Custom offline store config for local (file-based) store """

    type: Literal["feast_custom_offline_store.file.CustomFileOfflineStore"] \
        = "feast_custom_offline_store.file.CustomFileOfflineStore"

    uri: str # URI for your offline store(in this case it would be a path)

This configuration can be specified in the feature_store.yaml as follows:

feature_repo/feature_store.yaml

project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: feast_custom_offline_store.file.CustomFileOfflineStore
    uri: <File URI>
online_store:
    path: data/online_store.db

This configuration information is available to the methods of the OfflineStore via the config: RepoConfig parameter that is passed into the methods of the OfflineStore interface, specifically in the config.offline_store field. The fields in the feature_store.yaml should map directly to your OfflineStoreConfig class, which is detailed above in Section 2.

feast_custom_offline_store/file.py

    def get_historical_features(self,
                                config: RepoConfig,
                                feature_views: List[FeatureView],
                                feature_refs: List[str],
                                entity_df: Union[pd.DataFrame, str],
                                registry: Registry, project: str,
                                full_feature_names: bool = False) -> RetrievalJob:
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        offline_store_config = config.offline_store
        assert isinstance(offline_store_config, CustomFileOfflineStoreConfig)
        store_type = offline_store_config.type

3. Defining a RetrievalJob class

The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob instance, which represents the execution of the actual query against the underlying store.

Custom offline stores may need to implement their own instances of the RetrievalJob interface.

The RetrievalJob interface exposes two methods: to_df and to_arrow. The retrieval job is expected to return the rows read from the offline store as a pandas DataFrame or as an Arrow table, respectively.

Users who want to have their offline store support scalable batch materialization for online use cases (detailed in this RFC) will also need to implement to_remote_storage to distribute the reading and writing of offline store records to blob storage (such as S3). This may be used by a custom Materialization Engine to parallelize the materialization of data by processing it in chunks. If this is not implemented, Feast will default to local materialization (pulling all records into memory to materialize).
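The idea of processing records in chunks, mentioned above, can be pictured with a small generator. This is only a sketch of the chunking pattern; the name iter_chunks is hypothetical, and it does not reflect how any particular Feast materialization engine is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(records: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive lists of at most chunk_size records without loading everything at once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Seven records split into chunks of three; each chunk could be handed to a separate worker.
chunks = list(iter_chunks(range(7), chunk_size=3))
```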

feast_custom_offline_store/file.py

class CustomFileRetrievalJob(RetrievalJob):
    def __init__(self, evaluation_function: Callable):
        """Initialize a lazy historical retrieval job"""

        # The evaluation function executes a stored procedure to compute a historical retrieval.
        self.evaluation_function = evaluation_function

    def to_df(self):
        # Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.
        print("Getting a pandas DataFrame from a File is easy!")
        df = self.evaluation_function()
        return df

    def to_arrow(self):
        # Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.
        print("Getting a pyarrow Table from a File is easy!")
        df = self.evaluation_function()
        return pyarrow.Table.from_pandas(df)

    def to_remote_storage(self):
        # Optional method to write to an offline storage location to support scalable batch materialization.
        pass
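The lazy behaviour of the retrieval job can be checked with a small stand-in that avoids pandas and pyarrow. LazyJob, run_query, and to_rows are hypothetical names used only for this sketch. The point is that constructing the job does not run the query; only the export call does.

```python
from typing import Callable, Dict, List


class LazyJob:
    """Stand-in for a RetrievalJob: stores a query function and runs it only on export."""

    def __init__(self, evaluation_function: Callable[[], List[Dict]]):
        self._evaluation_function = evaluation_function

    def to_rows(self) -> List[Dict]:
        # The query is executed here, at the last possible moment.
        return self._evaluation_function()


calls: List[str] = []


def run_query() -> List[Dict]:
    calls.append("executed")
    return [{"driver_id": 1001, "conv_rate": 0.6}]


job = LazyJob(run_query)
executed_before_export = len(calls)  # the job exists, but nothing has run yet
rows = job.to_rows()
executed_after_export = len(calls)  # the query ran exactly once, on export
```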

4. Defining a DataSource class for the offline store

Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource base class needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.

The data source class should implement two methods - from_proto, and to_proto.

For custom offline stores that are not being implemented in the main feature repo, the custom_options field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto method and reading the value back from bytes in the from_proto method.

feast_custom_offline_store/file.py

class CustomFileDataSource(FileSource):
    """Custom data source class for local files"""
    def __init__(
        self,
        timestamp_field: Optional[str] = "",
        path: Optional[str] = None,
        field_mapping: Optional[Dict[str, str]] = None,
        created_timestamp_column: Optional[str] = "",
        date_partition_column: Optional[str] = "",
    ):
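        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        super(CustomFileDataSource, self).__init__(
            timestamp_field=timestamp_field,
            created_timestamp_column=created_timestamp_column,
            field_mapping=field_mapping,
            date_partition_column=date_partition_column,
        )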
        self._path = path


    @staticmethod
    def from_proto(data_source: DataSourceProto):
        custom_source_options = str(
            data_source.custom_options.configuration, encoding="utf8"
        )
        path = json.loads(custom_source_options)["path"]
        return CustomFileDataSource(
            field_mapping=dict(data_source.field_mapping),
            path=path,
            timestamp_field=data_source.timestamp_field,
            created_timestamp_column=data_source.created_timestamp_column,
            date_partition_column=data_source.date_partition_column,
        )

    def to_proto(self) -> DataSourceProto:
        config_json = json.dumps({"path": self.path})
        data_source_proto = DataSourceProto(
            type=DataSourceProto.CUSTOM_SOURCE,
            custom_options=DataSourceProto.CustomSourceOptions(
                configuration=bytes(config_json, encoding="utf8")
            ),
        )

        data_source_proto.timestamp_field = self.timestamp_field
        data_source_proto.created_timestamp_column = self.created_timestamp_column
        data_source_proto.date_partition_column = self.date_partition_column

        return data_source_proto
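The byte-level round trip through custom_options used in to_proto and from_proto can be tested in isolation, without any protobuf classes. The helper names encode_options and decode_options are hypothetical; they mirror only the json and bytes handling shown above.

```python
import json


def encode_options(options: dict) -> bytes:
    """Serialize data source options to UTF-8 encoded JSON bytes."""
    return bytes(json.dumps(options), encoding="utf8")


def decode_options(configuration: bytes) -> dict:
    """Read options back from UTF-8 encoded JSON bytes."""
    return json.loads(str(configuration, encoding="utf8"))


original = {"path": "feature_repo/data/driver_stats.parquet"}
encoded = encode_options(original)
decoded = decode_options(encoded)
```

Checking that decoded equals original is a cheap way to catch serialization mistakes before the data source is registered.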

5. Using the custom offline store

After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the offline_store field. The value specified should be the fully qualified class name of the OfflineStore.

As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.
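Loading a class from a fully qualified name at runtime can be sketched with the standard library's importlib. This is not Feast's actual loader; import_class and its required_suffix check are assumptions that illustrate the general mechanism and the class-name suffix rule mentioned earlier. A standard library class stands in for a custom offline store so the example runs anywhere.

```python
import importlib


def import_class(fully_qualified_name: str, required_suffix: str = ""):
    """Import and return the class named by 'package.module.ClassName'."""
    module_name, _, class_name = fully_qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"{fully_qualified_name!r} is not a fully qualified class name")
    if required_suffix and not class_name.endswith(required_suffix):
        raise ValueError(f"{class_name!r} must end with {required_suffix!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# collections.OrderedDict stands in for something like
# feast_custom_offline_store.file.CustomFileOfflineStore.
ordered_dict_cls = import_class("collections.OrderedDict", required_suffix="Dict")
```

If the module is not importable in the current Python environment, importlib.import_module raises ModuleNotFoundError, which is why the package containing the custom store must be installed or on the import path.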

To use our custom file offline store, we can use the following feature_store.yaml:

feature_repo/feature_store.yaml

project: test_custom
registry: data/registry.db
provider: local
offline_store:
    # Make sure to specify the type as the fully qualified path that Feast can import.
    type: feast_custom_offline_store.file.CustomFileOfflineStore

If the offline store requires no additional configuration, the other fields can be omitted and the fully qualified class name given directly as the value of offline_store.

feature_repo/feature_store.yaml

project: test_custom
registry: data/registry.db
provider: local
offline_store: feast_custom_offline_store.file.CustomFileOfflineStore
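
The two YAML forms above carry the same information. The sketch below shows one way a loader could treat them uniformly after parsing; it is an assumption made for illustration, not Feast's implementation, and normalize_offline_store is a hypothetical name.

```python
from typing import Any, Dict, Union


def normalize_offline_store(value: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a mapping with a "type" key for either accepted form."""
    if isinstance(value, str):
        # Short form: the value is just the fully qualified class name.
        return {"type": value}
    if "type" not in value:
        raise ValueError("offline_store mapping must contain a 'type' key")
    # Long form: copy so the caller's parsed config is not mutated.
    return dict(value)


short_form = normalize_offline_store(
    "feast_custom_offline_store.file.CustomFileOfflineStore"
)
long_form = normalize_offline_store(
    {"type": "feast_custom_offline_store.file.CustomFileOfflineStore"}
)
```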

Finally, the custom data source class can be used in the feature repo to define a data source, which can then be referred to in a feature view definition.

feature_repo/repo.py

driver_hourly_stats = CustomFileDataSource(
    path="feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)


driver_hourly_stats_view = FeatureView(
    source=driver_hourly_stats,
    ...
)

6. Testing the OfflineStore class

Integrating with the integration test suite and unit test suite.

Even if you have created the OfflineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo.

In order to test against the test suite, you need to create a custom DataSourceCreator that implements our testing infrastructure methods, create_data_source and, optionally, create_saved_dataset_destination.
- create_data_source should create a datasource based on the dataframe passed in. It may be implemented by uploading the contents of the dataframe into the offline store and returning a datasource object pointing to that location. See BigQueryDataSourceCreator for an implementation of a data source creator.
- create_saved_dataset_destination is invoked when users need to save the dataset for use in data validation. This functionality is still in alpha and is optional.
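
To make the shape of create_data_source concrete, here is a standard-library-only sketch of the "upload the data, return a pointer to it" pattern. It is a stand-in, not the Feast interface: a real creator would subclass Feast's DataSourceCreator, receive a pandas dataframe, and return a DataSource object. The class name CsvDataSourceCreator, the list-of-dicts input, and the returned dictionary are all assumptions made for this illustration.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List


class CsvDataSourceCreator:
    """Hypothetical stand-in for a DataSourceCreator, using CSV files as the store."""

    def __init__(self, project_name: str):
        self.project_name = project_name
        # Temporary directory that plays the role of the offline store.
        self._dir = tempfile.TemporaryDirectory()

    def create_data_source(
        self,
        rows: List[Dict[str, Any]],
        destination_name: str,
        timestamp_field: str = "ts",
    ) -> Dict[str, str]:
        """Write the rows into the store and return a description of where they live."""
        path = Path(self._dir.name) / f"{self.project_name}_{destination_name}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        return {"path": str(path), "timestamp_field": timestamp_field}

    def teardown(self) -> None:
        """Remove everything the tests created."""
        self._dir.cleanup()


creator = CsvDataSourceCreator("test_custom")
source = creator.create_data_source(
    [{"driver_id": 1001, "ts": "2021-04-12T10:00:00"}], "driver_stats"
)
```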
Make sure that your offline store doesn't break any unit tests first by running:
```
make test-python-unit
```
Next, set up your offline store to run the universal integration tests. These integration tests exercise offline and online stores against Feast API functionality, to ensure that the Feast APIs work with your offline store.
- Feast parametrizes integration tests using the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py, which stores different offline store classes for testing.
- To override the default configurations and use your own offline store, create your own file that contains a FULL_REPO_CONFIGS dictionary, and point Feast to it by setting the environment variable FULL_REPO_CONFIGS_MODULE. The module should add new IntegrationTestRepoConfig classes to AVAILABLE_OFFLINE_STORES by defining an offline store that you would like Feast to test with.
A sample FULL_REPO_CONFIGS_MODULE looks something like this:
```
# Should go in sdk/python/feast/infra/offline_stores/contrib/postgres_repo_configuration.py
from feast.infra.offline_stores.contrib.postgres_offline_store.tests.data_source import (
    PostgreSQLDataSourceCreator,
)

AVAILABLE_OFFLINE_STORES = [("local", PostgreSQLDataSourceCreator)]
```
Set the FULL_REPO_CONFIGS_MODULE environment variable to your own module and run the integration tests against your offline store. In the example repo, the file that overrides FULL_REPO_CONFIGS is feast_custom_offline_store/feast_tests.py, so you would run:
```
export FULL_REPO_CONFIGS_MODULE='feast_custom_offline_store.feast_tests'
make test-python-universal
```
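The override mechanism can be pictured roughly as follows. This sketch is an assumption about the general pattern (read a module path from an environment variable, import it, read a well-known attribute), not the code in repo_configuration.py. The module name my_store_feast_tests and the function load_available_offline_stores are hypothetical, and the stand-in module is registered by hand only so the example runs without a real package.

```python
import importlib
import os
import sys
import types

# Register a stand-in module so the example runs without a real package.
fake_module = types.ModuleType("my_store_feast_tests")
fake_module.AVAILABLE_OFFLINE_STORES = [("local", "MyDataSourceCreator")]
sys.modules[fake_module.__name__] = fake_module


def load_available_offline_stores(default: list) -> list:
    """Return the stores declared by the override module, or the default list."""
    module_name = os.environ.get("FULL_REPO_CONFIGS_MODULE")
    if not module_name:
        return default
    module = importlib.import_module(module_name)
    return getattr(module, "AVAILABLE_OFFLINE_STORES", default)


os.environ["FULL_REPO_CONFIGS_MODULE"] = "my_store_feast_tests"
stores = load_available_offline_stores(default=[])
```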
If the integration tests fail, this indicates that there is a mistake in the implementation of this offline store!
Remember to add your datasource to repo_config.py similar to how we added spark, trino, etc, to the dictionary OFFLINE_STORE_CLASS_FOR_TYPE. This will allow Feast to load your class from the feature_store.yaml.
Finally, add a target to the Makefile that runs your datastore-specific tests by setting the FULL_REPO_CONFIGS_MODULE and PYTEST_PLUGINS environment variables. The PYTEST_PLUGINS environment variable allows pytest to load the DataSourceCreator for your datasource. You can deselect tests that are not relevant or do not yet work for your datastore using the -k option.

Makefile

test-python-universal-spark:
	PYTHONPATH='.' \
	FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.offline_stores.contrib.spark_repo_configuration \
	PYTEST_PLUGINS=feast.infra.offline_stores.contrib.spark_offline_store.tests \
	IS_TEST=True \
	python -m pytest -n 8 --integration \
		-k "not test_historical_retrieval_fails_on_validation and \
			not test_historical_retrieval_with_validation and \
			not test_historical_features_persisting and \
			not test_universal_cli and \
			not test_go_feature_server and \
			not test_feature_logging and \
			not test_reorder_columns and \
			not test_logged_features_validation and \
			not test_lambda_materialization_consistency and \
			not test_offline_write and \
			not test_push_features_to_offline_store.py and \
			not gcs_registry and \
			not s3_registry and \
			not test_universal_types" \
 	 sdk/python/tests
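
Long -k expressions like the one above are easy to get wrong by hand (for example, by listing the same test twice). As a convenience sketch, the expression can be generated from a list of names; build_deselect_expression is a hypothetical helper, not something shipped with Feast or pytest.

```python
from typing import Iterable


def build_deselect_expression(test_names: Iterable[str]) -> str:
    """Join names into a pytest -k expression of the form "not a and not b"."""
    # dict.fromkeys drops duplicates while keeping the original order.
    unique_names = dict.fromkeys(test_names)
    return " and ".join(f"not {name}" for name in unique_names)


expression = build_deselect_expression(
    [
        "test_historical_retrieval_fails_on_validation",
        "test_universal_cli",
        "test_historical_retrieval_fails_on_validation",
    ]
)
```

The resulting string can be passed as the argument of -k on the pytest command line.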

7. Dependencies

Add any dependencies for your offline store to sdk/python/setup.py under a new <OFFLINE_STORE>_REQUIRED list, and register that list in the setup script so that users who need your offline store can install the necessary Python packages. These packages should be defined as extras so that they are not installed by default. You will then need to regenerate the requirements files:

make lock-python-ci-dependencies-all
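
For orientation, an extras entry in a setuptools script generally looks like the fragment below. The list name CUSTOM_REQUIRED, the extra name "custom", and the package example-client are placeholders invented for this sketch; substitute the names and version pins your offline store actually needs.

```python
# Hypothetical dependency list for a custom offline store.
CUSTOM_REQUIRED = [
    "example-client>=1.0",  # placeholder package name
]

# Mapping passed to setuptools as setup(..., extras_require=EXTRAS_REQUIRE).
# Packages listed here are installed only when the extra is requested.
EXTRAS_REQUIRE = {
    "custom": CUSTOM_REQUIRED,
}
```

With an entry like this, the extra is requested explicitly at install time using pip's bracket syntax, for example pip install "feast[custom]".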

8. Add Documentation

Remember to add documentation for your offline store.

Add a new markdown file to docs/reference/offline-stores/ and docs/reference/data-sources/. Use these files to document your offline store functionality similar to how the other offline stores are documented.
You should also add a reference in docs/reference/data-sources/README.md and docs/SUMMARY.md to these markdown files.

NOTE: Be sure to document the following things about your offline store:

How to create the datasource and what configuration is needed in the feature_store.yaml file in order to create the datasource.
Make sure to flag that the datasource is in alpha development.
Add some documentation on what the data model is for the specific offline store for more clarity.
Finally, generate the python code docs by running:

make build-sphinx
