1 of 8

Architecture

Overview

Feast's architecture is designed to be flexible and scalable. It is composed of several components that work together to provide a feature store that can be used to serve features for training and inference.

Feast uses a Push Model to ingest data from different sources and store feature values in the online store. This allows Feast to serve features in real-time with low latency.
Feast supports feature transformation for On Demand and Streaming data sources and will support Batch transformations in the future. For Streaming and Batch data sources, Feast requires a separate Feature Transformation Engine (in the batch case, this is typically your Offline Store). We are exploring adding a default streaming engine to Feast.
Domain expertise is recommended when integrating a data source with Feast understand the tradeoffs from different write patterns to your application
We recommend using Python for your Feature Store microservice. As mentioned in the document, precomputing features is the recommended optimal path to ensure low latency performance. Reducing feature serving to a lightweight database lookup is the ideal pattern, which means the marginal overhead of Python should be tolerable. Because of this we believe the pros of Python outweigh the costs, as reimplementing feature logic is undesirable. Java and Go Clients are also available for online feature retrieval.
Role-Based Access Control (RBAC) is a security mechanism that restricts access to resources based on the roles of individual users within an organization. In the context of the Feast, RBAC ensures that only authorized users or groups can access or modify specific resources, thereby maintaining data security and operational integrity.

Language

Use Python to serve your features.

Why should you use Python to Serve features for Machine Learning?

Python has emerged as the primary language for machine learning, and this extends to feature serving and there are five main reasons Feast recommends using a microservice written in Python.

1. Python is the language of Machine Learning

You should meet your users where they are. Python’s popularity in the machine learning community is undeniable. Its simplicity and readability make it an ideal language for writing and understanding complex algorithms. Python boasts a rich ecosystem of libraries such as TensorFlow, PyTorch, XGBoost, and scikit-learn, which provide robust support for developing and deploying machine learning models and we want Feast in this ecosystem.

2. Precomputation is The Way

Precomputing features is the recommended optimal path to ensure low latency performance. Reducing feature serving to a lightweight database lookup is the ideal pattern, which means the marginal overhead of Python should be tolerable. Precomputation ensures product experiences for downstream services are also fast. Slow user experiences are bad user experiences. Precompute and persist data as much as you can.

3. Serving features in another language can lead to skew

Ensuring that features used during model training (offline serving) and online serving are available in production to make real-time predictions is critical. When features are initially developed, they are typically written in Python. This is due to the convenience and efficiency provided by Python's data manipulation libraries. However, in a production environment, there is often interest or pressure to rewrite these features in a different language, like Java, Go, or C++, for performance reasons. This reimplementation introduces a significant risk: training and serving skew. Note that there will always be some minor exceptions (e.g., any Time Since Last Event types of features) but this should not be the rule.

Training and serving skew occurs when there are discrepancies between the features used during model training and those used during prediction. This can lead to degraded model performance, unreliable predictions, and reduced velocity in releasing new features and new models. The process of rewriting features in another language is prone to errors and inconsistencies, which exacerbate this issue.

4. Reimplementation is Excessive

Rewriting features in another language is not only risky but also resource-intensive. It requires significant time and effort from engineers to ensure that the features are correctly translated. This process can introduce bugs and inconsistencies, further increasing the risk of training and serving skew. Additionally, maintaining two versions of the same feature codebase adds unnecessary complexity and overhead. More importantly, the opportunity cost of this work is high and requires twice the amount of resourcing. Reimplementing code should only be done when the performance gains are worth the investment. Features should largely be precomputed so the latency performance gains should not be the highest impact work that your team can accomplish.

5. Use existing Python Optimizations

Rather than switching languages, it is more efficient to optimize the performance of your feature store while keeping Python as the primary language. Optimization is a two step process.

Step 1: Quantify latency bottlenecks in your feature calculations

Use tools like CProfile to understand latency bottlenecks in your code. This will help you prioritize the biggest inefficiencies first. When we initially launched Python native transformations in Python, profiling the code helped us identify that Pandas resulted in a 10x overhead due to type conversion.

Step 2: Optimize your feature calculations

As mentioned, precomputation is the recommended path. In some cases, you may want fully synchronous writes from your data producer to your online feature store, in which case you will want your feature computations and writes to be very fast. In this case, we recommend optimizing the feature calculation code first.

You should optimize your code using libraries, tools, and caching. For example, identify whether your feature calculations can be optimized through vectorized calculations in NumPy; explore tools like Numba for faster execution; and cache frequently accessed data using tools like an lru_cache.

Lastly, Feast will continue to optimize serving in Python and making the overall infrastructure more performant. This will better serve the community.

So we recommend focusing on optimizing your feature-specific code, reporting latency bottlenecks to the maintainers, and contributing to help the infrastructure be more performant.

By keeping features in Python and optimizing performance, you can ensure consistency between training and serving, reduce the risk of errors, and focus on launching more product experiences for your customers.

Embrace Python for feature serving, and leverage its strengths to build robust and reliable machine learning systems.

Push vs Pull Model

Feast uses a Push Model, i.e., Data Producers push data to the feature store and Feast stores the feature values in the online store, to serve features in real-time.

In a Pull Model, Feast would pull data from the data producers at request time and store the feature values in the online store before serving them (storing them would actually be unnecessary). This approach would incur additional network latency as Feast would need to orchestrate a request to each data producer, which would mean the latency would be at least as long as your slowest call. So, in order to serve features as fast as possible, we push data to Feast and store the feature values in the online store.

The trade-off with the Push Model is that strong consistency is not guaranteed out of the box. Instead, strong consistency has to be explicitly designed for in orchestrating the updates to Feast and the client usage.

The significant advantage with this approach is that Feast is read-optimized for low-latency feature retrieval.

How to Push

Implicit in the Push model are decisions about how and when to push feature values to the online store.

From a developer's perspective, there are three ways to push feature values to the online store with different tradeoffs.

They are discussed further in the Write Patterns section.

Write Patterns

Feast uses a Push Model to push features to the online store.

This has two important consequences: (1) communication patterns between the Data Producer (i.e., the client) and Feast (i.e,. the server) and (2) feature computation and feature value write patterns to Feast's online store.

Data Producers (i.e., services that generate data) send data to Feast so that Feast can write feature values to the online store. That data can be either raw data where Feast computes and stores the feature values or precomputed feature values.

Communication Patterns

There are two ways a client (or Data Producer) can send data to the online store:

Synchronously
- Using a synchronous API call for a small number of entities or a single entity (e.g., using the push or write_to_online_store methods) or the Feature Server's push endpoint)
Asynchronously
- Using an asynchronous API call for a small number of entities or a single entity (e.g., using the push or write_to_online_store methods) or the Feature Server's push endpoint)
- Using a "batch job" for a large number of entities (e.g., using a batch materialization engine)

Note, in some contexts, developers may "batch" a group of entities together and write them to the online store in a single API call. This is a common pattern when writing data to the online store to reduce write loads but we would not qualify this as a batch job.

Feature Value Write Patterns

Writing feature values to the online store (i.e., the server) can be done in two ways: Precomputing the transformations client-side or Computing the transformations On Demand server-side.

Combining Approaches

In some scenarios, a combination of Precomputed and On Demand transformations may be optimal.

When selecting feature value write patterns, one must consider the specific requirements of your application, the acceptable correctness of the data, the latency tolerance, and the computational resources available. Making deliberate choices can help the performance and reliability of your service.

There are two ways the client can write feature values to the online store:

Precomputing transformations
Computing transformations On Demand
Hybrid (Precomputed + On Demand)

1. Precomputing Transformations

Precomputed transformations can happen outside of Feast (e.g., via some batch job or streaming application) or inside of the Feast feature server when writing to the online store via the push or write-to-online-store api.

2. Computing Transformations On Demand

On Demand transformations can only happen inside of Feast at either (1) the time of the client's request or (2) when the data producer writes to the online store.

3. Hybrid (Precomputed + On Demand)

The hybrid approach allows for precomputed transformations to happen inside or outside of Feast and have the On Demand transformations happen at client request time. This is particularly convenient for "Time Since Last" types of features (e.g., time since purchase).

Tradeoffs

When deciding between synchronous and asynchronous data writes, several tradeoffs should be considered:

Data Consistency: Asynchronous writes allow Data Producers to send data without waiting for the write operation to complete, which can lead to situations where the data in the online store is stale. This might be acceptable in scenarios where absolute freshness is not critical. However, for critical operations, such as calculating loan amounts in financial applications, stale data can lead to incorrect decisions, making synchronous writes essential.
Correctness: The risk of data being out-of-date must be weighed against the operational requirements. For instance, in a lending application, having up-to-date feature data can be crucial for correctness (depending upon the features and raw data), thus favoring synchronous writes. In less sensitive contexts, the eventual consistency offered by asynchronous writes might be sufficient.
Service Coupling: Synchronous writes result in tighter coupling between services. If a write operation fails, it can cause the dependent service operation to fail as well, which might be a significant drawback in systems requiring high reliability and independence between services.
Application Latency: Asynchronous writes typically reduce the perceived latency from the client's perspective because the client does not wait for the write operation to complete. This can enhance the user experience and efficiency in environments where operations are not critically dependent on immediate data freshness.

The table below can help guide the most appropriate data write and feature computation strategies based on specific application needs and data sensitivity.

Feature Transformation

A feature transformation is a function that takes some set of input data and returns some set of output data. Feature transformations can happen on either raw data or derived data.

Feature Transformation Engines

Feature transformations can be executed by three types of "transformation engines":

The Feast Feature Server
An Offline Store (e.g., Snowflake, BigQuery, DuckDB, Spark, etc.)
A Stream processor (e.g., Flink or Spark Streaming)

The three transformation engines are coupled with the communication pattern used for writes.

Importantly, this implies that different feature transformation code may be used under different transformation engines, so understanding the tradeoffs of when to use which transformation engine/communication pattern is extremely critical to the success of your implementation.

In general, we recommend transformation engines and network calls to be chosen by aligning it with what is most appropriate for the data producer, feature/model usage, and overall product.

Feature Serving and Model Inference

Note: this ML Infrastructure diagram highlights an orchestration pattern that is driven by a client application. This is not the only approach that can be taken and different patterns will result in different trade-offs.

Production machine learning systems can choose from four approaches to serving machine learning predictions (the output of model inference):

Online model inference with online features
Offline mode inference without online features
Online model inference with online features and cached predictions
Online model inference without features

Note: online features can be sourced from batch, streaming, or request data sources.

These three approaches have different tradeoffs but, in general, have significant implementation differences.

1. Online Model Inference with Online Features

Online model inference with online features is a powerful approach to serving data-driven machine learning applications. This requires a feature store to serve online features and a model server to serve model predictions (e.g., KServe). This approach is particularly useful for applications where request-time data is required to run inference.

features = store.get_online_features(
    feature_refs=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
    ],
    entity_rows=[{"user_id": 1}],
)
model_predictions = model_server.predict(features)

2. Offline Model Inference without Online Features

Typically, Machine Learning teams find serving precomputed model predictions to be the most straightforward to implement. This approach simply treats the model predictions as a feature and serves them from the feature store using the standard Feast sdk. These model predictions are typically generated through some batch process where the model scores are precomputed. As a concrete example, the batch process can be as simple as a script that runs model inference locally for a set of users that can output a CSV. This output file could be used for materialization so that the model could be served online as shown in the code below.

model_predictions = store.get_online_features(
    feature_refs=[
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)

Notice that the model server is not involved in this approach. Instead, the model predictions are precomputed and materialized to the online store.

While this approach can lead to quick impact for different business use cases, it suffers from stale data as well as only serving users/entities that were available at the time of the batch computation. In some cases, this tradeoff may be tolerable.

3. Online Model Inference with Online Features and Cached Predictions

This approach is the most sophisticated where inference is optimized for low-latency by caching predictions and running model inference when data producers write features to the online store. This approach is particularly useful for applications where features are coming from multiple data sources, the model is computationally expensive to run, or latency is a significant constraint.

# Client Reads
features = store.get_online_features(
    feature_refs=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
if features.to_dict().get('user_data:model_predictions') is None:
    model_predictions = model_server.predict(features)
    store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))

Note that in this case a seperate call to write_to_online_store is required when the underlying data changes and predictions change along with it.

# Client Writes from the Data Producer
user_data = request.POST.get('user_data')
model_predictions = model_server.predict(user_data) # assume this includes `user_data` in the Data Frame
store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))

While this requires additional writes for every data producer, this approach will result in the lowest latency for model inference.

4. Online Model Inference without Features

This approach does not require Feast. The model server can directly serve predictions without any features. This approach is common in Large Language Models (LLMs) and other models that do not require features to make predictions.

Note that generative models using Retrieval Augmented Generation (RAG) do require features where the document embeddings are treated as features, which Feast supports (this would fall under "Online Model Inference with Online Features").

Client Orchestration

Implicit in the code examples above is a design choice about how clients orchestrate calls to get features and run model inference. The examples had a Feast-centric pattern because they are inputs to the model, so the sequencing is fairly obvious. An alternative approach can be Inference-centric where a client would call an inference endpoint and the inference service would be responsible for orchestration.

Role-Based Access Control (RBAC)

Introduction

Role-Based Access Control (RBAC) is a security mechanism that restricts access to resources based on the roles of individual users within an organization. In the context of the Feast, RBAC ensures that only authorized users or groups can access or modify specific resources, thereby maintaining data security and operational integrity.

Functional Requirements

The RBAC implementation in Feast is designed to:

Assign Permissions: Allow administrators to assign permissions for various operations and resources to users or groups based on their roles.
Seamless Integration: Integrate smoothly with existing business code without requiring significant modifications.
Backward Compatibility: Maintain support for non-authorized models as the default to ensure backward compatibility.

Business Goals

The primary business goals of implementing RBAC in the Feast are:

Feature Sharing: Enable multiple teams to share the feature store while ensuring controlled access. This allows for collaborative work without compromising data security.
Access Control Management: Prevent unauthorized access to team-specific resources and spaces, governing the operations that each user or group can perform.

Reference Architecture

Feast operates as a collection of connected services, each enforcing authorization permissions. The architecture is designed as a distributed microservices system with the following key components:

Service Endpoints: These enforce authorization permissions, ensuring that only authorized requests are processed.
Client Integration: Clients authenticate with feature servers by attaching authorization token to each request.
Service-to-Service Communication: This is always granted.

Permission Model

The RBAC system in Feast uses a permission model that defines the following concepts:

Resource: An object within Feast that needs to be secured against unauthorized access.
Action: A logical operation performed on a resource, such as Create, Describe, Update, Delete, Read, or write operations.
Policy: A set of rules that enforce authorization decisions on resources. The default implementation uses role-based policies.

Authorization Architecture

The authorization architecture in Feast is built with the following components:

Token Extractor: Extracts the authorization token from the request header.
Token Parser: Parses the token to retrieve user details.
Policy Enforcer: Validates the secured endpoint against the retrieved user details.
Token Injector: Adds the authorization token to each secured request header.