Overview

Feast project structure

Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).

Data ingestion

For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.
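For example, once a push source is registered (as in the quickstart below), pushing a dataframe of freshly computed feature values into Feast is a single call. A minimal sketch, assuming a push source named driver_stats_push_source has been applied and that your Feast version exposes PushMode in feast.data_source:

import pandas as pd

from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path=".")

# Freshly computed feature values for one driver (illustrative values)
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [pd.Timestamp.now(tz="UTC")],
        "created": [pd.Timestamp.now(tz="UTC")],
        "conv_rate": [0.85],
        "acc_rate": [0.91],
        "avg_daily_trips": [425],
    }
)

# Write the rows to the online store, the offline (batch) source, or both
store.push("driver_stats_push_source", event_df, to=PushMode.ONLINE_AND_OFFLINE)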

Feature registration and retrieval

Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).

Feast supports several patterns of feature retrieval.

Use case: Training data generation
Example: Fetching user and item features for (user, item) pairs when training a production recommendation model
API: get_historical_features

Use case: Offline feature retrieval for batch predictions
Example: Predicting user churn for all users on a daily basis
API: get_historical_features

Use case: Online feature retrieval for real-time model predictions
Example: Fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent
API: get_online_features

Quickstart

In this tutorial we will

  1. Deploy a local feature store with a Parquet file offline store and SQLite online store.

  2. Build a training dataset using our time series features from our Parquet files.

  3. Ingest batch features ("materialization") and streaming features (via a Push API) into the online store.

  4. Read the latest features from the offline store for batch scoring.

  5. Read the latest features from the online store for real-time inference.

  6. Explore the (experimental) Feast UI.

Overview

In this tutorial, we'll use Feast to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:

  1. Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.

    • Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.

  2. Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.

    • Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.

  3. Feature and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.

    • Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).

    • (Experimental) Feast enables light-weight feature transformations so users can re-use transformation logic across online / offline use cases and across models.

Step 1: Install Feast

Install the Feast SDK and CLI using pip:

pip install feast

Step 2: Create a feature repository

Bootstrap a new feature repository using feast init from the command line.

feast init my_project
cd my_project/feature_repo
Creating a new Feast repository in /home/Jovyan/my_project.

Let's take a look at the resulting demo repo itself. It breaks down into:

  • data/ contains raw demo parquet data

  • example_repo.py contains demo feature definitions

  • feature_store.yaml contains a demo setup configuring where data sources are

  • test_workflow.py showcases how to run all key Feast commands, including defining, retrieving, and pushing features. You can run this with python test_workflow.py.

feature_store.yaml:

project: my_project
# By default, the registry is a file (but can be turned into a more scalable SQL-backed registry)
registry: data/registry.db
# The provider primarily specifies default offline / online stores & storing the registry in a given cloud
provider: local
online_store:
  type: sqlite
  path: data/online_store.db
entity_key_serialization_version: 2

example_repo.py:

# This is an example feature definition file

from datetime import timedelta

import pandas as pd

from feast import (
    Entity,
    FeatureService,
    FeatureView,
    Field,
    FileSource,
    PushSource,
    RequestSource,
)
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float32, Float64, Int64

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_stats_source = FileSource(
    name="driver_hourly_stats_source",
    path="%PARQUET_PATH%",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature columns. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_stats_fv = FeatureView(
    # The unique name of this feature view. Two feature views in a single
    # project cannot have the same name
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    # The list of features defined below acts as a schema: it defines the features
    # to materialize into a store, and is used as a set of references during
    # retrieval for building a training dataset or serving features
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_stats_source,
    # Tags are user defined key/value pairs that are attached to each
    # feature view
    tags={"team": "driver_performance"},
)

# Defines a way to push data (to be available offline, online or both) into Feast.
driver_stats_push_source = PushSource(
    name="driver_stats_push_source",
    batch_source=driver_stats_source,
)

# Define a request data source which encodes features / information only
# available at request time (e.g. part of the user initiated HTTP request)
input_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name="val_to_add", dtype=Int64),
        Field(name="val_to_add_2", dtype=Int64),
    ],
)


# Define an on demand feature view which can generate new features based on
# existing feature views and RequestSource features
@on_demand_feature_view(
    sources=[driver_stats_fv, input_request],
    schema=[
        Field(name="conv_rate_plus_val1", dtype=Float64),
        Field(name="conv_rate_plus_val2", dtype=Float64),
    ],
)
def transformed_conv_rate(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["conv_rate_plus_val1"] = inputs["conv_rate"] + inputs["val_to_add"]
    df["conv_rate_plus_val2"] = inputs["conv_rate"] + inputs["val_to_add_2"]
    return df


# This groups features into a model version
driver_activity_v1 = FeatureService(
    name="driver_activity_v1",
    features=[
        driver_stats_fv[["conv_rate"]],  # Sub-selects a feature from a feature view
        transformed_conv_rate,  # Selects all features from the feature view
    ],
)
driver_activity_v2 = FeatureService(
    name="driver_activity_v2", features=[driver_stats_fv, transformed_conv_rate]
)

The feature_store.yaml file configures the key overall architecture of the feature store.

The provider value sets default offline and online stores.

  • The offline store provides the compute layer to process historical data (for generating training data & feature values for serving).

  • The online store is a low latency store of the latest feature values (for powering real-time inference).

Valid values for provider in feature_store.yaml are:

  • local: use a SQL registry or local file registry. By default, use a file / Dask based offline store + SQLite online store

  • gcp: use a SQL registry or GCS file registry. By default, use BigQuery (offline store) + Google Cloud Datastore (online store)

  • aws: use a SQL registry or S3 file registry. By default, use Redshift (offline store) + DynamoDB (online store)

Inspecting the raw data

The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.

import pandas as pd
pd.read_parquet("data/driver_stats.parquet")

Step 3: Run sample workflow

There's an included test_workflow.py file which runs through a full sample workflow:

  1. Register feature definitions through feast apply

  2. Generate a training dataset (using get_historical_features)

  3. Generate features for batch scoring (using get_historical_features)

  4. Ingest batch features into an online store (using materialize_incremental)

  5. Fetch online features to power real time inference (using get_online_features)

  6. Ingest streaming features into offline / online stores (using push)

  7. Verify online features are updated / fresher

We'll walk through some snippets of code below and explain them.

Step 3a: Register feature definitions and deploy your feature store

The apply command scans Python files in the current directory for feature view / entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example_repo.py and sets up SQLite online store tables. Note that we specified SQLite as the online store by configuring online_store in feature_store.yaml.

feast apply
Created entity driver
Created feature view driver_hourly_stats
Created on demand feature view transformed_conv_rate
Created feature service driver_activity_v1
Created feature service driver_activity_v2

Created sqlite table my_project_driver_hourly_stats

Step 3b: Generating training data or powering batch scoring models

To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.

Feast needs a list of entities (e.g. driver ids) and timestamps. Feast will intelligently join the relevant tables to create these feature vectors. There are two ways to generate this list:

  1. The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation.

  • Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.

  2. The user can also query that table with a SQL query that pulls entities. See the documentation on feature retrieval for details.

Generating training data

from datetime import datetime
import pandas as pd

from feast import FeatureStore

# Note: see https://docs.feast.dev/getting-started/concepts/feature-retrieval for 
# more details on how to retrieve for all entities in the offline store instead
entity_df = pd.DataFrame.from_dict(
    {
        # entity's join key -> entity values
        "driver_id": [1001, 1002, 1003],
        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
        # (optional) label name -> label values. Feast does not process these
        "label_driver_reported_satisfaction": [1, 5, 3],
        # values we're using for an on-demand transformation
        "val_to_add": [1, 2, 3],
        "val_to_add_2": [10, 20, 30],
    }
)

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())
----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   event_timestamp                     3 non-null      datetime64[ns, UTC]
 1   driver_id                           3 non-null      int64
 2   label_driver_reported_satisfaction  3 non-null      int64
 3   conv_rate                           3 non-null      float32
 4   acc_rate                            3 non-null      float32
 5   avg_daily_trips                     3 non-null      int32
dtypes: datetime64[ns, UTC](1), float32(2), int32(1), int64(2)
memory usage: 132.0 bytes
None

----- Example features -----

                   event_timestamp  driver_id  ...  acc_rate  avg_daily_trips
0 2021-08-23 15:12:55.489091+00:00       1003  ...  0.077863              741
1 2021-08-23 15:49:55.489089+00:00       1002  ...  0.074327              113
2 2021-08-23 16:14:55.489075+00:00       1001  ...  0.105046              347

[3 rows x 6 columns]

Run offline inference (batch scoring)

To power a batch model, we primarily need to generate features with the get_historical_features call, but using the current timestamp.

entity_df["event_timestamp"] = pd.to_datetime("now", utc=True)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("\n----- Example features -----\n")
print(training_df.head())
----- Example features -----

   driver_id                  event_timestamp  ... acc_rate  avg_daily_trips  conv_rate_plus_val1  
0       1001 2022-08-08 18:22:06.555018+00:00  ... 0.864639              359             1.663844
1       1002 2022-08-08 18:22:06.555018+00:00  ... 0.695982              311             2.151189 
2       1003 2022-08-08 18:22:06.555018+00:00  ... 0.949191              789             3.769165 

Step 3c: Ingest batch features into your online store

We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).

CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
Materializing 1 feature views to 2021-08-23 16:25:46+00:00 into the sqlite online
store.

driver_hourly_stats from 2021-08-22 16:25:47+00:00 to 2021-08-23 16:25:46+00:00:
100%|████████████████████████████████████████████| 5/5 [00:00<00:00, 592.05it/s]
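The same materialization can also be triggered from the Python SDK rather than the CLI; a short sketch, assuming the feature repo is the current working directory:

from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Materialize all feature values between the last materialized point and now
store.materialize_incremental(end_date=datetime.utcnow())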

Step 3d: Fetching feature vectors for inference

At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.

from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()

pprint(feature_vector)
{
 'acc_rate': [0.5732735991477966, 0.7828438878059387],
 'avg_daily_trips': [33, 984],
 'conv_rate': [0.15498852729797363, 0.6263588070869446],
 'driver_id': [1004, 1005]
}

Step 3e: Using a feature service to fetch online features instead

The driver_activity_v1 feature service pulls all features from the driver_hourly_stats feature view:

from feast import FeatureService
driver_stats_fs = FeatureService(
    name="driver_activity_v1", features=[driver_hourly_stats_view]
)
from pprint import pprint
from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity_v1")
feature_vector = feature_store.get_online_features(
    features=feature_service,
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()
pprint(feature_vector)
{
 'acc_rate': [0.5732735991477966, 0.7828438878059387],
 'avg_daily_trips': [33, 984],
 'conv_rate': [0.15498852729797363, 0.6263588070869446],
 'driver_id': [1004, 1005]
}

Step 4: Browse your features with the Web UI (experimental)

View all registered features, data sources, entities, and feature services with the Web UI.

One of the ways to view this is with the feast ui command.

feast ui
INFO:     Started server process [66664]
08/17/2022 01:25:49 PM uvicorn.error INFO: Started server process [66664]
INFO:     Waiting for application startup.
08/17/2022 01:25:49 PM uvicorn.error INFO: Waiting for application startup.
INFO:     Application startup complete.
08/17/2022 01:25:49 PM uvicorn.error INFO: Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8888 (Press CTRL+C to quit)
08/17/2022 01:25:49 PM uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8888 (Press CTRL+C to quit)

Step 5: Re-examine test_workflow.py

Take a look at test_workflow.py again. It showcases many sample flows on how to interact with Feast. You'll see these show up in the upcoming concepts + architecture + tutorial pages as well.

Next steps

Community & getting help

Links & Resources

    • Feast users should join #feast-general or #feast-beginners to ask questions

    • Feast developers / contributors should join #feast-development

    • Design proposals in the form of Request for Comments (RFC).

    • User surveys and meeting minutes.

    • Slide decks of conferences our contributors have spoken at.

How can I get help?

  • Slack: Need to speak to a human? Come ask a question in our Slack channel (link above).

Community Calls

General community call (biweekly)

We have a user and contributor community call every two weeks (US & EU friendly).

Please join the above Feast user groups in order to see calendar invites to the community calls

Frequency (every 2 weeks)

  • Tuesday 10:00 am to 10:30 am PST

Links

Developers call (biweekly)

We also have a #feast-development community call every two weeks, where we discuss contributions + brainstorm best practices.

Frequency (every 2 weeks)

  • Tuesday 8:00 am to 8:30 am PST

Links

Roadmap

The list below contains the functionality that contributors are planning to develop for Feast.

  • We welcome contribution to all items in the roadmap!

  • Have questions about the roadmap? Go to the Slack channel to ask on #feast-development.

  • Data Sources

  • Offline Stores

  • Online Stores

  • Feature Engineering

  • Streaming

  • Deployments

  • Feature Serving

  • Feature Discovery and Governance

Introduction

What is Feast?

Feast (Feature Store) is a customizable operational data system that re-uses existing infrastructure to manage and serve machine learning features to realtime models.

Feast allows ML platform teams to:

  • Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).

  • Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.

  • Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.

Note: Feast today primarily addresses timestamped structured data.

Who is Feast for?

Feast helps ML platform teams with DevOps experience productionize real-time models. Feast can also help these teams build towards a feature platform that improves collaboration between engineers and data scientists.

Feast is likely not the right tool if you

  • are in an organization that’s just getting started with ML and is not yet sure what the business impact of ML is

  • rely primarily on unstructured data

  • need very low latency feature retrieval (e.g. p99 feature retrieval << 10ms)

  • have a small team to support a large number of use cases

What Feast is not

Feast is not

  • a data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.

  • a database: Feast is not a database, but helps manage data stored in other systems (e.g. BigQuery, Snowflake, DynamoDB, Redis) to make features consistently available at training / serving time

Feast does not fully solve

Example use cases

Many companies have used Feast to power real-world ML use cases such as:

  • Personalizing online recommendations by leveraging pre-computed historical user or item features.

  • Online fraud detection, using features that compare against (pre-computed) historical transaction patterns

  • Churn prediction (an offline model), generating feature values for all users at a fixed cadence in batch

  • Credit scoring, using pre-computed historical features to compute probability of default

How can I get started?

Explore the following resources to get started with Feast:

Feature view

Feature views

Note: Feature views do not work with non-timestamped data. A workaround is to insert dummy timestamps.
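For example, a minimal sketch of that workaround with pandas, assuming a hypothetical non-timestamped table of driver attributes stored as parquet:

import pandas as pd

# Hypothetical source file without timestamps
df = pd.read_parquet("data/driver_attributes.parquet")

# Add a constant dummy event timestamp so the table can back a feature view
df["event_timestamp"] = pd.Timestamp("2021-01-01", tz="UTC")
df.to_parquet("data/driver_attributes_with_ts.parquet")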

Feature views consist of:

  • a data source

  • zero or more entities (if the features are not related to a specific object, the feature view might not have entities; see feature views without entities below)

  • a name to uniquely identify this feature view in the project.

  • (optional, but recommended) a schema specifying one or more features (without this, Feast will infer the schema by reading from the data source)

  • (optional, but recommended) metadata (for example, description, or other free-form metadata via tags)

  • (optional) a TTL, which limits how far back Feast will look when generating historical datasets

Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view.

from feast import BigQuerySource, Entity, FeatureView, Field
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_fv = FeatureView(
    name="driver_activity",
    entities=[driver],
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="rating", dtype=Float32),
    ],
    source=BigQuerySource(
        table="feast-oss.demo_data.driver_activity"
    )
)

Feature views are used during

  • The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.

  • Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.

Feature views without entities

If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only timestamps are needed for this feature view).

from feast import BigQuerySource, FeatureView, Field
from feast.types import Int64

global_stats_fv = FeatureView(
    name="global_stats",
    entities=[],
    schema=[
        Field(name="total_trips_today_by_all_drivers", dtype=Int64),
    ],
    source=BigQuerySource(
        table="feast-oss.demo_data.global_stats"
    )
)

Feature inferencing

If the schema parameter is not specified in the creation of the feature view, Feast will infer the features during feast apply by creating a Field for each column in the underlying data source except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
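For illustration, a sketch of a feature view defined without a schema; during feast apply, Feast would create a Field for each non-entity, non-timestamp column it finds in the underlying source (the source path and column names below are assumptions based on the quickstart data):

from feast import Entity, FeatureView, FileSource

driver = Entity(name="driver", join_keys=["driver_id"])

# No schema is given, so features (e.g. conv_rate, acc_rate, avg_daily_trips)
# are inferred from the parquet columns, excluding driver_id, event_timestamp,
# and created.
driver_stats_inferred_fv = FeatureView(
    name="driver_hourly_stats_inferred",
    entities=[driver],
    source=FileSource(
        path="data/driver_stats.parquet",
        timestamp_field="event_timestamp",
        created_timestamp_column="created",
    ),
)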

Entity aliasing

"Entity aliases" can be specified to join entity_dataframe columns that do not match the column names in the source table of a FeatureView.

This could be used if a user has no control over these column names, or if multiple entities are subclasses of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity, as shown below.

It is suggested that you dynamically specify the new FeatureView name using .with_name and join_key_map override using .with_join_key_map instead of needing to register each new copy.

from feast import BigQuerySource, Entity, FeatureView, Field
from feast.types import Int32, Int64

location = Entity(name="location", join_keys=["location_id"])

location_stats_fv = FeatureView(
    name="location_stats",
    entities=[location],
    schema=[
        Field(name="temperature", dtype=Int32),
        Field(name="location_id", dtype=Int64),
    ],
    source=BigQuerySource(
        table="feast-oss.demo_data.location_stats"
    ),
)
from feast import FeatureService
from location_stats_feature_view import location_stats_fv

temperatures_fs = FeatureService(
    name="temperatures",
    features=[
        location_stats_fv
            .with_name("origin_stats")
            .with_join_key_map(
                {"location_id": "origin_id"}
            ),
        location_stats_fv
            .with_name("destination_stats")
            .with_join_key_map(
                {"location_id": "destination_id"}
            ),
    ],
)

Field

A field or feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made on an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month. Supported types for fields in Feast can be found in sdk/python/feast/types.py.

Fields are defined as part of feature views. Since Feast does not transform data, a field is essentially a schema that only contains a name and a type:

from feast import Field
from feast.types import Float32

trips_today = Field(
    name="trips_today",
    dtype=Float32
)

[Alpha] On demand feature views

On demand feature views allow data scientists to use existing features and request time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths.

Currently, these transformations are executed locally. This is fine for online serving, but does not scale well to offline retrieval.

Why use on demand feature views?

This enables data scientists to easily impact the online feature retrieval path. For example, a data scientist could

  1. Call get_historical_features to generate a training dataframe

  2. Iterate in notebook on feature engineering in Pandas

  3. Copy transformation logic into on demand feature views and commit to a dev branch of the feature repository

  4. Verify with get_historical_features (on a small dataset) that the transformation gives expected output over historical data

  5. Verify with get_online_features on dev branch that the transformation correctly outputs online features

  6. Submit a pull request to the staging / prod branches which impact production traffic

import pandas as pd

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# Define a request data source which encodes features / information only
# available at request time (e.g. part of the user initiated HTTP request)
input_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name="val_to_add", dtype=Int64),
        Field(name="val_to_add_2", dtype=Int64),
    ],
)

# Use the request data and feature view features to create new features.
# driver_hourly_stats_view is the feature view defined elsewhere in the feature repo.
@on_demand_feature_view(
    sources=[
        driver_hourly_stats_view,
        input_request,
    ],
    schema=[
        Field(name="conv_rate_plus_val1", dtype=Float64),
        Field(name="conv_rate_plus_val2", dtype=Float64),
    ],
)
def transformed_conv_rate(features_df: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["conv_rate_plus_val1"] = features_df["conv_rate"] + features_df["val_to_add"]
    df["conv_rate_plus_val2"] = features_df["conv_rate"] + features_df["val_to_add_2"]
    return df

[Alpha] Stream feature views

A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.

Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting. Here is an example definition of a stream feature view with an attached transformation:

from datetime import timedelta

from pyspark.sql import DataFrame

from feast import Entity, Field, FileSource, KafkaSource, stream_feature_view
from feast.data_format import JsonFormat
from feast.types import Float32

driver_stats_batch_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_stream_source = KafkaSource(
    name="driver_stats_stream",
    kafka_bootstrap_servers="localhost:9092",
    topic="drivers",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch_source,
    message_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double, acc_rate double, created timestamp"
    ),
    watermark_delay_threshold=timedelta(minutes=5),
)

driver = Entity(name="driver", join_keys=["driver_id"])

@stream_feature_view(
    entities=[driver],
    ttl=timedelta(seconds=8640000000),
    mode="spark",
    schema=[
        Field(name="conv_percentage", dtype=Float32),
        Field(name="acc_percentage", dtype=Float32),
    ],
    timestamp_field="event_timestamp",
    online=True,
    source=driver_stats_stream_source,
)
def driver_hourly_stats_stream(df: DataFrame):
    from pyspark.sql.functions import col

    return (
        df.withColumn("conv_percentage", col("conv_rate") * 100.0)
        .withColumn("acc_percentage", col("acc_rate") * 100.0)
        .drop("conv_rate", "acc_rate")
    )

Concepts

Feature retrieval

Overview

Generally, Feast supports several patterns of feature retrieval:

  1. Training data generation (via feature_store.get_historical_features(...))

  2. Offline feature retrieval for batch scoring (via feature_store.get_historical_features(...))

  3. Online feature retrieval for real-time model predictions

    • via the SDK: feature_store.get_online_features(...)

    • via deployed feature server endpoints: requests.post('http://localhost:6566/get-online-features', data=json.dumps(online_request))
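For example, a deployed feature server can be queried over HTTP. A minimal sketch, assuming a feature server (feast serve) is running locally on port 6566 and the referenced features exist in your registry:

import json

import requests

online_request = {
    # feature references (assumed to be registered in the feature repo)
    "features": [
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
    ],
    # entity join key -> list of entity values
    "entities": {"driver_id": [1001, 1002]},
}

response = requests.post(
    "http://localhost:6566/get-online-features",
    data=json.dumps(online_request),
)
print(response.json())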

Each of these retrieval mechanisms accepts:

  • some way of specifying entities (to fetch features for)

  • some way of specifying the features to fetch (either via feature services, which group features needed for a model version, or via individual feature references)

For code examples of how the below work, inspect the generated repository from feast init -t [YOUR TEMPLATE] (gcp, snowflake, and aws are the most fully fleshed out).

Concepts

Before diving into how to retrieve features, we need to understand some high level concepts in Feast.

Feature Services

Feature services are used during

  • The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.

  • Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now())

  • Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.

Applying a feature service does not result in an actual service being deployed.

Feature services enable referencing all or some features from a feature view.

Retrieving from the online store with a feature service

Retrieving from the offline store with a feature service

Feature References

This mechanism of retrieving features is only recommended as you're experimenting. Once you want to launch experiments or serve models, feature services are recommended.

Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>

Feature references are used for the retrieval of features from Feast:

It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.

Event timestamp

The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.

Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.

Dataset

A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.

Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.

Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.

Retrieving historical features (for training data or batch scoring)

Feast abstracts away point-in-time join complexities with the get_historical_features API.

We go through the major steps, and also show example code. Note that the quickstart templates generally have end-to-end working examples for all these cases.

Full example: generate training data
Full example: retrieve offline features for batch scoring

The main difference here compared to training data generation is how to handle timestamps in the entity dataframe. You want to pass in the current time to get the latest feature values for all your entities.

Step 1: Specifying Features

Feast accepts either a feature service or a list of feature references:

Example: querying a feature service (recommended)

Example: querying a list of feature references

Step 2: Specifying Entities

Feast accepts either a Pandas dataframe as the entity dataframe (including entity keys and timestamps) or a SQL query to generate the entities.

Both approaches must specify the full entity key needed as well as the timestamps. Feast then joins features onto this dataframe.

Example: entity dataframe for generating training data

Example: entity SQL query for generating training data

You can also pass a SQL string to generate the above dataframe. This is useful for getting all entities in a timeframe from some data source.

Retrieving online features (for model inference)

Feast will ensure the latest feature values for registered features are available. At retrieval time, you need to supply a list of entities and the corresponding features to be retrieved. Similar to get_historical_features, we recommend using feature services as a mechanism for grouping features in a model version.

Note: unlike get_historical_features, the entity_rows do not need timestamps since you only want one feature value per entity key.

There are several options for retrieving online features: through the SDK, or through a feature server

Full example: retrieve online features for real-time model inference (Python SDK)
Full example: retrieve online features for real-time model inference (Feature Server)

Registry

Feast uses a registry to store all applied Feast objects (e.g. Feature views, entities, etc). The registry exposes methods to apply, list, retrieve and delete these objects, and is an abstraction with multiple implementations.

Options for registry implementations

File-based registry

By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as a serialized file. This registry file can be stored in a local file system, or in cloud storage (e.g. in S3, GCS, or Azure).

The quickstart guides that use feast init will use a registry on a local file system. To allow Feast to configure a remote file registry, you need to create a GCS / S3 bucket that Feast can understand:
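For example, a sketch of a feature_store.yaml pointing the registry at cloud storage (the bucket name is a placeholder for a bucket you have created):

project: my_project
provider: gcp
# hypothetical bucket path; replace with your own GCS / S3 bucket
registry: gs://my-feast-bucket/registry.pb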

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).

SQL Registry

Updating the registry

Accessing the registry from clients

Users can specify the registry through a feature_store.yaml config file, or programmatically. We often see teams preferring the programmatic approach because it makes notebook driven development very easy:

Option 1: programmatically specifying the registry

Option 2: specifying the registry in the project's feature_store.yaml file

Instantiating a FeatureStore object can then point to this:
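A minimal sketch, assuming feature_store.yaml sits in the current directory:

from feast import FeatureStore

# Reads feature_store.yaml (including its registry setting) from the repo path
store = FeatureStore(repo_path=".")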

Entity

An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.

The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.

Entities are used by Feast in many contexts, as we explore below:

Use case #1: Defining and storing features

Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:

  • zero entities (e.g. a global feature like num_daily_global_transactions)

  • one entity (e.g. a user feature like user_age or last_5_bought_items)

  • multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)

Feast refers to this collection of entities for a feature view as an entity key.
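For illustration, a hedged sketch of a feature view keyed on two entities, so its entity key is the composite (user_id, merchant_category_id); all names and the source path here are hypothetical:

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Int64

user = Entity(name="user", join_keys=["user_id"])
merchant_category = Entity(name="merchant_category", join_keys=["merchant_category_id"])

user_category_stats_fv = FeatureView(
    name="user_merchant_category_stats",
    entities=[user, merchant_category],
    schema=[
        Field(name="num_user_purchases_in_merchant_category", dtype=Int64),
    ],
    source=FileSource(
        path="data/user_merchant_category_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)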

Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists to understand how other teams build features for the entity they are most interested in.

Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.

Use case #2: Retrieving features

At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).

Q: Can I retrieve features for all entities?

Kind of.

For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.

Point-in-time joins

Feature values in Feast are modeled as time-series records. For example, imagine a driver feature view with two feature columns (trips_today and earnings_today), keyed by driver_id and timestamped per event.

Such a table can be registered with Feast through a feature view like the following:
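A sketch of such a feature view (the source path and TTL are assumptions for this example):

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(hours=2),  # assumed TTL; bounds how far back Feast scans for feature rows
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="earnings_today", dtype=Float32),
    ],
    source=FileSource(
        path="data/driver_hourly_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)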

Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.

Now imagine an entity dataframe containing driver_id entity keys, event timestamps, and a trip_success label column, and a user who would like to join the driver_hourly_stats feature view onto it while preserving the trip_success column.

The timestamps within the entity dataframe are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:
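A sketch of that retrieval, with an illustrative (hypothetical) entity dataframe:

from datetime import datetime

import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Hypothetical entity dataframe: entity keys, event timestamps, and a label column
entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1001, 1003],
        "event_timestamp": [
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
        "trip_success": [True, False, True, True],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:trips_today",
        "driver_hourly_stats:earnings_today",
    ],
).to_df()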

For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time specified.

Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).

The resulting joined training dataframe contains both the original entity rows and the joined feature values.

Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.

[Alpha] Saved dataset

A saved dataset can be created from:

  1. Results of historical retrieval

  2. [planned] Logging features during writing to online store (from batch source or stream)

Creating a saved dataset from historical retrieval

To create a saved dataset from historical features for later retrieval or analysis, a user needs to call the get_historical_features method first and then pass the returned retrieval job to the create_saved_dataset method. create_saved_dataset will trigger the provided retrieval job (by calling .persist() on it) to store the data using the specified storage behind the scenes. The storage type must be the same as the globally configured offline store (e.g. it is not possible to persist data to a different offline source). create_saved_dataset will also create a SavedDataset object with all of the related metadata and will write this object to the registry.
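A hedged sketch of that flow, assuming a file-based offline store and an entity_df already built as in the earlier examples (SavedDatasetFileStorage lives in feast.infra.offline_stores.file_source in recent Feast versions):

from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")

# Run historical retrieval first; do not call .to_df() so the job can be persisted
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"],
)

# Persist the retrieval result and register the SavedDataset in the registry
dataset = store.create_saved_dataset(
    from_=job,
    name="my_training_dataset",
    storage=SavedDatasetFileStorage(path="data/my_training_dataset.parquet"),
)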

Saved dataset can be retrieved later using the get_saved_dataset method in the feature store:
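A short sketch of retrieving it again (dataset name as assumed above):

from feast import FeatureStore

store = FeatureStore(repo_path=".")

dataset = store.get_saved_dataset("my_training_dataset")
df = dataset.to_df()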


Architecture

The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.

For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline / online. We explore this more in the next concept page (Data ingestion).

In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see

Note that there are many other offline / online stores Feast works with, including Spark, Azure, Hive, Trino, and PostgreSQL via community plugins. See for all supported data sources.

A custom setup can also be made by following .

The user can also query that table with a SQL query which pulls entities. See the documentation on for details

You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API below. More information can be found .

Read the page to understand the Feast data model.

Read the page.

Check out our section for more examples on how to use Feast.

Follow our guide for a more in-depth tutorial on using Feast.

Join other Feast users and contributors in and become part of the community!

: Find the complete Feast codebase on GitHub.

: See the governance model of Feast, including who the maintainers are and how decisions are made.

: Feel free to ask questions or say hello! This is the main place where maintainers and contributors brainstorm and where users ask questions or discuss best practices.

: We have both a user and developer mailing list.

Feast users should join group by clicking .

Feast developers / contributors should join group by clicking .

: Includes community calls and design meetings.

: This folder is used as a central repository for all Feast resources. For example:

: Our LFAI wiki page contains links to resources for contributors and maintainers.

GitHub Issues: Found a bug or need a feature? .

StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to .

Zoom:

Meeting notes (incl recordings):

Meeting notes (incl recordings):

Zoom:

Kafka / Kinesis sources (via )

On-demand Transformations (Alpha release. See )

Streaming Transformations (Alpha release. See )

Batch transformation (In progress. See )

AWS Lambda (Alpha release. See )

Kubernetes (See )

Data Quality Management (See )

Amundsen integration (see )

DataHub integration (see )

Feast Web UI (Beta release. See )

an ETL / ELT system: Feast is not (and does not plan to become) a general purpose data transformation or pipelining system. Users often leverage tools like dbt to manage upstream data transformations.

a data orchestration tool: Feast does not manage or orchestrate complex workflow DAGs. It relies on upstream data pipelines to produce feature values and integrations with tools like Airflow to make features consistently available.

reproducible model training / model backtesting / experiment management: Feast captures feature and model metadata, but does not version-control datasets / labels or manage train / test splits. Other tools like DVC, MLflow, and Kubeflow are better suited for this.

batch + streaming feature engineering: Feast primarily processes already transformed feature values (though it offers experimental light-weight transformations). Users usually integrate Feast with upstream systems (e.g. existing ETL/ELT pipelines). Tecton is a more fully featured feature platform which addresses these needs.

native streaming feature integration: Feast enables users to push streaming features, but does not pull from streaming sources or manage streaming pipelines. Tecton is a more fully featured feature platform which orchestrates end to end streaming pipelines.

feature sharing: Feast has experimental functionality to enable discovery and cataloguing of feature metadata with a Feast web UI (alpha). Feast also has community contributed plugins with DataHub and Amundsen. Tecton also more robustly addresses these needs.

lineage: Feast helps tie feature values to model versions, but is not a complete solution for capturing end-to-end lineage from raw data sources to model versions. Feast also has community contributed plugins with DataHub and Amundsen. Tecton captures more end-to-end lineage by also managing feature transformations.

data quality / drift detection: Feast has experimental integrations with Great Expectations, but is not purpose built to solve data drift / data quality issues. This requires more sophisticated monitoring across data pipelines, served feature values, labels, and model versions.

The best way to learn Feast is to use it. Join our Slack channel and head over to our Quickstart and try it out!

Quickstart is the fastest way to get started with Feast.

Concepts describes all important Feast API concepts.

Architecture describes Feast's overall architecture.

Tutorials shows full examples of using Feast in machine learning applications.

Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.

Reference contains detailed API and design documents.

Contributing contains resources for anyone who wants to contribute to Feast.

A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see [Alpha] On demand feature views).

a

zero or more

If the features are not related to a specific object, the feature view might not have entities; see below.

(optional, but recommended) a schema specifying one or more (without this, Feast will infer the schema by reading from the data source)

Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from .

Together with , they indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using .

Feature names must be unique within a .

Each field can have additional metadata associated with it, specified as key-value .

See for a example of how to use stream feature views to register your own streaming data pipelines in Feast.

some way to specify the features to fetch (either via , which group features needed for a model version, or )

Before beginning, you need to instantiate a local FeatureStore object that knows how to parse the registry (see )

A feature service is an object that represents a logical group of features from one or more . Feature Services allows features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.

Note, if you're using , then those features can be added here without additional entity values in the entity_rows parameter.

, which group features needed for a model version

This approach requires you to deploy a feature server (see ).

Alternatively, a can be used for a more scalable registry.

This supports any SQLAlchemy compatible database as a backend. The exact schema can be seen in

We recommend users store their Feast feature definitions in a version controlled repository, which then via CI/CD automatically stays synced with the registry. Users will often also want multiple registries to correspond to different environments (e.g. dev vs staging vs prod), with staging and production registries with locked down write access since they can impact real user traffic. See for details on how to set this up.

At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch correct features for to generate a training dataset.

In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an that welcomes contribution to make this a more intuitive API.

Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. was the primary motivation for creating dataset concept.

Dataset's metadata is stored in the Feast registry and raw data (features, entities, additional input keys and timestamp) is stored in the .

[planned] Logging request (including input for ) and response during feature serving

Check out our to see how this concept can be applied in a real-world use case.

Data ingestion
Running Feast with Snowflake/GCP/AWS
Third party integrations
Customizing Feast
feature retrieval
here
Concepts
Architecture
Tutorials
Running Feast with Snowflake/GCP/AWS
Slack
GitHub Repository
Community Governance Doc
Slack
Mailing list
feast-discuss@googlegroups.com
here
feast-dev@googlegroups.com
here
Community Calendar
Google Folder
Feast Linux Foundation Wiki
Create an issue on GitHub
StackOverflow
https://zoom.us/j/6325193230
https://bit.ly/feast-notes
Feast Development Biweekly
https://zoom.us/j/93657748160?pwd=K3ZpdzhqejgrcXNhc3BlSjFMdzUxdz09
Snowflake source
Redshift source
BigQuery source
Parquet file source
Azure Synapse + Azure SQL source (contrib plugin)
Hive (community plugin)
Postgres (contrib plugin)
Spark (contrib plugin)
push support into the online store
Snowflake
Redshift
BigQuery
Azure Synapse + Azure SQL (contrib plugin)
Hive (community plugin)
Postgres (contrib plugin)
Trino (contrib plugin)
Spark (contrib plugin)
In-memory / Pandas
Custom offline store support
Snowflake
DynamoDB
Redis
Datastore
SQLite
Azure Cache for Redis (community plugin)
Postgres (contrib plugin)
Custom online store support
Cassandra / AstraDB
RFC
RFC
RFC
Custom streaming ingestion job support
Push based streaming data ingestion to online store
Push based streaming data ingestion to offline store
RFC
guide
Python feature server
Java feature server (alpha)
Go feature server (alpha)
RFC
Feast extractor
DataHub Feast docs
docs
ETL
ELT
dbt
Airflow
DVC
MLflow
Kubeflow
Tecton
Tecton
Feast web UI (alpha)
DataHub
Amundsen
Tecton
DataHub
Amundsen
Tecton
Great Expectations
Slack channel
Quickstart
Quickstart
Concepts
Architecture
Tutorials
Running Feast with Snowflake/GCP/AWS
Reference
Contributing
data source
entities
stream sources
tags
here
data source
[Alpha] On demand feature views
feature views without entities
features
feature view
from feast import FeatureService
from driver_ratings_feature_view import driver_ratings_fv
from driver_trips_feature_view import driver_stats_fv

driver_stats_fs = FeatureService(
    name="driver_activity",
    features=[driver_stats_fv, driver_ratings_fv[["lifetime_rating"]]]
)
from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity")
features = feature_store.get_online_features(
    features=feature_service, entity_rows=[entity_dict]
)
from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity")
feature_store.get_historical_features(features=feature_service, entity_df=entity_df)
online_features = fs.get_online_features(
    features=[
        'driver_locations:lon',
        'drivers_activity:trips_today'
    ],
    entity_rows=[
        # {join_key: entity_value}
        {'driver': 'driver_1001'}
    ]
)
entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003, 1004, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime.now()
        ]
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("model_v1"),
).to_df()
print(training_df.head())
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get the latest feature values for unique entities
entity_sql = f"""
    SELECT
        driver_id,
        CURRENT_TIMESTAMP() as event_timestamp
    FROM {store.get_data_source("driver_hourly_stats_source").get_table_query_string()}
    WHERE event_timestamp BETWEEN '2021-01-01' and '2021-12-31'
    GROUP BY driver_id
"""
batch_scoring_features = store.get_historical_features(
    entity_df=entity_sql,
    features=store.get_feature_service("model_v2"),
).to_df()
# predictions = model.predict(batch_scoring_features)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("model_v1"),
).to_df()
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven"
    ],
).to_df()
entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003, 1004, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime.now()
        ]
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven"
    ],
).to_df()
entity_sql = f"""
    SELECT
        driver_id,
        event_timestamp
    FROM {store.get_data_source("driver_hourly_stats_source").get_table_query_string()}
    WHERE event_timestamp BETWEEN '2021-01-01' and '2021-12-31'
"""
training_df = store.get_historical_features(
    entity_df=entity_sql,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven"
    ],
).to_df()
from feast import RepoConfig, FeatureStore
from feast.repo_config import RegistryConfig

repo_config = RepoConfig(
    registry=RegistryConfig(path="gs://feast-test-gcs-bucket/registry.pb"),
    project="feast_demo_gcp",
    provider="gcp",
)
store = FeatureStore(config=repo_config)

features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven",
    ],
    entity_rows=[
        {
            "driver_id": 1001,
        }
    ],
).to_dict()
import requests
import json

online_request = {
    "features": [
        "driver_hourly_stats:conv_rate",
    ],
    "entities": {"driver_id": [1001, 1002]},
}
r = requests.post('http://localhost:6566/get-online-features', data=json.dumps(online_request))
print(json.dumps(r.json(), indent=4, sort_keys=True))
project: feast_demo_aws
provider: aws
registry: s3://[YOUR BUCKET YOU CREATED]/registry.pb
online_store: null
offline_store:
  type: file
project: feast_demo_gcp
provider: gcp
registry: gs://[YOUR BUCKET YOU CREATED]/registry.pb
online_store: null
offline_store:
  type: file
repo_config = RepoConfig(
    registry=RegistryConfig(path="gs://feast-test-gcs-bucket/registry.pb"),
    project="feast_demo_gcp",
    provider="gcp",
    offline_store="file",  # Could also be the OfflineStoreConfig e.g. FileOfflineStoreConfig
    online_store="null",  # Could also be the OnlineStoreConfig e.g. RedisOnlineStoreConfig
)
store = FeatureStore(config=repo_config)
project: feast_demo_aws
provider: aws
registry: s3://feast-test-s3-bucket/registry.pb
online_store: null
offline_store:
  type: file
store = FeatureStore(repo_path=".")
driver = Entity(name='driver', join_keys=['driver_id'])
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="earnings_today", dtype=Float32),
    ],
    ttl=timedelta(hours=2),
    source=FileSource(
        path="driver_hourly_stats.parquet"
    )
)
# Read in entity dataframe
entity_df = pd.read_csv("entity_df.csv")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features = [
        'driver_hourly_stats:trips_today',
        'driver_hourly_stats:earnings_today'
    ],
)
from feast import FeatureStore
from feast.infra.offline_stores.bigquery_source import SavedDatasetBigQueryStorage

store = FeatureStore()

historical_job = store.get_historical_features(
    features=["driver:avg_trip"],
    entity_df=...,
)

dataset = store.create_saved_dataset(
    from_=historical_job,
    name='my_training_dataset',
    storage=SavedDatasetBigQueryStorage(table_ref='<gcp-project>.<gcp-dataset>.my_training_dataset'),
    tags={'author': 'oleksii'}
)

dataset.to_df()
dataset = store.get_saved_dataset('my_training_dataset')
dataset.to_df()

Online store

Feast uses online stores to serve features at low latency. Feature values are loaded from data sources into the online store through materialization, which can be triggered through the materialize command.

Here is an example batch data source:
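A minimal sketch of such a source, assuming driver statistics stored in a local Parquet file (the path and column names below are illustrative):

from feast import FileSource

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",    # batch data lives in a Parquet file
    timestamp_field="event_timestamp",   # column used for point-in-time joins
    created_timestamp_column="created",
)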

Once the above data source is materialized into Feast (using feast materialize), only the latest feature values for each entity key will be stored in the online store.

Registry

The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.

Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports four different backends.

  • Local: Used as a local backend for storing the registry during development

  • S3: Used as a centralized backend for storing the registry on AWS

  • GCS: Used as a centralized backend for storing the registry on GCP

  • [Alpha] Azure: Used as centralized backend for storing the registry on Azure Blob storage.
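In practice, the backend is selected by the registry path configured in feature_store.yaml; a rough sketch (bucket names are placeholders):

# Local backend
registry: data/registry.db

# S3 backend
registry: s3://my-feast-bucket/registry.pb

# GCS backend
registry: gs://my-feast-bucket/registry.pb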

The feature registry is updated during different operations when using Feast. More specifically, objects within the registry (entities, feature views, feature services) are updated when running apply from the Feast CLI, but metadata about objects can also be updated during operations like materialization.

Users interact with a feature registry through the Feast SDK. Listing all feature views:

from feast import FeatureStore

fs = FeatureStore("my_feature_repo/")
print(fs.list_feature_views())

Or retrieving a specific feature view:

fs = FeatureStore("my_feature_repo/")
fv = fs.get_feature_view("my_fv1")

Overview

Functionality

  • Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.

  • Feast Apply: The user (or CI) publishes version controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.

  • Feast Materialize: The user (or scheduler) executes feast materialize which loads features from the offline store into the online store.

  • Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset that can be used for training models.

  • Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.

  • Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.

  • Prediction: A backend system makes a request for a prediction from the model serving service.

  • Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.

Components

A complete Feast deployment contains the following components:

  • Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.

  • Feast Python SDK/CLI: The primary user facing SDK. Used to:

    • Manage version controlled feature definitions.

    • Materialize (load) feature values into the online store.

    • Build and retrieve training datasets from the offline store.

    • Retrieve online features.

  • Stream Processor: The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.

  • Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.

Java and Go Clients are also available for online feature retrieval.

Batch Materialization Engine

A batch materialization engine is a component of Feast that's responsible for moving data from the offline store into the online store.

A materialization engine abstracts over the specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (which is the default LocalMaterializationEngine), or to delegate the materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).

Offline store

Offline stores are primarily used for two reasons:

  1. Building training datasets from time-series features.

  2. Materializing (loading) features into an online store to serve those features at low-latency in a production setting.

Only a single offline store can be used at a time. Moreover, offline stores are not compatible with all data sources; for example, the BigQuery offline store cannot be used to query a file-based data source.

Provider

A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).

Third party integrations

We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.

Don't see your offline store or online store of choice here? Check out our guides to make a custom one!

Integrations

Standards

In order for a plugin integration to be highlighted, it must meet the following requirements:

  1. The plugin must have some basic documentation on how it should be used.

  2. The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).

In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:

  1. The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.

  2. There is documentation and a tutorial on how to use the integration.

  3. The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.

  4. If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.

Fraud detection on GCP

This tutorial builds an end-to-end, production-ready fraud prediction system, a common use case in machine learning. It predicts in real-time whether a transaction made by a user is fraudulent.

Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our end-to-end example will perform the following workflows:

  • Computing and backfilling feature data from raw data

  • Building point-in-time correct training datasets from feature data and training a model

  • Making online predictions from feature data

Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):

Sample use-case tutorials

These Feast tutorials showcase how to use Feast to simplify end-to-end model training and serving.

Real-time credit scoring on AWS

Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.

In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.

This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.

This end-to-end tutorial will take you through the following steps:

  • Deploying Redshift as the interface Feast uses to build training datasets

  • Registering your features with Feast and configuring DynamoDB for online serving

  • Building a training dataset with Feast to train your credit scoring model

  • Loading feature values from S3 into DynamoDB

  • Making online predictions with your credit scoring model using features from DynamoDB

Install Feast

Install Feast using pip:

pip install feast

Install Feast with Snowflake dependencies (required when using Snowflake):

pip install 'feast[snowflake]'

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

pip install 'feast[gcp]'

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

pip install 'feast[aws]'

Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):

pip install 'feast[redis]'

Building streaming features

Running Feast with Snowflake/GCP/AWS

Validating historical features with Great Expectations

In this tutorial, we will use the public dataset of Chicago taxi trips to present data validation capabilities of Feast.

  • The original dataset is stored in BigQuery and consists of raw data for each taxi trip (one row per trip) since 2013.

  • We will generate several training datasets (aka historical features in Feast) for different periods and evaluate expectations made on one dataset against another.

Types of features we're ingesting and generating:

  • Features that aggregate raw data with daily intervals (e.g., trips per day, average fare or speed for a specific day, etc.).

  • Features calculated using SQL while pulling data from BigQuery (like total trip time or total miles travelled).

  • Features calculated on the fly when requested using Feast's on-demand transformations

Our plan:

  1. Prepare environment

  2. Pull data from BigQuery (optional)

  3. Declare & apply features and feature views in Feast

  4. Generate reference dataset

  5. Develop & test profiler function

  6. Run validation on different dataset using reference dataset & profiler

0. Setup

Install Feast Python SDK and great expectations:

!pip install 'feast[ge]'

1. Dataset preparation (Optional)

You can skip this step if you don't have a GCP account. Please use the parquet files that come with this tutorial instead.

Running some basic aggregations while pulling data from BigQuery. Grouping by taxi_id and day:

2. Declaring features

3. Generating training (reference) dataset

Generating range of timestamps with daily frequency:

Cross merge (aka relation multiplication) produces an entity dataframe with each taxi_id repeated for each timestamp:

156984 rows × 2 columns

Retrieving historical features for the resulting entity dataframe and persisting the output as a saved dataset:

4. Developing dataset profiler

A dataset profiler is a function that accepts a dataset and generates a set of its characteristics. These characteristics will then be used to evaluate (validate) subsequent datasets.

Important: datasets are not compared to each other! Feast uses a reference dataset and a profiler function to generate a reference profile. This profile will then be used during validation of the tested dataset.

Loading saved dataset first and exploring the data:

156984 rows × 10 columns

Testing our profiler function:

Verify that all expectations that we coded in our profiler are present here. Otherwise (if you can't find some expectations), it means that they failed to pass on the reference dataset (failing silently is the default behavior of Great Expectations).

Now we can create validation reference from dataset and profiler function:

and test it against our existing retrieval job

Validation successfully passed as no exceptions were raised.

5. Validating new historical retrieval

Creating new timestamps for Dec 2020:

35448 rows × 2 columns

Execute retrieval job with validation reference:

Validation failed since several expectations didn't pass:

  • Trip count (mean) decreased more than 10% (which is expected when comparing Dec 2020 vs June 2019)

  • Average Fare increased - all quantiles are higher than expected

  • Earn per hour (mean) increased more than 10% (most probably due to increased fare)

The storage schema of features within the online store mirrors that of the original data source. One key difference is that for each entity key, only the latest feature values are stored. No historical values are stored.

Features can also be written directly to the online store via push sources.

The feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.

Create Stream Features: Stream features are created from streaming services such as Kafka or Kinesis, and can be pushed directly into Feast via the Push API.

Batch Materialization Engine: The component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.

Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through stream ingestion.

If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see the guide on adding a custom batch materialization engine for more details.

Please see feature_store.yaml for configuring engines.

An offline store is an interface for working with historical time-series feature values that are stored in data sources. The OfflineStore interface has several different implementations, such as the BigQueryOfflineStore, each of which is backed by a different storage and compute engine. For more details on which offline stores are supported, please see Offline Stores.

Offline stores are configured through the feature_store.yaml. When building training datasets or materializing features into an online store, Feast will use the configured offline store with your configured data sources to execute the necessary data operations.

Please see Push Source for more details on how to push features directly to the offline store in your feature store.

Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.

If the built-in providers are not sufficient, you can create your own custom provider. Please see the guide on adding a custom provider for more details.

Please see feature_store.yaml for configuring providers.
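As a sketch, overriding the default online store for the gcp provider described above might look like the following feature_store.yaml (the bucket name and connection string are placeholders):

project: my_project
provider: gcp
registry: gs://my-feast-bucket/registry.pb
offline_store:
  type: bigquery
online_store:
  type: redis
  connection_string: localhost:6379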

See Functionality and Roadmap for the full list of integrations.

The plugin must have tests. Ideally it would use the Feast universal tests (see this guide for an example), but custom tests are fine.

Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features

Install Feast using pip:

Feast supports registering streaming feature views and Kafka and Kinesis streaming sources. It also provides an interface for stream processing called the Stream Processor. An example Kafka/Spark StreamProcessor is implemented in the contrib folder. For more details, please see the RFC.

Please see the Building streaming features tutorial for how to build a versioned streaming pipeline that registers your transformations, features, and data sources in Feast.

The original notebook and datasets for this tutorial can be found on GitHub.

Read more about feature views in the Feast docs.

Read more about on demand feature views in the Feast docs.


Feast uses Great Expectations as a validation engine and ExpectationSuite as a dataset's profile. Hence, we need to develop a function that will generate an ExpectationSuite. This function will receive an instance of PandasDataset (a wrapper around pandas.DataFrame), so we can utilize both the Pandas DataFrame API and some helper functions from PandasDataset during profiling.

!pip install google-cloud-bigquery
import pyarrow.parquet

from google.cloud.bigquery import Client
bq_client = Client(project='kf-feast')
data_query = """SELECT
    taxi_id,
    TIMESTAMP_TRUNC(trip_start_timestamp, DAY) as day,
    SUM(trip_miles) as total_miles_travelled,
    SUM(trip_seconds) as total_trip_seconds,
    SUM(fare) as total_earned,
    COUNT(*) as trip_count
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
    trip_miles > 0 AND trip_seconds > 60 AND
    trip_start_timestamp BETWEEN '2019-01-01' and '2020-12-31' AND
    trip_total < 1000
GROUP BY taxi_id, TIMESTAMP_TRUNC(trip_start_timestamp, DAY)"""
driver_stats_table = bq_client.query(data_query).to_arrow()

# Storing resulting dataset into parquet file
pyarrow.parquet.write_table(driver_stats_table, "trips_stats.parquet")
def entities_query(year):
    return f"""SELECT
    distinct taxi_id
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
    trip_miles > 0 AND trip_seconds > 0 AND
    trip_start_timestamp BETWEEN '{year}-01-01' and '{year}-12-31'
"""
entities_2019_table = bq_client.query(entities_query(2019)).to_arrow()

# Storing entities (taxi ids) into parquet file
pyarrow.parquet.write_table(entities_2019_table, "entities.parquet")
import pyarrow.parquet
import pandas as pd

from feast import FeatureView, Entity, FeatureStore, Field, BatchFeatureView
from feast.types import Float64, Int64
from feast.value_type import ValueType
from feast.data_format import ParquetFormat
from feast.on_demand_feature_view import on_demand_feature_view
from feast.infra.offline_stores.file_source import FileSource
from feast.infra.offline_stores.file import SavedDatasetFileStorage
from datetime import timedelta
batch_source = FileSource(
    timestamp_field="day",
    path="trips_stats.parquet",  # using parquet file that we created on previous step
    file_format=ParquetFormat()
)
taxi_entity = Entity(name='taxi', join_keys=['taxi_id'])
trips_stats_fv = BatchFeatureView(
    name='trip_stats',
    entities=['taxi'],
    features=[
        Field(name="total_miles_travelled", dtype=Float64),
        Field(name="total_trip_seconds", dtype=Float64),
        Field(name="total_earned", dtype=Float64),
        Field(name="trip_count", dtype=Int64),

    ],
    ttl=timedelta(seconds=86400),
    source=batch_source,
)
@on_demand_feature_view(
    schema=[
        Field("avg_fare", Float64),
        Field("avg_speed", Float64),
        Field("avg_trip_seconds", Float64),
        Field("earned_per_hour", Float64),
    ],
    sources=[
      trips_stats_fv,
    ]
)
def on_demand_stats(inp):
    out = pd.DataFrame()
    out["avg_fare"] = inp["total_earned"] / inp["trip_count"]
    out["avg_speed"] = 3600 * inp["total_miles_travelled"] / inp["total_trip_seconds"]
    out["avg_trip_seconds"] = inp["total_trip_seconds"] / inp["trip_count"]
    out["earned_per_hour"] = 3600 * inp["total_earned"] / inp["total_trip_seconds"]
    return out
store = FeatureStore(".")  # using feature_store.yaml that stored in the same directory
store.apply([taxi_entity, trips_stats_fv, on_demand_stats])  # writing to the registry
taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-06-01", "2019-07-01", freq='D')
entity_df = pd.merge(taxi_ids, timestamps, how='cross')
entity_df

        taxi_id                                             event_timestamp
0       91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2019-06-01
1       91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2019-06-02
2       91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2019-06-03
3       91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2019-06-04
4       91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2019-06-05
...     ...                                                 ...
156979  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2019-06-27
156980  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2019-06-28
156981  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2019-06-29
156982  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2019-06-30
156983  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2019-07-01

[156984 rows × 2 columns]
job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "trip_stats:total_miles_travelled",
        "trip_stats:total_trip_seconds",
        "trip_stats:total_earned",
        "trip_stats:trip_count",
        "on_demand_stats:avg_fare",
        "on_demand_stats:avg_trip_seconds",
        "on_demand_stats:avg_speed",
        "on_demand_stats:earned_per_hour",
    ]
)

store.create_saved_dataset(
    from_=job,
    name='my_training_ds',
    storage=SavedDatasetFileStorage(path='my_training_ds.parquet')
)
<SavedDataset(name = my_training_ds, features = ['trip_stats:total_miles_travelled', 'trip_stats:total_trip_seconds', 'trip_stats:total_earned', 'trip_stats:trip_count', 'on_demand_stats:avg_fare', 'on_demand_stats:avg_trip_seconds', 'on_demand_stats:avg_speed', 'on_demand_stats:earned_per_hour'], join_keys = ['taxi_id'], storage = <feast.infra.offline_stores.file_source.SavedDatasetFileStorage object at 0x1276e7950>, full_feature_names = False, tags = {}, _retrieval_job = <feast.infra.offline_stores.file.FileRetrievalJob object at 0x12716fed0>, min_event_timestamp = 2019-06-01 00:00:00, max_event_timestamp = 2019-07-01 00:00:00)>
import numpy as np

from feast.dqm.profilers.ge_profiler import ge_profiler

from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import PandasDataset
ds = store.get_saved_dataset('my_training_ds')
ds.to_df()

        total_earned  avg_trip_seconds  taxi_id                                             total_miles_travelled  trip_count  earned_per_hour  event_timestamp            total_trip_seconds  avg_fare   avg_speed
0       68.25         2270.000000       91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  24.70                  2.0         54.118943        2019-06-01 00:00:00+00:00  4540.0              34.125000  19.585903
1       221.00        560.500000        7a4a6162eaf27805aef407d25d5cb21fe779cd962922cb...  54.18                  24.0        59.143622        2019-06-01 00:00:00+00:00  13452.0             9.208333   14.499554
2       160.50        1010.769231       f4c9d05b215d7cbd08eca76252dae51cdb7aca9651d4ef...  41.30                  13.0        43.972603        2019-06-01 00:00:00+00:00  13140.0             12.346154  11.315068
3       183.75        697.550000        c1f533318f8480a59173a9728ea0248c0d3eb187f4b897...  37.30                  20.0        47.415956        2019-06-01 00:00:00+00:00  13951.0             9.187500   9.625116
4       217.75        1054.076923       455b6b5cae6ca5a17cddd251485f2266d13d6a2c92f07c...  69.69                  13.0        57.206451        2019-06-01 00:00:00+00:00  13703.0             16.750000  18.308692
...     ...           ...               ...                                                 ...                    ...         ...              ...                        ...                 ...        ...
156979  38.00         1980.000000       0cccf0ec1f46d1e0beefcfdeaf5188d67e170cdff92618...  14.90                  1.0         69.090909        2019-07-01 00:00:00+00:00  1980.0              38.000000  27.090909
156980  135.00        551.250000        beefd3462e3f5a8e854942a2796876f6db73ebbd25b435...  28.40                  16.0        55.102041        2019-07-01 00:00:00+00:00  8820.0              8.437500   11.591837
156981  NaN           NaN               9a3c52aa112f46cf0d129fafbd42051b0fb9b0ff8dcb0e...  NaN                    NaN         NaN              2019-07-01 00:00:00+00:00  NaN                 NaN        NaN
156982  63.00         815.000000        08308c31cd99f495dea73ca276d19a6258d7b4c9c88e43...  19.96                  4.0         69.570552        2019-07-01 00:00:00+00:00  3260.0              15.750000  22.041718
156983  NaN           NaN               7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  NaN                    NaN         NaN              2019-07-01 00:00:00+00:00  NaN                 NaN        NaN

[156984 rows × 10 columns]

DELTA = 0.1  # controlling allowed window in fraction of the value on scale [0, 1]

@ge_profiler
def stats_profiler(ds: PandasDataset) -> ExpectationSuite:
    # simple checks on data consistency
    ds.expect_column_values_to_be_between(
        "avg_speed",
        min_value=0,
        max_value=60,
        mostly=0.99  # allow some outliers
    )

    ds.expect_column_values_to_be_between(
        "total_miles_travelled",
        min_value=0,
        max_value=500,
        mostly=0.99  # allow some outliers
    )

    # expectation of means based on observed values
    observed_mean = ds.trip_count.mean()
    ds.expect_column_mean_to_be_between("trip_count",
                                        min_value=observed_mean * (1 - DELTA),
                                        max_value=observed_mean * (1 + DELTA))

    observed_mean = ds.earned_per_hour.mean()
    ds.expect_column_mean_to_be_between("earned_per_hour",
                                        min_value=observed_mean * (1 - DELTA),
                                        max_value=observed_mean * (1 + DELTA))


    # expectation of quantiles
    qs = [0.5, 0.75, 0.9, 0.95]
    observed_quantiles = ds.avg_fare.quantile(qs)

    ds.expect_column_quantile_values_to_be_between(
        "avg_fare",
        quantile_ranges={
            "quantiles": qs,
            "value_ranges": [[None, max_value] for max_value in observed_quantiles]
        })

    return ds.get_expectation_suite()
ds.get_profile(profiler=stats_profiler)
02/02/2022 02:43:47 PM INFO:	5 expectation(s) included in expectation_suite. result_format settings filtered.
<GEProfile with expectations: [
  {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "avg_speed",
      "min_value": 0,
      "max_value": 60,
      "mostly": 0.99
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "total_miles_travelled",
      "min_value": 0,
      "max_value": 500,
      "mostly": 0.99
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "trip_count",
      "min_value": 10.387244591346153,
      "max_value": 12.695521167200855
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "earned_per_hour",
      "min_value": 52.320624975640214,
      "max_value": 63.94743052578249
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_quantile_values_to_be_between",
    "kwargs": {
      "column": "avg_fare",
      "quantile_ranges": {
        "quantiles": [
          0.5,
          0.75,
          0.9,
          0.95
        ],
        "value_ranges": [
          [
            null,
            16.4
          ],
          [
            null,
            26.229166666666668
          ],
          [
            null,
            36.4375
          ],
          [
            null,
            42.0
          ]
        ]
      }
    },
    "meta": {}
  }
]>
validation_reference = ds.as_reference(profiler=stats_profiler)
_ = job.to_df(validation_reference=validation_reference)
02/02/2022 02:43:52 PM INFO: 5 expectation(s) included in expectation_suite. result_format settings filtered.
02/02/2022 02:43:53 PM INFO: Validating data_asset_name None with expectation_suite_name default
from feast.dqm.errors import ValidationFailed
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2020-12-01", "2020-12-07", freq='D')
entity_df = pd.merge(taxi_ids, timestamps, how='cross')
entity_df

       taxi_id                                             event_timestamp
0      91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2020-12-01
1      91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2020-12-02
2      91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2020-12-03
3      91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2020-12-04
4      91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...  2020-12-05
...    ...                                                 ...
35443  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2020-12-03
35444  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2020-12-04
35445  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2020-12-05
35446  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2020-12-06
35447  7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...  2020-12-07

[35448 rows × 2 columns]

job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "trip_stats:total_miles_travelled",
        "trip_stats:total_trip_seconds",
        "trip_stats:total_earned",
        "trip_stats:trip_count",
        "on_demand_stats:avg_fare",
        "on_demand_stats:avg_trip_seconds",
        "on_demand_stats:avg_speed",
        "on_demand_stats:earned_per_hour",
    ]
)
try:
    df = job.to_df(validation_reference=validation_reference)
except ValidationFailed as exc:
    print(exc.validation_report)
02/02/2022 02:43:58 PM INFO: 5 expectation(s) included in expectation_suite. result_format settings filtered.
02/02/2022 02:43:59 PM INFO: Validating data_asset_name None with expectation_suite_name default

[
  {
    "expectation_config": {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "trip_count",
        "min_value": 10.387244591346153,
        "max_value": 12.695521167200855,
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "meta": {},
    "result": {
      "observed_value": 6.692920555429092,
      "element_count": 35448,
      "missing_count": 31055,
      "missing_percent": 87.6071992778154
    },
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    },
    "success": false
  },
  {
    "expectation_config": {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "earned_per_hour",
        "min_value": 52.320624975640214,
        "max_value": 63.94743052578249,
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "meta": {},
    "result": {
      "observed_value": 68.99268345164135,
      "element_count": 35448,
      "missing_count": 31055,
      "missing_percent": 87.6071992778154
    },
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    },
    "success": false
  },
  {
    "expectation_config": {
      "expectation_type": "expect_column_quantile_values_to_be_between",
      "kwargs": {
        "column": "avg_fare",
        "quantile_ranges": {
          "quantiles": [
            0.5,
            0.75,
            0.9,
            0.95
          ],
          "value_ranges": [
            [
              null,
              16.4
            ],
            [
              null,
              26.229166666666668
            ],
            [
              null,
              36.4375
            ],
            [
              null,
              42.0
            ]
          ]
        },
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "meta": {},
    "result": {
      "observed_value": {
        "quantiles": [
          0.5,
          0.75,
          0.9,
          0.95
        ],
        "values": [
          19.5,
          28.1,
          38.0,
          44.125
        ]
      },
      "element_count": 35448,
      "missing_count": 31055,
      "missing_percent": 87.6071992778154,
      "details": {
        "success_details": [
          false,
          false,
          false,
          false
        ]
      }
    },
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    },
    "success": false
  }
]

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Online stores only maintain the current state of features, i.e. the latest feature values. No historical data is stored or served.

Retrieving online features

1. Ensure that feature values have been loaded into the online store

Please ensure that you have materialized (loaded) your feature values into the online store before starting

2. Define feature references

Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.

features = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate"
]

3. Read online features

Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.

fs = FeatureStore(repo_path="path/to/feature/repo")
online_features = fs.get_online_features(
    features=features,
    entity_rows=[
        # {join_key: entity_value, ...}
        {"driver_id": 1001},
        {"driver_id": 1002}]
).to_dict()
{
   "driver_hourly_stats__acc_rate":[
      0.2897740304470062,
      0.6447265148162842
   ],
   "driver_hourly_stats__conv_rate":[
      0.6508077383041382,
      0.14802511036396027
   ],
   "driver_id":[
      1001,
      1002
   ]
}

Scaling Feast

Overview

Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap such components with more performant components, at the cost of Feast depending on additional infrastructure.

Scaling Feast Registry

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently). The recommended solution in this case is to use the SQL-based registry.
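A sketch of what switching to the SQL-based registry can look like in feature_store.yaml, assuming a Postgres instance (the connection string is a placeholder):

project: my_project
provider: local
registry:
  registry_type: sql
  path: postgresql://postgres:mysecretpassword@127.0.0.1:5432/feast
offline_store:
  type: file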

Scaling Materialization

The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale for large data sets, since it is executed as a single process.

Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.
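For example, when Snowflake is the offline store, materialization can be pushed down to Snowflake by selecting its engine in feature_store.yaml; a sketch (credentials elided, see the Snowflake template later in this document):

batch_engine:
  type: snowflake.engine
  account: SNOWFLAKE_DEPLOYMENT_URL
  user: USERNAME
  password: PASSWORD
  role: ROLE_NAME
  warehouse: WAREHOUSE_NAME
  database: DATABASE_NAME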

Create a feature repository

The easiest way to create a new feature repository is to use the feast init command:

feast init

Creating a new Feast repository in /<...>/tiny_pika.
feast init -t snowflake
Snowflake Deployment URL: ...
Snowflake User Name: ...
Snowflake Password: ...
Snowflake Role Name: ...
Snowflake Warehouse Name: ...
Snowflake Database Name: ...

Creating a new Feast repository in /<...>/tiny_pika.
feast init -t gcp

Creating a new Feast repository in /<...>/tiny_pika.
feast init -t aws
AWS Region (e.g. us-west-2): ...
Redshift Cluster ID: ...
Redshift Database Name: ...
Redshift User Name: ...
Redshift S3 Staging Location (s3://*): ...
Redshift IAM Role for S3 (arn:aws:iam::*:role/*): ...
Should I upload example data to Redshift (overwriting 'feast_driver_hourly_stats' table)? (Y/n):

Creating a new Feast repository in /<...>/tiny_pika.

The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:

$ tree
.
└── tiny_pika
    ├── data
    │   └── driver_stats.parquet
    ├── example.py
    └── feature_store.yaml

1 directory, 3 files

Enter the directory:

# Replace "tiny_pika" with your auto-generated dir name
cd tiny_pika

You can now use this feature repository for development. You can try the following:

  • Run feast apply to apply these definitions to Feast.

  • Edit the example feature definitions in example.py and run feast apply again to change feature definitions.

  • Initialize a git repository in the same directory and check the feature repository into version control.

Load data into the online store

Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.

Materializing features

1. Register feature views

Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.

2.a Materialize

The materialize command allows users to materialize features over a specific historical time range into the online store.

feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00

The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.

It is also possible to materialize for specific feature views by using the -v / --views argument.

feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00 \
--views driver_hourly_stats

The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
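A minimal Airflow sketch of such a scheduled run (the DAG id, schedule, and interval templating are illustrative assumptions, not part of Feast):

from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator

# Each run materializes the interval covered by the schedule.
with DAG(
    dag_id="feast_materialize",
    start_date=datetime(2021, 4, 7),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="materialize",
        # feast materialize is stateless, so Airflow supplies the time range;
        # the timestamp formatting may need adjusting for your Feast version.
        bash_command="feast materialize {{ data_interval_start }} {{ data_interval_end }}",
    )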

2.b Materialize Incremental (Alternative)

For simplicity, Feast also provides a materialize-incremental command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.

The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).

feast materialize-incremental 2021-04-08T00:00:00

The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.

Unlike materialize, materialize-incremental automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.

Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e, different feature views may have different periods that have been materialized into the online store.
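The same operation is also available from the Python SDK, which can be convenient inside an existing pipeline (a sketch, assuming a feature repo in the current directory):

from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")
# Loads only the data that has arrived since the previous run, up to the given end date.
store.materialize_incremental(end_date=datetime.now())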

Driver stats on Snowflake

Initial demonstration of Snowflake as an offline+online store with Feast, using the Snowflake demo template.

In the steps below, we will set up a sample Feast project that leverages Snowflake as an offline store + materialization engine + online store.

Starting with data in a Snowflake table, we will register that table to the feature store and define features associated with the columns in that table. From there, we will generate historical training data based on those feature definitions and then materialize the latest feature values into the online store. Lastly, we will retrieve the materialized feature values.

Our template will generate new data containing driver statistics. From there, we will show you code snippets that will call to the offline store for generating training datasets, and then the code for calling the online store to serve you the latest feature values to serve models in production.

Snowflake Offline Store Example

Install feast-snowflake

pip install 'feast[snowflake]'

Get a Snowflake Trial Account (Optional)

Create a feature repository

feast init -t snowflake {feature_repo_name}
Snowflake Deployment URL (exclude .snowflakecomputing.com):
Snowflake User Name::
Snowflake Password::
Snowflake Role Name (Case Sensitive)::
Snowflake Warehouse Name (Case Sensitive)::
Snowflake Database Name (Case Sensitive)::
Should I upload example data to Snowflake (overwrite table)? [Y/n]: Y
cd {feature_repo_name}

The following files will automatically be created in your project folder:

  • feature_store.yaml -- This is your main configuration file

  • driver_repo.py -- This is your main feature definition file

  • test.py -- This is a file to test your feature store configuration

Inspect feature_store.yaml

Here you will see the information that you entered. This template will use Snowflake as the offline store, materialization engine, and online store. The main thing to remember is that, by default, Snowflake objects have ALL CAPS names unless lower case was specified.

feature_store.yaml
project: ...
registry: ...
provider: local
offline_store:
    type: snowflake.offline
    account: SNOWFLAKE_DEPLOYMENT_URL #drop .snowflakecomputing.com
    user: USERNAME
    password: PASSWORD
    role: ROLE_NAME #case sensitive
    warehouse: WAREHOUSE_NAME #case sensitive
    database: DATABASE_NAME #case cap sensitive
batch_engine:
    type: snowflake.engine
    account: SNOWFLAKE_DEPLOYMENT_URL #drop .snowflakecomputing.com
    user: USERNAME
    password: PASSWORD
    role: ROLE_NAME #case sensitive
    warehouse: WAREHOUSE_NAME #case sensitive
    database: DATABASE_NAME #case cap sensitive
online_store:
    type: snowflake.online
    account: SNOWFLAKE_DEPLOYMENT_URL #drop .snowflakecomputing.com
    user: USERNAME
    password: PASSWORD
    role: ROLE_NAME #case sensitive
    warehouse: WAREHOUSE_NAME #case sensitive
    database: DATABASE_NAME #case cap sensitive

Run our test python script test.py

python test.py

What we did in test.py

Initialize our Feature Store

test.py
from datetime import datetime, timedelta

import pandas as pd
from driver_repo import driver, driver_stats_fv

from feast import FeatureStore

fs = FeatureStore(repo_path=".")

fs.apply([driver, driver_stats_fv])

Create a dummy training dataframe, then call our offline store to add additional columns

test.py
entity_df = pd.DataFrame(
    {
        "event_timestamp": [
            pd.Timestamp(dt, unit="ms", tz="UTC").round("ms")
            for dt in pd.date_range(
                start=datetime.now() - timedelta(days=3),
                end=datetime.now(),
                periods=3,
            )
        ],
        "driver_id": [1001, 1002, 1003],
    }
)

features = ["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"]

training_df = fs.get_historical_features(
    features=features, entity_df=entity_df
).to_df()

Materialize the latest feature values into our online store

test.py
fs.materialize_incremental(end_date=datetime.now())

Retrieve the latest values from our online store based on our entity key

test.py
online_features = fs.get_online_features(
    features=features,
    entity_rows=[
      # {join_key: entity_value}
      {"driver_id": 1001},
      {"driver_id": 1002}
    ],
).to_dict()

Customizing Feast

Feast is highly pluggable and configurable:

  • One can use existing plugins (offline store, online store, batch materialization engine, providers) and configure those using the built in options. See reference documentation for details.

  • The other way to customize Feast is to build your own custom components, and then point Feast to delegate to them.

Below are some guides on how to add new custom components:

Adding a custom batch materialization engine

Overview

Feast batch materialization operations (materialize and materialize-incremental) execute through a BatchMaterializationEngine.

Custom batch materialization engines allow Feast users to extend Feast to customize the materialization process. Examples include:

  • Setting up custom materialization-specific infrastructure during feast apply (e.g. setting up Spark clusters or Lambda Functions)

  • Launching custom batch ingestion (materialization) jobs (Spark, Beam, AWS Lambda)

  • Tearing down custom materialization-specific infrastructure during feast teardown (e.g. tearing down Spark clusters, or deleting Lambda Functions)

Guide

The fastest way to add custom logic to Feast is to extend an existing materialization engine. The most generic engine is the LocalMaterializationEngine, which contains no cloud-specific logic. The guide that follows will extend the LocalMaterializationEngine with operations that print text to the console. It is up to you as a developer to add your custom code to the engine methods, but the guide below will provide the necessary scaffolding to get you started.

Step 1: Define an Engine class

The first step is to define a custom materialization engine class. We've created the MyCustomEngine below.

from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

from feast.entity import Entity
from feast.feature_view import FeatureView
from feast.batch_feature_view import BatchFeatureView
from feast.stream_feature_view import StreamFeatureView
from feast.infra.materialization import LocalMaterializationEngine, LocalMaterializationJob, MaterializationTask
from feast.infra.offline_stores.offline_store import OfflineStore
from feast.infra.online_stores.online_store import OnlineStore
from feast.repo_config import RepoConfig


class MyCustomEngine(LocalMaterializationEngine):
    def __init__(
            self,
            *,
            repo_config: RepoConfig,
            offline_store: OfflineStore,
            online_store: OnlineStore,
            **kwargs,
    ):
        super().__init__(
            repo_config=repo_config,
            offline_store=offline_store,
            online_store=online_store,
            **kwargs,
        )

    def update(
            self,
            project: str,
            views_to_delete: Sequence[
                Union[BatchFeatureView, StreamFeatureView, FeatureView]
            ],
            views_to_keep: Sequence[
                Union[BatchFeatureView, StreamFeatureView, FeatureView]
            ],
            entities_to_delete: Sequence[Entity],
            entities_to_keep: Sequence[Entity],
    ):
        print("Creating new infrastructure is easy here!")
        pass

    def materialize(
        self, registry, tasks: List[MaterializationTask]
    ) -> List[LocalMaterializationJob]:
        print("Launching custom batch jobs or multithreading things is pretty easy...")
        return [
            self._materialize_one(
                registry,
                task.feature_view,
                task.start_time,
                task.end_time,
                task.project,
                task.tqdm_builder,
            )
            for task in tasks
        ]

Notice how in the above engine we have only overwritten two of the methods on the LocalMaterializationEngine, namely update and materialize. These two methods are convenient to replace if you are planning to launch custom batch jobs.

Step 2: Configuring Feast to use the engine

project: repo
registry: registry.db
batch_engine: feast_custom_engine.MyCustomEngine
online_store:
    type: sqlite
    path: online_store.db
offline_store:
    type: file

Notice how the batch_engine field above points to the module and class where your engine can be found.

Step 3: Using the engine

Now you should be able to use your engine by running a Feast command:

feast apply
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
Creating new infrastructure is easy here!

It may also be necessary to add the module root path to your PYTHONPATH as follows:

PYTHONPATH=$PYTHONPATH:/home/my_user/my_custom_engine feast apply

That's it. You should now have a fully functional custom engine!

Upgrading for Feast 0.20+

Overview

Starting with Feast 0.20, the APIs of many core objects (e.g. feature views and entities) have been changed. For example, many parameters have been renamed. These changes were made in a backwards-compatible fashion; existing Feast repositories will continue to work until Feast 0.23, without any changes required. However, Feast 0.24 will fully deprecate all of the old parameters, so in order to use Feast 0.24+ users must modify their Feast repositories.

There are currently deprecation warnings that indicate to users exactly how to modify their repos. In order to make the process somewhat easier, Feast 0.23 also introduces a new CLI command, repo-upgrade, that will partially automate the process of upgrading Feast repositories.

The repo-upgrade command is specifically meant for upgrading Feast repositories that were initially created in versions 0.23 and below to be compatible with versions 0.24 and above. It is not intended to work for any future upgrades.

Usage

At the root of a feature repo, you can run feast repo-upgrade. By default, the CLI only echoes the changes it's planning on making, and does not modify any files in place. If the changes look reasonable, you can specify the --write flag to have the changes written out to disk.

An example:

$ feast repo-upgrade --write
--- /Users/achal/feast/prompt_dory/example.py
+++ /Users/achal/feast/prompt_dory/example.py
@@ -13,7 +13,6 @@
     path="/Users/achal/feast/prompt_dory/data/driver_stats.parquet",
     event_timestamp_column="event_timestamp",
     created_timestamp_column="created",
-    date_partition_column="created"
 )

 # Define an entity for the driver. You can think of entity as a primary key used to
--- /Users/achal/feast/prompt_dory/example.py
+++ /Users/achal/feast/prompt_dory/example.py
@@ -3,7 +3,7 @@
 from google.protobuf.duration_pb2 import Duration
 import pandas as pd

-from feast import Entity, Feature, FeatureView, FileSource, ValueType, FeatureService, OnDemandFeatureView
+from feast import Entity, FeatureView, FileSource, ValueType, FeatureService, OnDemandFeatureView

 # Read data from parquet files. Parquet is convenient for local development mode. For
 # production, you can use your favorite DWH, such as BigQuery. See Feast documentation
--- /Users/achal/feast/prompt_dory/example.py
+++ /Users/achal/feast/prompt_dory/example.py
@@ -4,6 +4,7 @@
 import pandas as pd

 from feast import Entity, Feature, FeatureView, FileSource, ValueType, FeatureService, OnDemandFeatureView
+from feast import Field

 # Read data from parquet files. Parquet is convenient for local development mode. For
 # production, you can use your favorite DWH, such as BigQuery. See Feast documentation
--- /Users/achal/feast/prompt_dory/example.py
+++ /Users/achal/feast/prompt_dory/example.py
@@ -28,9 +29,9 @@
     entities=[driver_id],
     ttl=Duration(seconds=86400 * 365),
     features=[
-        Feature(name="conv_rate", dtype=ValueType.FLOAT),
-        Feature(name="acc_rate", dtype=ValueType.FLOAT),
-        Feature(name="avg_daily_trips", dtype=ValueType.INT64),
+        Field(name="conv_rate", dtype=ValueType.FLOAT),
+        Field(name="acc_rate", dtype=ValueType.FLOAT),
+        Field(name="avg_daily_trips", dtype=ValueType.INT64),
     ],
     online=True,
     batch_source=driver_hourly_stats,

To write these changes out, you can run the same command with the --write flag:

$ feast repo-upgrade  --write

You should see the same output, but also see the changes reflected in your feature repo on disk.

Adding a new offline store

Overview

In this guide, we will show you how to extend the existing File offline store and use in a feature repo. While we will be implementing a specific store, this guide should be representative for adding support for any new offline store.

The process for using a custom offline store consists of 8 steps:

  1. Defining an OfflineStore class.

  2. Defining an OfflineStoreConfig class.

  3. Defining a RetrievalJob class for this offline store.

  4. Defining a DataSource class for the offline store

  5. Referencing the OfflineStore in a feature repo's feature_store.yaml file.

  6. Testing the OfflineStore class.

  7. Updating dependencies.

  8. Adding documentation.

1. Defining an OfflineStore class

OfflineStore class names must end with the OfflineStore suffix!

Contrib offline stores

New offline stores go in sdk/python/feast/infra/offline_stores/contrib/.

What is a contrib plugin?

  • Not guaranteed to implement all interface methods

  • Not guaranteed to be stable.

  • Should have warnings for users to indicate this is a contrib plugin that is not maintained by the core Feast maintainers.

How do I make a contrib plugin an "official" plugin?

To move an offline store plugin out of contrib, you need:

  • GitHub Actions (i.e. make test-python-integration) are set up to run all tests against the offline store and pass.

  • At least two contributors own the plugin (ideally tracked in our OWNERS / CODEOWNERS file).

Define the offline store class

The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.

To fully implement the interface for the offline store, you will need to implement these methods:

  • pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.

  • get_historical_features is invoked when reading values from the offline store using the FeatureStore.get_historical_features() method. Typically, this method is used to retrieve features when training ML models.

  • (optional) pull_all_from_table_or_query is a method that pulls all the data from an offline store from a specified start date to a specified end date. This method is only used for SavedDatasets as part of data quality monitoring validation.

  • (optional) write_logged_features is a method that takes a pyarrow table or a path that points to a parquet file and writes the data to a defined source defined by LoggingSource and LoggingConfig. This method is only used internally for SavedDatasets.

feast_custom_offline_store/file.py
    # Only prints out runtime warnings once.
    warnings.simplefilter("once", RuntimeWarning)

    def get_historical_features(self,
                                config: RepoConfig,
                                feature_views: List[FeatureView],
                                feature_refs: List[str],
                                entity_df: Union[pd.DataFrame, str],
                                registry: Registry, project: str,
                                full_feature_names: bool = False) -> RetrievalJob:
        """ Perform point-in-time correct join of features onto an entity dataframe(entity key and timestamp). More details about how this should work at https://docs.feast.dev/v/v0.6-branch/user-guide/feature-retrieval#3.-historical-feature-retrieval.
        print("Getting historical features from my offline store")."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    def pull_latest_from_table_or_query(self,
                                        config: RepoConfig,
                                        data_source: DataSource,
                                        join_key_columns: List[str],
                                        feature_name_columns: List[str],
                                        timestamp_field: str,
                                        created_timestamp_column: Optional[str],
                                        start_date: datetime,
                                        end_date: datetime) -> RetrievalJob:
        """ Pulls data from the offline store for use in materialization."""
        print("Pulling latest features from my offline store")
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    @staticmethod
    def pull_all_from_table_or_query(
        config: RepoConfig,
        data_source: DataSource,
        join_key_columns: List[str],
        feature_name_columns: List[str],
        timestamp_field: str,
        start_date: datetime,
        end_date: datetime,
    ) -> RetrievalJob:
        """ Optional method that returns a Retrieval Job for all join key columns, feature name columns, and the event timestamp columns that occur between the start_date and end_date."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    @staticmethod
    def write_logged_features(
        config: RepoConfig,
        data: Union[pyarrow.Table, Path],
        source: LoggingSource,
        logging_config: LoggingConfig,
        registry: BaseRegistry,
    ):
        """ Optional method to have Feast support logging your online features."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

    @staticmethod
    def offline_write_batch(
        config: RepoConfig,
        feature_view: FeatureView,
        table: pyarrow.Table,
        progress: Optional[Callable[[int], Any]],
    ):
        """ Optional method to have Feast support the offline push api for your offline store."""
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        # Implementation here.
        pass

1.1 Type Mapping

Most offline stores will have to perform some custom mapping of offline store datatypes to feast value types.

  • The functions to implement here are source_datatype_to_feast_value_type and get_column_names_and_types in your DataSource class.

  • source_datatype_to_feast_value_type is used to convert your DataSource's datatypes to feast value types.

  • get_column_names_and_types retrieves the column names and corresponding datasource types.

Add any helper functions for type conversion to sdk/python/feast/type_map.py.

  • Be sure to implement correct type mapping so that Feast can process your feature columns without casting them incorrectly, which could cause loss of information or incorrect data.
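
To make this concrete, a type-conversion helper for a hypothetical source might look like the sketch below. The helper name and the source type names are illustrative, not part of Feast; only ValueType is a Feast API. Your DataSource's source_datatype_to_feast_value_type would typically hand this converter to Feast, and get_column_names_and_types would supply the (column name, source type name) pairs that flow through it.

from typing import Dict

from feast import ValueType

# Hypothetical mapping from the source's native type names to Feast value types.
_SOURCE_TYPE_TO_FEAST_TYPE: Dict[str, ValueType] = {
    "bigint": ValueType.INT64,
    "double": ValueType.DOUBLE,
    "varchar": ValueType.STRING,
    "boolean": ValueType.BOOL,
    "timestamp": ValueType.UNIX_TIMESTAMP,
}


def custom_source_type_to_feast_value_type(type_name: str) -> ValueType:
    """Convert a source column type name into a Feast ValueType (helper for type_map.py)."""
    return _SOURCE_TYPE_TO_FEAST_TYPE.get(type_name.lower(), ValueType.UNKNOWN)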

2. Defining an OfflineStoreConfig class

Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.

Additionally, the name of the config class must be the same as the OfflineStore class, with the Config suffix.

An example of the config class for the custom file offline store :

feast_custom_offline_store/file.py
class CustomFileOfflineStoreConfig(FeastConfigBaseModel):
    """ Custom offline store config for local (file-based) store """

    type: Literal["feast_custom_offline_store.file.CustomFileOfflineStore"] \
        = "feast_custom_offline_store.file.CustomFileOfflineStore"

    uri: str # URI for your offline store(in this case it would be a path)

This configuration can be specified in the feature_store.yaml as follows:

feature_repo/feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: feast_custom_offline_store.file.CustomFileOfflineStore
    uri: <File URI>
online_store:
    path: data/online_store.db

This configuration information is available to the methods of the OfflineStore via the config: RepoConfig parameter, which is passed into the methods of the OfflineStore interface, specifically at the config.offline_store field of the config parameter. These fields in the feature_store.yaml should map directly to your OfflineStoreConfig class that is detailed above in Section 2.

feast_custom_offline_store/file.py
    def get_historical_features(self,
                                config: RepoConfig,
                                feature_views: List[FeatureView],
                                feature_refs: List[str],
                                entity_df: Union[pd.DataFrame, str],
                                registry: Registry, project: str,
                                full_feature_names: bool = False) -> RetrievalJob:
        warnings.warn(
            "This offline store is an experimental feature in alpha development. "
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        offline_store_config = config.offline_store
        assert isinstance(offline_store_config, CustomFileOfflineStoreConfig)
        store_type = offline_store_config.type

3. Defining a RetrievalJob class

The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob instance, which represents the execution of the actual query against the underlying store.

Custom offline stores may need to implement their own instances of the RetrievalJob interface.

The RetrievalJob interface exposes two methods - to_df and to_arrow. The expectation is for the retrieval job to be able to return the rows read from the offline store as a pandas DataFrame, or as an Arrow table, respectively.

feast_custom_offline_store/file.py
class CustomFileRetrievalJob(RetrievalJob):
    def __init__(self, evaluation_function: Callable):
        """Initialize a lazy historical retrieval job"""

        # The evaluation function executes a stored procedure to compute a historical retrieval.
        self.evaluation_function = evaluation_function

    def to_df(self):
        # Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.
        print("Getting a pandas DataFrame from a File is easy!")
        df = self.evaluation_function()
        return df

    def to_arrow(self):
        # Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.
        print("Getting a pandas DataFrame from a File is easy!")
        df = self.evaluation_function()
        return pyarrow.Table.from_pandas(df)

    def to_remote_storage(self):
        # Optional method to write to an offline storage location to support scalable batch materialization.
        pass

4. Defining a DataSource class for the offline store

The data source class should implement two methods - from_proto, and to_proto.

For custom offline stores that are not being implemented in the main feature repo, the custom_options field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto method and reading the value back from bytes in the from_proto method.

feast_custom_offline_store/file.py
class CustomFileDataSource(FileSource):
    """Custom data source class for local files"""
    def __init__(
        self,
        timestamp_field: Optional[str] = "",
        path: Optional[str] = None,
        field_mapping: Optional[Dict[str, str]] = None,
        created_timestamp_column: Optional[str] = "",
        date_partition_column: Optional[str] = "",
    ):
            "Some functionality may still be unstable so functionality can change in the future.",
            RuntimeWarning,
        )
        super(CustomFileDataSource, self).__init__(
            timestamp_field=timestamp_field,
            created_timestamp_column,
            field_mapping,
            date_partition_column,
        )
        self._path = path


    @staticmethod
    def from_proto(data_source: DataSourceProto):
        custom_source_options = str(
            data_source.custom_options.configuration, encoding="utf8"
        )
        path = json.loads(custom_source_options)["path"]
        return CustomFileDataSource(
            field_mapping=dict(data_source.field_mapping),
            path=path,
            timestamp_field=data_source.timestamp_field,
            created_timestamp_column=data_source.created_timestamp_column,
            date_partition_column=data_source.date_partition_column,
        )

    def to_proto(self) -> DataSourceProto:
        config_json = json.dumps({"path": self.path})
        data_source_proto = DataSourceProto(
            type=DataSourceProto.CUSTOM_SOURCE,
            custom_options=DataSourceProto.CustomSourceOptions(
                configuration=bytes(config_json, encoding="utf8")
            ),
        )

        data_source_proto.timestamp_field = self.timestamp_field
        data_source_proto.created_timestamp_column = self.created_timestamp_column
        data_source_proto.date_partition_column = self.date_partition_column

        return data_source_proto

5. Using the custom offline store

After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the offline_store field. The value specified should be the fully qualified class name of the OfflineStore.

As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

To use our custom file offline store, we can use the following feature_store.yaml:

feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
offline_store:
    # Make sure to specify the type as the fully qualified path that Feast can import.
    type: feast_custom_offline_store.file.CustomFileOfflineStore

If additional configuration for the offline store is not required, then we can omit the other fields and only specify the type of the offline store class as the value for the offline_store.

feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
offline_store: feast_custom_offline_store.file.CustomFileOfflineStore

Finally, the custom data source class can be used in the feature repo to define a data source, and referred to in a feature view definition.

feature_repo/repo.py
driver_hourly_stats = CustomFileDataSource(
    path="feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)


driver_hourly_stats_view = FeatureView(
    source=driver_hourly_stats,
    ...
)

6. Testing the OfflineStore class

Integrating with the integration test suite and unit test suite.

Even if you have created the OfflineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo.

  1. In order to test against the test suite, you need to create a custom DataSourceCreator that implements our testing infrastructure methods, create_data_source and optionally, created_saved_dataset_destination.

    • create_data_source should create a datasource based on the dataframe passed in. It may be implemented by uploading the contents of the dataframe into the offline store and returning a datasource object pointing to that location. See BigQueryDataSourceCreator for an implementation of a data source creator.

    • created_saved_dataset_destination is invoked when users need to save the dataset for use in data validation. This functionality is still in alpha and is optional.

  2. Make sure that your offline store doesn't break any unit tests first by running:

    make test-python
  3. Next, set up your offline store to run the universal integration tests. These are integration tests specifically intended to test offline and online stores against Feast API functionality, to ensure that the Feast APIs work with your offline store.

    • Feast parametrizes integration tests using the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py which stores different offline store classes for testing.

    • To overwrite the default configurations to use your own offline store, you can simply create your own file that contains a FULL_REPO_CONFIGS dictionary, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE to point to that file. The module should add new IntegrationTestRepoConfig classes to the AVAILABLE_OFFLINE_STORES by defining an offline store that you would like Feast to test with.

    A sample FULL_REPO_CONFIGS_MODULE looks something like this:

    # Should go in sdk/python/feast/infra/offline_stores/contrib/postgres_repo_configuration.py
    from feast.infra.offline_stores.contrib.postgres_offline_store.tests.data_source import (
        PostgreSQLDataSourceCreator,
    )
    
    AVAILABLE_OFFLINE_STORES = [("local", PostgreSQLDataSourceCreator)]
  4. You should then point the FULL_REPO_CONFIGS_MODULE environment variable at your module and run the integration tests against your offline store. In the example repo, the file that overwrites FULL_REPO_CONFIGS is feast_custom_offline_store/feast_tests.py, so you would run:

    export FULL_REPO_CONFIGS_MODULE='feast_custom_offline_store.feast_tests'
    make test-python-universal

    If the integration tests fail, this indicates that there is a mistake in the implementation of this offline store!

  5. Remember to add your datasource to repo_config.py similar to how we added spark, trino, etc, to the dictionary OFFLINE_STORE_CLASS_FOR_TYPE. This will allow Feast to load your class from the feature_store.yaml.

  6. Finally, add a Makefile target to the Makefile to run your datastore specific tests by setting the FULL_REPO_CONFIGS_MODULE and PYTEST_PLUGINS environment variable. The PYTEST_PLUGINS environment variable allows pytest to load in the DataSourceCreator for your datasource. You can remove certain tests that are not relevant or still do not work for your datastore using the -k option.

Makefile
test-python-universal-spark:
	PYTHONPATH='.' \
	FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.offline_stores.contrib.spark_repo_configuration \
	PYTEST_PLUGINS=feast.infra.offline_stores.contrib.spark_offline_store.tests \
 	FEAST_USAGE=False IS_TEST=True \
 	python -m pytest -n 8 --integration \
 	 	-k "not test_historical_retrieval_fails_on_validation and \
			not test_historical_retrieval_with_validation and \
			not test_historical_features_persisting and \
			not test_historical_retrieval_fails_on_validation and \
			not test_universal_cli and \
			not test_go_feature_server and \
			not test_feature_logging and \
			not test_reorder_columns and \
			not test_logged_features_validation and \
			not test_lambda_materialization_consistency and \
			not test_offline_write and \
			not test_push_features_to_offline_store.py and \
			not gcs_registry and \
			not s3_registry and \
			not test_universal_types" \
 	 sdk/python/tests

7. Dependencies

Add any dependencies for your offline store to our sdk/python/setup.py under a new <OFFLINE_STORE>__REQUIRED list, and add it to the setup script so that users can install the necessary Python packages when your offline store is needed. These packages should be defined as extras so that they are not installed by default. You will need to regenerate our requirements files. To do this, create separate pyenv environments for Python 3.8, 3.9, and 3.10. In each environment, run the following commands:

export PYTHON=<version>
make lock-python-ci-dependencies
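
As a rough sketch of the setup.py change (the list name and package below are placeholders, and the actual layout of sdk/python/setup.py may differ):

# In sdk/python/setup.py (illustrative only):
MYSTORE_REQUIRED = [
    "mystore-client>=1.0.0,<2",  # hypothetical client library for the new offline store
]

# ... and inside the setup() call, expose it as an optional extra so users opt in:
# extras_require={
#     ...,
#     "mystore": MYSTORE_REQUIRED,
# },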

8. Add Documentation

Remember to add documentation for your offline store.

  1. Add a new markdown file to docs/reference/offline-stores/ and docs/reference/data-sources/. Use these files to document your offline store functionality similar to how the other offline stores are documented.

  2. You should also add a reference in docs/reference/data-sources/README.md and docs/SUMMARY.md to these markdown files.

NOTE: Be sure to document the following things about your offline store:

  • How to create the data source and what configuration is needed in the feature_store.yaml file in order to create it.

  • Make sure to flag that the datasource is in alpha development.

  • Add some documentation on what the data model is for the specific offline store for more clarity.

  • Finally, generate the python code docs by running:

make build-sphinx

Type System

Motivation

Examples

Feature inference

During feast apply, Feast runs schema inference on the data sources underlying feature views. For example, if the schema parameter is not specified for a feature view, Feast will examine the schema of the underlying data source to determine the event timestamp column, feature columns, and entity columns. Each of these columns must be associated with a Feast type, which requires conversion from the data source type system to the Feast type system.

  • The feature inference logic calls _infer_features_and_entities.

  • _infer_features_and_entities calls source_datatype_to_feast_value_type.

  • source_datatype_to_feast_value_type calls the appropriate method in type_map.py. For example, if a SnowflakeSource is being examined, snowflake_python_type_to_feast_value_type from type_map.py will be called.

Materialization

  • The local materialization engine first pulls the latest historical features and converts them to pyarrow.

  • Then it calls _convert_arrow_to_proto to convert the pyarrow table to proto format.

  • This calls python_values_to_proto_values in type_map.py to perform the type conversion.

Historical feature retrieval

The Feast type system is typically not necessary when retrieving historical features. A call to get_historical_features will return a RetrievalJob object, which allows the user to export the results to one of several possible locations: a Pandas dataframe, a pyarrow table, a data lake (e.g. S3 or GCS), or the offline store (e.g. a Snowflake table). In all of these cases, the type conversion is handled natively by the offline store. For example, a BigQuery query exposes a to_dataframe method that will automatically convert the result to a dataframe, without requiring any conversions within Feast.
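
For example, a retrieval call can be exported as follows (a sketch; the entity dataframe and feature reference below are illustrative, borrowed from the quickstart-style driver example):

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2022-05-01", "2022-05-01"]),
    }
)

job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)

training_df = job.to_df()     # pandas DataFrame; type conversion handled natively by the offline store
arrow_table = job.to_arrow()  # pyarrow Table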

Feature serving

Adding or reusing tests

Overview

This guide will go over:

  1. how Feast tests are set up

  2. how to extend the test suite to test new functionality

  3. how to use the existing test suite to test a new custom offline / online store

Test suite overview

Unit tests are contained in sdk/python/tests/unit. Integration tests are contained in sdk/python/tests/integration. Let's inspect the structure of sdk/python/tests/integration:

  • feature_repos has setup files for most tests in the test suite.

  • The tests are organized by which Feast component(s) they test.

Structure of the test suite

Universal feature repo

The universal feature repo refers to a set of fixtures (e.g. environment and universal_data_sources) that can be parametrized to cover various combinations of offline stores, online stores, and providers. This allows tests to run against all these various combinations without requiring excess code. The universal feature repo is constructed by fixtures in conftest.py with help from the various files in feature_repos.

Integration vs. unit tests

Tests in Feast are split into integration and unit tests. If a test requires external resources (e.g. cloud resources on GCP or AWS), it is an integration test. If a test can be run purely locally (where locally includes Docker resources), it is a unit test.

  • Integration tests test non-local Feast behavior. For example, tests that require reading data from BigQuery or materializing data to DynamoDB are integration tests. Integration tests also tend to involve more complex Feast functionality.

  • Unit tests test local Feast behavior. For example, tests that only require registering feature views are unit tests. Unit tests tend to only involve simple Feast functionality.

Main types of tests

Integration tests

  1. E2E tests

    • E2E tests test end-to-end functionality of Feast over the various codepaths (initialize a feature store, apply, and materialize).

    • The main codepaths include:

      • basic e2e tests for offline stores

        • test_universal_e2e.py

      • go feature server

        • test_go_feature_server.py

      • python http server

        • test_python_feature_server.py

      • usage tracking

        • test_usage_e2e.py

      • data quality monitoring feature validation

        • test_validation.py

  2. Offline and Online Store Tests

    • Offline and online store tests mainly test for the offline and online retrieval functionality.

    • The various specific functionalities that are tested include:

      • push API tests

        • test_push_features_to_offline_store.py

        • test_push_features_to_online_store.py

        • test_offline_write.py

      • historical retrieval tests

        • test_universal_historical_retrieval.py

      • online retrieval tests

        • test_universal_online.py

      • data quality monitoring feature logging tests

        • test_feature_logging.py

      • online store tests

        • test_universal_online.py

  3. Registration Tests

    • The registration folder contains all of the registry tests and some universal cli tests. This includes:

      • CLI Apply and Materialize tests tested against the universal test suite

      • Data type inference tests

      • Registry tests

  4. Miscellaneous Tests

    • AWS Lambda Materialization Tests (Currently do not work)

      • test_lambda.py

Unit tests

  1. Registry Diff Tests

    • These are tests for the infrastructure and registry diff functionality that Feast uses to determine if changes to the registry or infrastructure are needed.

  2. Local CLI Tests and Local Feast Tests

    • These tests test all of the cli commands against the local file offline store.

  3. Infrastructure Unit Tests

    • DynamoDB tests with dynamo mocked out

    • Repository configuration tests

    • Schema inference unit tests

    • Key serialization tests

    • Basic provider unit tests

  4. Feature Store Validation Tests

    • These tests mainly cover class-level validation like hashing tests, protobuf and class serialization, and error and warning handling.

      • Data source unit tests

      • Feature service unit tests

      • Feature service, feature view, and feature validation tests

      • Protobuf/json tests for Feast ValueTypes

      • Serialization tests

        • Type mapping

        • Feast types

      • Feast usage tracking unit tests

Docstring tests

Docstring tests are primarily smoke tests to make sure imports and setup functions can be executed without errors.

Understanding the test suite with an example test

Example test

Let's look at a sample test using the universal repo:
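
The skeleton below is a hedged sketch of what such a test looks like; the assertion is a placeholder, and the exact attributes exposed by the fixtures (e.g. environment.feature_store and the tuple returned by universal_data_sources) are assumptions for illustration.

import pytest


@pytest.mark.integration
@pytest.mark.universal_offline_stores
@pytest.mark.parametrize("full_feature_names", [True, False])
def test_something_universal(environment, universal_data_sources, full_feature_names):
    # The environment fixture wires up a FeatureStore against the parametrized
    # offline/online store combination; universal_data_sources provides the
    # pre-defined entities, datasets, and data sources.
    store = environment.feature_store
    entities, datasets, data_sources = universal_data_sources
    assert store is not None  # placeholder for real retrieval assertions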

  • The key fixtures are the environment and universal_data_sources fixtures, which are defined in the feature_repos directories and the conftest.py file. This by default pulls in a standard dataset with driver and customer entities (that we have pre-defined), certain feature views, and feature values.

    • The environment fixture sets up a feature store, parametrized by the provider and the online/offline store. It allows the test to query against that feature store without needing to worry about the underlying implementation or any setup that may be involved in creating instances of these datastores.

    • Each fixture creates a different integration test with its own IntegrationTestRepoConfig, which pytest uses to generate a unique test for each environment that needs to be tested.

  • Feast tests also use a variety of markers:

    • The @pytest.mark.integration marker is used to designate integration tests which will cause the test to be run when you call make test-python-integration.

    • The @pytest.mark.universal_offline_stores marker will parametrize the test on all of the universal offline stores including file, redshift, bigquery and snowflake.

    • The full_feature_names parametrization defines whether or not the test should reference features as their full feature name (fully qualified path) or just the feature name itself.

Writing a new test or reusing existing tests

To add a new test to an existing test file

  • Use the same function signatures as an existing test (e.g. use environment and universal_data_sources as an argument) to include the relevant test fixtures.

  • If possible, expand an individual test instead of writing a new test, due to the cost of starting up offline / online stores.

  • Use the universal_offline_stores and universal_online_store markers to parametrize the test against different offline store and online store combinations. You can also designate specific online and offline stores to test by using the only parameter on the marker.

To test a new offline / online store from a plugin repo

  • Install Feast in editable mode with pip install -e .

  • The core tests for offline / online store behavior are parametrized by the FULL_REPO_CONFIGS variable defined in feature_repos/repo_configuration.py. To overwrite this variable without modifying the Feast repo, create your own file that contains a FULL_REPO_CONFIGS (which will require adding a new IntegrationTestRepoConfig or two) and set the environment variable FULL_REPO_CONFIGS_MODULE to point to that file. Then the core offline / online store tests can be run with make test-python-universal.

What are some important things to keep in mind when adding a new offline / online store?

Type mapping/Inference

Many problems arise when implementing your data store's type conversion to interface with Feast datatypes.

  1. You will need to correctly update inference.py so that Feast can infer your datasource schemas

  2. You also need to update type_map.py so that Feast knows how to convert your datastores types to Feast-recognized types in feast/types.py.

Historical and online retrieval

The most important functionality in Feast is historical and online retrieval. Most of the e2e and universal integration tests test this functionality in some way. Making sure this functionality works also indirectly asserts that reading and writing from your datastore works as intended.

To include a new offline / online store in the main Feast repo

  • Extend data_source_creator.py for your offline store.

  • In repo_configuration.py add a new IntegrationTestRepoConfig or two (depending on how many online stores you want to test).

    • Generally, you should only need to test against sqlite. However, if you need to test against a production online store, then you can also test against Redis or DynamoDB.

  • Run the full test suite with make test-python-integration.

Including a new offline / online store in the main Feast repo from external plugins with community maintainers.

  • This folder is for plugins that are officially maintained with community owners. Place the APIs in feast/infra/offline_stores/contrib/.

  • Extend data_source_creator.py for your offline store and implement the required APIs.

  • In contrib_repo_configuration.py add a new IntegrationTestRepoConfig (depending on how many online stores you want to test).

  • Run the test suite on the contrib test suite with make test-python-contrib-universal.

To include a new online store

  • In repo_configuration.py add a new config that maps to a serialized version of configuration you need in feature_store.yaml to setup the online store.

  • In repo_configuration.py, add new IntegrationTestRepoConfig for online stores you want to test.

  • Run the full test suite with make test-python-integration

To use custom data in a new test

  • Check test_universal_types.py for an example of how to do this.

Running your own Redis cluster for testing

  • Install Redis on your computer. If you are a Mac user, you should be able to run brew install redis.

    • Running redis-server --help and redis-cli --help should show corresponding help menus.

    • Run ./infra/scripts/redis-cluster.sh start then ./infra/scripts/redis-cluster.sh create to start the Redis cluster locally.

  • You should be able to run the integration tests and have the Redis cluster tests pass.

  • If you would like to run your own Redis cluster, you can run the above commands with your own specified ports and connect to the newly configured cluster.

  • To stop the cluster, run ./infra/scripts/redis-cluster.sh stop and then ./infra/scripts/redis-cluster.sh clean.

Adding a new online store

Overview

In this guide, we will show you how to integrate with MySQL as an online store. While we will be implementing a specific store, this guide should be representative for adding support for any new online store.

The process of using a custom online store consists of 6 steps:

  1. Defining the OnlineStore class.

  2. Defining the OnlineStoreConfig class.

  3. Referencing the OnlineStore in a feature repo's feature_store.yaml file.

  4. Testing the OnlineStore class.

  5. Update dependencies.

  6. Add documentation.

1. Defining an OnlineStore class

OnlineStore class names must end with the OnlineStore suffix!

Contrib online stores

New online stores go in sdk/python/feast/infra/online_stores/contrib/.

What is a contrib plugin?

  • Not guaranteed to implement all interface methods

  • Not guaranteed to be stable.

  • Should have warnings for users to indicate this is a contrib plugin that is not maintained by the core Feast maintainers.

How do I make a contrib plugin an "official" plugin?

To move an online store plugin out of contrib, you need:

  • GitHub Actions (i.e. make test-python-integration) are set up to run all tests against the online store and pass.

  • At least two contributors own the plugin (ideally tracked in our OWNERS / CODEOWNERS file).

The OnlineStore class broadly contains two sets of methods

  • One set deals with managing the infrastructure that the online store needs for operations

  • One set deals with writing data into the store, and reading data from the store.

1.1 Infrastructure Methods

There are two methods that deal with managing infrastructure for online stores, update and teardown

  • update is invoked when users run feast apply as a CLI command, or the FeatureStore.apply() sdk method.

The update method should be used to perform any operations necessary before data can be written to or read from the store. The update method can be used to create MySQL tables in preparation for reads and writes to new feature views.

  • teardown is invoked when users run feast teardown or FeatureStore.teardown().

The teardown method should be used to perform any clean-up operations. teardown can be used to drop MySQL indices and tables corresponding to the feature views being deleted.

1.2 Read/Write Methods

There are two methods that deal with writing data to and reading data from the online store: online_write_batch and online_read.

  • online_write_batch is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method).

  • online_read is invoked when reading values from the online store using the FeatureStore.get_online_features() method.

2. Defining an OnlineStoreConfig class

Additional configuration may be needed to allow the OnlineStore to talk to the backing store. For example, MySQL may need configuration information like the host at which the MySQL instance is running, credentials for connecting to the database, etc.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OnlineStore class.

Additionally, the name of the config class must be the same as the OnlineStore class, with the Config suffix.

An example of the config class for MySQL:
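
The class is roughly of the following shape (a sketch; the module path and connection fields are assumptions, only FeastConfigBaseModel is a Feast API):

feast_custom_online_store/mysql.py
from typing import Literal, Optional

from pydantic import StrictStr

from feast.repo_config import FeastConfigBaseModel


class MySQLOnlineStoreConfig(FeastConfigBaseModel):
    """ Online store config for the MySQL online store (illustrative sketch) """

    type: Literal["feast_custom_online_store.mysql.MySQLOnlineStore"] \
        = "feast_custom_online_store.mysql.MySQLOnlineStore"

    host: Optional[StrictStr] = None
    user: Optional[StrictStr] = None
    password: Optional[StrictStr] = None
    database: Optional[StrictStr] = None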

This configuration can be specified in the feature_store.yaml as follows:
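
For example (values are placeholders matching the config class sketched above):

feature_repo/feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
online_store:
    type: feast_custom_online_store.mysql.MySQLOnlineStore
    host: localhost
    user: root
    password: mysql
    database: feast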

This configuration information is available to the methods of the OnlineStore, via the config: RepoConfig parameter which is passed into all the methods of the OnlineStore interface, specifically at the config.online_store field of the config parameter.

3. Using the custom online store

After implementing both these classes, the custom online store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the online_store field. The value specified should be the fully qualified class name of the OnlineStore.

As long as your OnlineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

To use our MySQL online store, we can use the following feature_store.yaml:
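
A sketch, reusing the placeholder values from Section 2:

feature_repo/feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
online_store:
    type: feast_custom_online_store.mysql.MySQLOnlineStore
    host: localhost
    user: root
    password: mysql
    database: feast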

If additional configuration for the online store is not required, then we can omit the other fields and only specify the type of the online store class as the value for the online_store.
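
For example:

feature_repo/feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
online_store: feast_custom_online_store.mysql.MySQLOnlineStore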

4. Testing the OnlineStore class

4.1 Integrating with the integration test suite and unit test suite.

Even if you have created the OnlineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo.

  1. In the Feast submodule, we can run all the unit tests with make test-python and make sure they pass.

  2. The universal tests, which are integration tests specifically intended to test offline and online stores, should be run against Feast to ensure that the Feast APIs work with your online store.

    • Feast parametrizes integration tests using the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py which stores different online store classes for testing.

    • To overwrite these configurations, you can simply create your own file that contains a FULL_REPO_CONFIGS variable, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE to point to that file.

A sample FULL_REPO_CONFIGS_MODULE looks something like this:
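
A hedged sketch of such a module is shown below; the exact structure Feast expects may differ slightly between versions, so treat the dictionary name and entries as illustrative. The module is pointed to via the FULL_REPO_CONFIGS_MODULE environment variable.

# feast_custom_online_store/feast_tests.py (illustrative)
AVAILABLE_ONLINE_STORES = {
    "sqlite": ({"type": "sqlite"}, None),
}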

If you are planning to start the online store up locally (e.g. spin up a local Redis instance) for testing, then the dictionary entry should be something like:
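
For example, for a local Redis instance (the connection string is a placeholder):

AVAILABLE_ONLINE_STORES = {
    "redis": ({"type": "redis", "connection_string": "localhost:6379,db=0"}, None),
}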

If you are planning instead to use a Dockerized container to run your tests against your online store, you can define an OnlineStoreCreator and replace the None object above with your OnlineStoreCreator class. You should make this class available to pytest through the PYTEST_PLUGINS environment variable.

If you create a containerized docker image for testing, developers who are trying to test with your online store will not have to spin up their own instance of the online store for testing. An example of an OnlineStoreCreator is shown below:
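
A minimal sketch of such a creator is shown below, assuming a base class along the lines of the one used in the Feast test suite; the import path, constructor behavior, and method names are assumptions, not guaranteed APIs.

from typing import Dict

from tests.integration.feature_repos.universal.online_store_creator import (
    OnlineStoreCreator,
)


class MySQLOnlineStoreCreator(OnlineStoreCreator):
    def create_online_store(self) -> Dict[str, str]:
        # Start the Dockerized MySQL container here (e.g. via the docker SDK or
        # testcontainers) and return the online_store config pointing at it.
        return {
            "type": "feast_custom_online_store.mysql.MySQLOnlineStore",
            "host": "localhost",
        }

    def teardown(self):
        # Stop and remove the container started in create_online_store.
        pass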

3. Add a Makefile target to the Makefile to run your datastore specific tests by setting the FULL_REPO_CONFIGS_MODULE environment variable. Add PYTEST_PLUGINS if pytest is having trouble loading your DataSourceCreator. You can remove certain tests that are not relevant or still do not work for your datastore using the -k option.

  • If there are some tests that fail, this indicates that there is a mistake in the implementation of this online store!

5. Add Dependencies

Add any dependencies for your online store to our sdk/python/setup.py under a new <ONLINE_STORE>_REQUIRED list, and add it to the setup script so that users can install the necessary Python packages when your online store is needed. These packages should be defined as extras so that they are not installed by default.

  • You will need to regenerate our requirements files. To do this, create separate pyenv environments for python 3.8, 3.9, and 3.10. In each environment, run the following commands:

6. Add Documentation

Remember to add the documentation for your online store.

  1. Add a new markdown file to docs/reference/online-stores/.

  2. You should also add a reference in docs/reference/online-stores/README.md and docs/SUMMARY.md. Add a new markdown document to document your online store functionality similar to how the other online stores are documented.

NOTE: Be sure to document the following things about your online store:

  • Be sure to cover how to create the datasource and what configuration is needed in the feature_store.yaml file in order to create the datasource.

  • Make sure to flag that the online store is in alpha development.

  • Add some documentation on what the data model is for the specific online store for more clarity.

  • Finally, generate the python code docs by running:

Codebase Structure

Let's examine the Feast codebase. This analysis is accurate as of Feast 0.23.

Python SDK

The Python SDK lives in sdk/python/feast. The majority of Feast logic lives in these Python files:

  • The FeatureStore class is defined in feature_store.py, and the associated configuration object (the Python representation of the feature_store.yaml file) is defined in repo_config.py.

  • The CLI and other core feature store logic are defined in cli.py and repo_operations.py.

  • The type system that is used to manage conversion between Feast types and external typing systems is managed in type_map.py.

  • The Python feature server (the server that is started through the feast serve command) is defined in feature_server.py.

There are also several important submodules:

  • infra/ contains all the infrastructure components, such as the provider, offline store, online store, batch materialization engine, and registry.

  • dqm/ covers data quality monitoring, such as the dataset profiler.

  • diff/ covers the logic for determining how to apply infrastructure changes upon feature repo changes (e.g. the output of feast plan and feast apply).

  • embedded_go/ covers the Go feature server.

  • ui/ contains the embedded Web UI, to be launched on the feast ui command.

Example flow: feast apply

Let's walk through how feast apply works by tracking its execution across the codebase.

  1. All CLI commands are in cli.py. Most of these commands are backed by methods in repo_operations.py. The feast apply command triggers apply_total_command, which then calls apply_total in repo_operations.py.

  2. With a FeatureStore object (from feature_store.py) that is initialized based on the feature_store.yaml in the current working directory, apply_total first parses the feature repo with parse_repo and then calls either FeatureStore.apply or FeatureStore._apply_diffs to apply those changes to the feature store.

  3. Let's examine FeatureStore.apply. It splits the objects based on class (e.g. Entity, FeatureView, etc.) and then calls the appropriate registry method to apply or delete the object. For example, it might call self._registry.apply_entity to apply an entity. If the default file-based registry is used, this logic can be found in infra/registry/registry.py.

  4. Then the feature store must update its cloud infrastructure (e.g. online store tables) to match the new feature repo, so it calls Provider.update_infra, which can be found in infra/provider.py.

  5. Assuming the provider is a built-in provider (e.g. one of the local, GCP, or AWS providers), it will call PassthroughProvider.update_infra in infra/passthrough_provider.py.

  6. This delegates to the online store and batch materialization engine. For example, if the feature store is configured to use the Redis online store then the update method from infra/online_stores/redis.py will be called. And if the local materialization engine is configured then the update method from infra/materialization/local_engine.py will be called.

At this point, the feast apply command is complete.

Example flow: feast materialize

Let's walk through how feast materialize works by tracking its execution across the codebase.

  1. The feast materialize command triggers materialize_command in cli.py, which then calls FeatureStore.materialize from feature_store.py.

  2. This then calls Provider.materialize_single_feature_view, which can be found in infra/provider.py.

  3. As with feast apply, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.materialize_single_feature_view will be called.

  4. This delegates to the underlying batch materialization engine. Assuming that the local engine has been configured, LocalMaterializationEngine.materialize from infra/materialization/local_engine.py will be called.

  5. Since materialization involves reading features from the offline store and writing them to the online store, the local engine will delegate to both the offline store and online store. Specifically, it will call OfflineStore.pull_latest_from_table_or_query and OnlineStore.online_write_batch. These two calls will be routed to the offline store and online store that have been configured.

Example flow: get_historical_features

Let's walk through how get_historical_features works by tracking its execution across the codebase.

  1. We start with FeatureStore.get_historical_features in feature_store.py. This method does some internal preparation, and then delegates the actual execution to the underlying provider by calling Provider.get_historical_features, which can be found in infra/provider.py.

  2. As with feast apply, the provider is most likely backed by the passthrough provider, in which case PassthroughProvider.get_historical_features will be called.

  3. That call simply delegates to OfflineStore.get_historical_features. So if the feature store is configured to use Snowflake as the offline store, SnowflakeOfflineStore.get_historical_features will be executed.

Java SDK

Go feature server

The go/ directory contains the Go feature server. Most of the files here have logic to help with reading features from the online store. Within go/, the internal/feast/ directory contains most of the core logic:

  • onlineserving/ covers the core serving logic.

  • model/ contains the implementations of the Feast objects (entity, feature view, etc.).

    • For example, entity.go is the Go equivalent of entity.py. It contains a very simple Go implementation of the entity object.

  • registry/ covers the registry.

    • Currently only the file-based registry is supported (the sql-based registry is unsupported). Additionally, the file-based registry only supports a file-based registry store, not the GCS or S3 registry stores.

  • onlinestore/ covers the online stores (currently only Redis and SQLite are supported).

Protobufs

Typically, changes being made to the Feast objects require changes to their corresponding protobuf representations. The usual best practices for making changes to protobufs should be followed to ensure backwards and forwards compatibility.

Web UI

Adding a custom provider

Overview

All Feast operations execute through a provider: operations like materializing data from the offline store to the online store, updating infrastructure like databases, launching streaming ingestion jobs, building training datasets, and reading features from the online store.

Custom providers allow Feast users to extend Feast to execute any custom logic. Examples include:

  • Launching custom streaming ingestion jobs (Spark, Beam)

  • Launching custom batch ingestion (materialization) jobs (Spark, Beam)

  • Adding custom validation to feature repositories during feast apply

  • Adding custom infrastructure setup logic which runs during feast apply

  • Extending Feast commands with in-house metrics, logging, or tracing

Guide

The fastest way to add custom logic to Feast is to extend an existing provider. The most generic provider is the LocalProvider which contains no cloud-specific logic. The guide that follows will extend the LocalProvider with operations that print text to the console. It is up to you as a developer to add your custom code to the provider methods, but the guide below will provide the necessary scaffolding to get you started.

Step 1: Define a Provider class

The first step is to define a custom provider class. We've created the MyCustomProvider below.
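
A minimal sketch of such a provider is shown below. The *args/**kwargs pass-through is used because the exact method signatures vary slightly between Feast versions, and the module path is an assumption; replace the print statements with your custom logic.

feast_custom_provider/custom_provider.py
from feast.infra.local import LocalProvider  # import path may differ between Feast versions


class MyCustomProvider(LocalProvider):
    def update_infra(self, *args, **kwargs):
        # Delegate to the LocalProvider, then run custom logic, e.g. launch an
        # idempotent streaming ingestion job.
        super().update_infra(*args, **kwargs)
        print("Launching custom streaming jobs is pretty easy...")

    def materialize_single_feature_view(self, *args, **kwargs):
        # Delegate to the LocalProvider, then run custom logic, e.g. launch a
        # batch ingestion job.
        super().materialize_single_feature_view(*args, **kwargs)
        print("Launching custom batch jobs is pretty easy...")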

Notice how in the above provider we have only overridden two of the methods on the LocalProvider, namely update_infra and materialize_single_feature_view. These two methods are convenient to replace if you are planning to launch custom batch or streaming jobs. update_infra can be used for launching idempotent streaming jobs, and materialize_single_feature_view can be used for launching batch ingestion jobs.

Step 2: Configuring Feast to use the provider
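
Assuming the provider class sketched above lives in a module called feast_custom_provider.custom_provider, the configuration looks something like this:

feature_repo/feature_store.yaml
project: repo
registry: registry.db
provider: feast_custom_provider.custom_provider.MyCustomProvider
online_store:
    type: sqlite
    path: online_store.db
offline_store:
    type: file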

Notice how the provider field above points to the module and class where your provider can be found.

Step 3: Using the provider

Now you should be able to use your provider by running a Feast command:
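
For example (the output below assumes the MyCustomProvider sketch above and will vary with your feature repo):

feast apply
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
Launching custom streaming jobs is pretty easy...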

It may also be necessary to add the module root path to your PYTHONPATH as follows:
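
For example (the path is a placeholder for wherever your provider module lives):

PYTHONPATH=$PYTHONPATH:/home/my_user/my_custom_provider feast apply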

That's it. You should now have a fully functional custom provider!

Next steps

Data sources

Structuring Feature Repos

A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set). Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.

Setting up multiple environments

There are three common ways teams approach having separate environments

  1. Have separate git branches for each environment

  2. Have separate feature_store.yaml files and separate Feast object definitions that correspond to each environment

  3. Have separate feature_store.yaml files per environment, but share the Feast object definitions

Different version control branches

To keep a clear separation of the feature repos, teams may choose to have multiple long-lived branches in their version control system, one for each environment. In this approach, with CI/CD setup, changes would first be made to the staging branch, and then copied over manually to the production branch once verified in the staging environment.

Separate feature_store.yaml files and separate Feast object definitions

The contents of this repository are shown below:
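
A sketch of such a layout (file names are placeholders):

├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
├── staging
│   ├── driver_features.py
│   └── feature_store.yaml
└── production
    ├── driver_features.py
    └── feature_store.yaml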

The repository contains three sub-folders:

  • staging/: This folder contains the staging feature_store.yaml and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.

  • production/: This folder contains the production feature_store.yaml and Feast objects. Typically users would first test changes in staging before copying the feature definitions into the production folder and committing the changes.

  • .github: This folder is an example of a CI system that applies the changes in either the staging or production repositories using feast apply. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.

The feature_store.yaml contains the following:
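
A sketch of the production configuration (bucket, project, and provider values are placeholders):

production/feature_store.yaml
project: my_feature_repo
registry: gs://my-feature-store-bucket/registry.db
provider: gcp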

Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply are tracked in the registry.db. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.

It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.

If your organization consists of many independent data science teams, or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment.

Shared Feast Object definitions with separate feature_store.yaml files

This approach is very similar to the previous approach, but instead of having feast objects duplicated and having to copy over changes, it may be possible to share the same Feast object definitions and have different feature_store.yaml configuration.

An example of how such a repository would be structured is as follows:
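
A sketch of such a layout (file names are placeholders):

├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
├── staging
│   └── feature_store.yaml
├── production
│   └── feature_store.yaml
└── driver_features.py   # shared Feast object definitions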

Users can then apply the shared definitions to each environment by running feast apply against the corresponding environment's feature_store.yaml.

This setup has the advantage that you can share the feature definitions entirely, which may prevent issues with copy-pasting code.

Summary

In summary, once you have set up a Git based repository with CI that runs feast apply on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.

The default Feast registry is file-based. Any changes to the feature repo, or materializing data into the online store, result in a mutation to the registry.

The recommended solution in this case is to use the SQL registry, which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL, Postgres, etc).

Feast supports pluggable batch materialization engines that allow the materialization process to be scaled up. Aside from the local engine, Feast also supports a Lambda-based materialization engine and other experimental engines.

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See the reference documentation for a detailed explanation of feature repositories.

Feast comes with built-in materialization engines, e.g., LocalMaterializationEngine, and an experimental LambdaMaterializationEngine. However, users can develop their own materialization engines by creating a class that implements the contract in the BatchMaterializationEngine base class.

Configure your file to point to your new engine class:

The upgrade command aims to automatically modify the object definitions in a feature repo to match the API required by Feast 0.24+. When running the command, the Feast CLI analyzes the source code in the feature repo files using , and attempted to rewrite the files in a best-effort way. It's possible for there to be parts of the API that are not upgraded automatically.

Feast makes adding support for a new offline store easy. Developers can simply implement the interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and Bigquery).

The full working code for this guide can be found at .

(optional) offline_write_batch is a method that supports directly pushing a pyarrow table to a feature view. Given a feature view with a specific schema, this function should write the pyarrow table to the batch source defined. More details about the push API can be found here. This method only needs to be implemented if you want to support the push API in your offline store.
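For context, a minimal sketch of how this method ends up being exercised from the user side, mirroring the push example later in this document: pushing a dataframe with PushMode.OFFLINE (or PushMode.ONLINE_AND_OFFLINE) routes the write through the offline store's offline_write_batch. The push source name below is a placeholder.

```python
import pandas as pd

from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path=".")

# Rows matching the schema of the feature view attached to the push source.
event_df = pd.DataFrame()

# Routing the push to the offline store exercises offline_write_batch.
store.push("push_source_name", event_df, to=PushMode.OFFLINE)
```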

To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined here.

The FeastConfigBaseModel is a pydantic class, which parses YAML configuration into Python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.
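For illustration, here is a minimal sketch of what such a config class could look like. The class and module names are hypothetical, and the validator simply shows the pydantic pattern described above.

```python
from typing import Literal, Optional

from pydantic import StrictStr, validator

from feast.repo_config import FeastConfigBaseModel


class CustomOfflineStoreConfig(FeastConfigBaseModel):
    """Config for a hypothetical custom offline store (illustrative only)."""

    # The type must be the fully qualified import path of the offline store class.
    type: Literal[
        "feast_custom_offline_store.custom.CustomOfflineStore"
    ] = "feast_custom_offline_store.custom.CustomOfflineStore"

    host: StrictStr
    port: int = 443
    database: Optional[StrictStr] = None

    @validator("port")
    def port_must_be_valid(cls, v):
        # Example of a pydantic validator rejecting obviously invalid config values.
        if not 0 < v < 65536:
            raise ValueError(f"invalid port: {v}")
        return v
```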

Users who want to have their offline store support scalable batch materialization for online use cases (detailed in this RFC) will also need to implement to_remote_storage to distribute the reading and writing of offline store records to blob storage (such as S3). This may be used by a custom Materialization Engine to parallelize the materialization of data by processing it in chunks. If this is not implemented, Feast will default to local materialization (pulling all records into memory to materialize).
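A small sketch of how a caller can consume this method, mirroring the integration test shown later on this page; the fallback branch is the local, in-memory path.

```python
from typing import List

import pandas as pd

from feast import FeatureStore


def export_for_materialization(store: FeatureStore, entity_df: pd.DataFrame, features: List[str]):
    job = store.get_historical_features(entity_df=entity_df, features=features)

    if job.supports_remote_storage_export():
        # A list of file URIs (e.g. on S3) that workers can then process in chunks.
        return job.to_remote_storage()

    # Otherwise fall back to pulling all records into memory.
    return job.to_df()
```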

Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource base class needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.
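A bare-bones sketch of such a subclass, assuming a hypothetical CustomSource that only needs a table name; a real implementation must also provide the remaining abstract DataSource methods (proto serialization, schema inference, validation) as shown in the demo repository.

```python
from typing import Optional

from feast.data_source import DataSource


class CustomSource(DataSource):
    """Hypothetical batch source for a custom offline store (sketch only)."""

    def __init__(self, name: str, table: str, timestamp_field: Optional[str] = None):
        super().__init__(name=name, timestamp_field=timestamp_field)
        # The main job of the subclass: carry the information feature views need
        # in order to read historical values, e.g. which table to query.
        self.table = table

    # NOTE: DataSource is abstract; the remaining methods (to_proto/from_proto,
    # validation, column name and type inference) still need to be implemented
    # before this class can actually be instantiated and used.
```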

Feast uses an internal type system to provide guarantees on training and serving data. Feast currently supports eight primitive types - INT32, INT64, FLOAT32, FLOAT64, STRING, BYTES, BOOL, and UNIX_TIMESTAMP - and the corresponding array types. Null types are not supported, although the UNIX_TIMESTAMP type is nullable. The type system is controlled by Value.proto in protobuf and by types.py in Python. Type conversion logic can be found in type_map.py.
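These types are what users reference when declaring feature view schemas. A small sketch with made-up field names:

```python
from feast import Field
from feast.types import Bool, Float32, Int64, String, UnixTimestamp

# Each Field's dtype is one of the Feast types listed above; at retrieval time
# the values are carried as the corresponding Value proto types.
schema = [
    Field(name="driver_id", dtype=Int64),
    Field(name="conv_rate", dtype=Float32),
    Field(name="is_active", dtype=Bool),
    Field(name="home_city", dtype=String),
    Field(name="last_trip_ts", dtype=UnixTimestamp),
]
```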

Feast serves feature values as Value proto objects, which have a type corresponding to Feast types. Thus Feast must materialize feature values into the online store as Value proto objects.

As mentioned above in the section on materialization, Feast persists feature values into the online store as Value proto objects. A call to get_online_features will return an OnlineResponse object, which essentially wraps a bunch of Value protos with some metadata. The OnlineResponse object can then be converted into a Python dictionary; this conversion calls feast_value_type_to_python_type from type_map.py, a utility that converts the Feast internal types to native Python types.
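In user code this looks roughly as follows (feature and entity names are placeholders):

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# get_online_features returns an OnlineResponse wrapping Value protos; to_dict()
# converts them to native Python types via feast_value_type_to_python_type.
response = store.get_online_features(
    features=["driver_hourly_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
)
feature_dict = response.to_dict()
```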

conftest.py (in the parent directory) contains the most common fixtures, which are designed as an abstraction on top of specific offline/online stores, so tests do not need to be rewritten for different stores. Individual test files also contain more specific fixtures.

Serialization tests (due to this issue)

See the custom offline store demo and the custom online store demo for examples.

Feast makes adding support for a new online store (database) easy. Developers can simply implement the OnlineStore interface to add support for a new store (other than the existing stores like Redis, DynamoDB, SQLite, and Datastore).

The full working code for this guide can be found at feast-dev/feast-custom-online-store-demo.

To facilitate configuration, all OnlineStore implementations are required to also define a corresponding OnlineStoreConfig class in the same file. This OnlineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined here.

The FeastConfigBaseModel is a pydantic class, which parses YAML configuration into Python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.

The core Feast objects (entities, feature views, data sources, etc.) are defined in their respective Python files, such as entity.py, feature_view.py, and data_source.py.

Of these submodules, infra/ is the most important. It contains the interfaces for the provider, offline store, online store, batch materialization engine, and registry, as well as all of their individual implementations.

The tests for the Python SDK are contained in sdk/python/tests. For more details, see this overview of the test suite.

The java/ directory contains the Java serving component. See here for more details on how the repo is structured.

Feast uses protobuf to store serialized versions of the core Feast objects. The protobuf definitions are stored in protos/feast.

The registry consists of the serialized representations of the Feast objects.

The ui/ directory contains the Web UI. See here for more details on the structure of the Web UI.

Feast comes with built-in providers, e.g., LocalProvider, GcpProvider, and AwsProvider. However, users can develop their own providers by creating a class that implements the contract in the Provider class.

This guide also comes with a fully functional custom provider demo repository. Please have a look at the repository for a representative example of what a custom provider looks like, or fork the repository when creating your own provider.

It is possible to overwrite all the methods on the provider class. In fact, it isn't even necessary to subclass an existing provider like LocalProvider. The only requirement for the provider class is that it follows the Provider contract.

Configure your feature_store.yaml file to point to your new provider class:

Have a look at the custom provider demo repository for a fully functional example of a custom provider. Feel free to fork it when creating your own custom provider!

Please see Data Source for a conceptual explanation of data sources.

For this approach, we have created an example repository (Feast Repository Example) which contains two Feast projects, one per environment.

$ tree
.
├── e2e
│   ├── test_go_feature_server.py
│   ├── test_python_feature_server.py
│   ├── test_universal_e2e.py
│   ├── test_usage_e2e.py
│   └── test_validation.py
├── feature_repos
│   ├── integration_test_repo_config.py
│   ├── repo_configuration.py
│   └── universal
│       ├── catalog
│       ├── data_source_creator.py
│       ├── data_sources
│       │   ├── __init__.py
│       │   ├── bigquery.py
│       │   ├── file.py
│       │   ├── redshift.py
│       │   └── snowflake.py
│       ├── entities.py
│       ├── feature_views.py
│       ├── online_store
│       │   ├── __init__.py
│       │   ├── datastore.py
│       │   ├── dynamodb.py
│       │   ├── hbase.py
│       │   └── redis.py
│       └── online_store_creator.py
├── materialization
│   └── test_lambda.py
├── offline_store
│   ├── test_feature_logging.py
│   ├── test_offline_write.py
│   ├── test_push_features_to_offline_store.py
│   ├── test_s3_custom_endpoint.py
│   └── test_universal_historical_retrieval.py
├── online_store
│   ├── test_push_features_to_online_store.py
│   └── test_universal_online.py
└── registration
    ├── test_feature_store.py
    ├── test_inference.py
    ├── test_registry.py
    ├── test_universal_cli.py
    ├── test_universal_odfv_feature_inference.py
    └── test_universal_types.py
@pytest.mark.integration
@pytest.mark.universal_offline_stores
@pytest.mark.parametrize("full_feature_names", [True, False], ids=lambda v: f"full:{v}")
def test_historical_features(environment, universal_data_sources, full_feature_names):
    store = environment.feature_store

    (entities, datasets, data_sources) = universal_data_sources

    feature_views = construct_universal_feature_views(data_sources)

    entity_df_with_request_data = datasets.entity_df.copy(deep=True)
    entity_df_with_request_data["val_to_add"] = [
        i for i in range(len(entity_df_with_request_data))
    ]
    entity_df_with_request_data["driver_age"] = [
        i + 100 for i in range(len(entity_df_with_request_data))
    ]

    feature_service = FeatureService(
        name="convrate_plus100",
        features=[feature_views.driver[["conv_rate"]], feature_views.driver_odfv],
    )
    feature_service_entity_mapping = FeatureService(
        name="entity_mapping",
        features=[
            feature_views.location.with_name("origin").with_join_key_map(
                {"location_id": "origin_id"}
            ),
            feature_views.location.with_name("destination").with_join_key_map(
                {"location_id": "destination_id"}
            ),
        ],
    )

    store.apply(
        [
            driver(),
            customer(),
            location(),
            feature_service,
            feature_service_entity_mapping,
            *feature_views.values(),
        ]
    )
    # ... more test code

    job_from_df = store.get_historical_features(
        entity_df=entity_df_with_request_data,
        features=[
            "driver_stats:conv_rate",
            "driver_stats:avg_daily_trips",
            "customer_profile:current_balance",
            "customer_profile:avg_passenger_count",
            "customer_profile:lifetime_trip_count",
            "conv_rate_plus_100:conv_rate_plus_100",
            "conv_rate_plus_100:conv_rate_plus_100_rounded",
            "conv_rate_plus_100:conv_rate_plus_val_to_add",
            "order:order_is_success",
            "global_stats:num_rides",
            "global_stats:avg_ride_length",
            "field_mapping:feature_name",
        ],
        full_feature_names=full_feature_names,
    )

    if job_from_df.supports_remote_storage_export():
        files = job_from_df.to_remote_storage()
        print(files)
        assert len(files) > 0  # This test should be way more detailed

    start_time = datetime.utcnow()
    actual_df_from_df_entities = job_from_df.to_df()
    # ... more test code

    validate_dataframes(
        expected_df,
        table_from_df_entities,
        sort_by=[event_timestamp, "order_id", "driver_id", "customer_id"],
        event_timestamp=event_timestamp,
    )
    # ... more test code
@pytest.mark.universal_online_stores(only=["redis"])
@pytest.mark.integration
def your_test(environment: Environment):
    fs = environment.feature_store
    df = ...  # construct a dataframe with your test feature data here
    data_source = environment.data_source_creator.create_data_source(
        df,
        destination_name=fs.project
    )
    your_fv = driver_feature_view(data_source)
    entity = driver(value_type=ValueType.UNKNOWN)
    fs.apply([your_fv, entity])

    # ... run test
Starting 6001
Starting 6002
Starting 6003
Starting 6004
Starting 6005
Starting 6006
feast_custom_online_store/mysql.py
# Only prints out runtime warnings once.
warnings.simplefilter("once", RuntimeWarning)

def update(
    self,
    config: RepoConfig,
    tables_to_delete: Sequence[Union[FeatureTable, FeatureView]],
    tables_to_keep: Sequence[Union[FeatureTable, FeatureView]],
    entities_to_delete: Sequence[Entity],
    entities_to_keep: Sequence[Entity],
    partial: bool,
):
    """
    An example of creating managing the tables needed for a mysql-backed online store.
    """
    warnings.warn(
        "This online store is an experimental feature in alpha development. "
        "Some functionality may still be unstable so functionality can change in the future.",
        RuntimeWarning,
    )
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)

    project = config.project

    for table in tables_to_keep:
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {_table_id(project, table)} (entity_key VARCHAR(512), feature_name VARCHAR(256), value BLOB, event_ts timestamp, created_ts timestamp,  PRIMARY KEY(entity_key, feature_name))"
        )
        cur.execute(
            f"CREATE INDEX {_table_id(project, table)}_ek ON {_table_id(project, table)} (entity_key);"
        )

    for table in tables_to_delete:
        cur.execute(
            f"DROP INDEX {_table_id(project, table)}_ek ON {_table_id(project, table)};"
        )
        cur.execute(f"DROP TABLE IF EXISTS {_table_id(project, table)}")


def teardown(
    self,
    config: RepoConfig,
    tables: Sequence[Union[FeatureTable, FeatureView]],
    entities: Sequence[Entity],
):
    warnings.warn(
        "This online store is an experimental feature in alpha development. "
        "Some functionality may still be unstable so functionality can change in the future.",
        RuntimeWarning,
    )
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)
    project = config.project

    for table in tables:
        cur.execute(
            f"DROP INDEX {_table_id(project, table)}_ek ON {_table_id(project, table)};"
        )
        cur.execute(f"DROP TABLE IF EXISTS {_table_id(project, table)}")
feast_custom_online_store/mysql.py
# Only prints out runtime warnings once.
warnings.simplefilter("once", RuntimeWarning)

def online_write_batch(
    self,
    config: RepoConfig,
    table: Union[FeatureTable, FeatureView],
    data: List[
        Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
    ],
    progress: Optional[Callable[[int], Any]],
) -> None:
    warnings.warn(
        "This online store is an experimental feature in alpha development. "
        "Some functionality may still be unstable so functionality can change in the future.",
        RuntimeWarning,
    )
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)

    project = config.project

    for entity_key, values, timestamp, created_ts in data:
        entity_key_bin = serialize_entity_key(
            entity_key,
            entity_key_serialization_version=config.entity_key_serialization_version,
        ).hex()
        timestamp = _to_naive_utc(timestamp)
        if created_ts is not None:
            created_ts = _to_naive_utc(created_ts)

        for feature_name, val in values.items():
            self.write_to_table(created_ts, cur, entity_key_bin, feature_name, project, table, timestamp, val)
        self._conn.commit()
        if progress:
            progress(1)

def online_read(
    self,
    config: RepoConfig,
    table: Union[FeatureTable, FeatureView],
    entity_keys: List[EntityKeyProto],
    requested_features: Optional[List[str]] = None,
) -> List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]]:
    warnings.warn(
        "This online store is an experimental feature in alpha development. "
        "Some functionality may still be unstable so functionality can change in the future.",
        RuntimeWarning,
    )
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)

    result: List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]] = []

    project = config.project
    for entity_key in entity_keys:
        entity_key_bin = serialize_entity_key(
            entity_key,
            entity_key_serialization_version=config.entity_key_serialization_version,
        ).hex()
        print(f"entity_key_bin: {entity_key_bin}")

        cur.execute(
            f"SELECT feature_name, value, event_ts FROM {_table_id(project, table)} WHERE entity_key = %s",
            (entity_key_bin,),
        )

        res = {}
        res_ts = None
        for feature_name, val_bin, ts in cur.fetchall():
            val = ValueProto()
            val.ParseFromString(val_bin)
            res[feature_name] = val
            res_ts = ts

        if not res:
            result.append((None, None))
        else:
            result.append((res_ts, res))
    return result
feast_custom_online_store/mysql.py
class MySQLOnlineStoreConfig(FeastConfigBaseModel):
    type: Literal["feast_custom_online_store.mysql.MySQLOnlineStore"] = "feast_custom_online_store.mysql.MySQLOnlineStore"

    host: Optional[StrictStr] = None
    user: Optional[StrictStr] = None
    password: Optional[StrictStr] = None
    database: Optional[StrictStr] = None
feature_repo/feature_store.yaml
online_store:
    type: feast_custom_online_store.mysql.MySQLOnlineStore
    user: foo
    password: bar
feast_custom_online_store/mysql.py
def online_write_batch(
        self,
        config: RepoConfig,
        table: Union[FeatureTable, FeatureView],
        data: List[
            Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
        ],
        progress: Optional[Callable[[int], Any]],
) -> None:

    online_store_config = config.online_store
    assert isinstance(online_store_config, MySQLOnlineStoreConfig)

    connection = mysql.connector.connect(
        host=online_store_config.host or "127.0.0.1",
        user=online_store_config.user or "root",
        password=online_store_config.password,
        database=online_store_config.database or "feast",
        autocommit=True
    )
feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
online_store:
    # Make sure to specify the type as the fully qualified path that Feast can import.
    type: feast_custom_online_store.mysql.MySQLOnlineStore
    user: foo
    password: bar
feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
online_store: feast_custom_online_store.mysql.MySQLOnlineStore
make test-python
sdk/python/feast/infra/online_stores/contrib/postgres_repo_configuration.py
from feast.infra.offline_stores.contrib.postgres_offline_store.tests.data_source import (
    PostgreSQLDataSourceCreator,
)

AVAILABLE_ONLINE_STORES = {"postgres": (None, PostgreSQLDataSourceCreator)}
{
    "sqlite": ({"type": "sqlite"}, None),
    # Specifies sqlite as the online store. The `None` object specifies to not use a containerized docker container.
}
sdk/python/tests/integration/feature_repos/universal/online_store/redis.py
class RedisOnlineStoreCreator(OnlineStoreCreator):
    def __init__(self, project_name: str, **kwargs):
        super().__init__(project_name)
        self.container = RedisContainer("redis:6.2.6")

    def create_online_store(self) -> Dict[str, str]:
        self.container.start()
        # Wait for the container to be ready before returning the online store config.
        log_string_to_wait_for = "Ready to accept connections"
        wait_for_logs(
            container=self.container, predicate=log_string_to_wait_for, timeout=10
        )
        exposed_port = self.container.get_exposed_port("6379")
        return {"type": "redis", "connection_string": f"localhost:{exposed_port}"}

    def teardown(self):
        self.container.stop()
Makefile
test-python-universal-cassandra:
	PYTHONPATH='.' \
	FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.online_stores.contrib.cassandra_repo_configuration \
	PYTEST_PLUGINS=sdk.python.tests.integration.feature_repos.universal.online_store.cassandra \
	FEAST_USAGE=False \
	IS_TEST=True \
	python -m pytest -x --integration \
	sdk/python/tests
export PYTHON=<version>
make lock-python-ci-dependencies
make build-sphinx
$ tree -L 1 -d
.
├── docs
├── examples
├── go
├── infra
├── java
├── protos
├── sdk
└── ui
$ tree --dirsfirst -L 1 infra   
infra
├── contrib
├── feature_servers
├── materialization
├── offline_stores
├── online_stores
├── registry
├── transformation_servers
├── utils
├── __init__.py
├── aws.py
├── gcp.py
├── infra_object.py
├── key_encoding_utils.py
├── local.py
├── passthrough_provider.py
└── provider.py
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

from tqdm import tqdm

from feast.entity import Entity
from feast.feature_table import FeatureTable
from feast.feature_view import FeatureView
from feast.infra.local import LocalProvider
from feast.infra.offline_stores.offline_store import RetrievalJob
from feast.protos.feast.types.EntityKey_pb2 import EntityKey as EntityKeyProto
from feast.protos.feast.types.Value_pb2 import Value as ValueProto
from feast.infra.registry.registry import Registry
from feast.repo_config import RepoConfig


class MyCustomProvider(LocalProvider):
    def __init__(self, config: RepoConfig, repo_path):
        super().__init__(config)
        # Add your custom init code here. This code runs on every Feast operation.

    def update_infra(
        self,
        project: str,
        tables_to_delete: Sequence[Union[FeatureTable, FeatureView]],
        tables_to_keep: Sequence[Union[FeatureTable, FeatureView]],
        entities_to_delete: Sequence[Entity],
        entities_to_keep: Sequence[Entity],
        partial: bool,
    ):
        super().update_infra(
            project,
            tables_to_delete,
            tables_to_keep,
            entities_to_delete,
            entities_to_keep,
            partial,
        )
        print("Launching custom streaming jobs is pretty easy...")

    def materialize_single_feature_view(
        self,
        config: RepoConfig,
        feature_view: FeatureView,
        start_date: datetime,
        end_date: datetime,
        registry: Registry,
        project: str,
        tqdm_builder: Callable[[int], tqdm],
    ) -> None:
        super().materialize_single_feature_view(
            config, feature_view, start_date, end_date, registry, project, tqdm_builder
        )
        print("Launching custom batch jobs is pretty easy...")
project: repo
registry: registry.db
provider: feast_custom_provider.custom_provider.MyCustomProvider
online_store:
    type: sqlite
    path: online_store.db
offline_store:
    type: file
feast apply
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
Launching custom streaming jobs is pretty easy...
PYTHONPATH=$PYTHONPATH:/home/my_user/my_custom_provider feast apply
├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
│
├── staging
│   ├── driver_repo.py
│   └── feature_store.yaml
│
└── production
    ├── driver_repo.py
    └── feature_store.yaml
project: staging
registry: gs://feast-ci-demo-registry/staging/registry.db
provider: gcp
└── production
    ├── common
    │    ├── __init__.py
    │    ├── sources.py
    │    └── entities.py
    ├── ranking
    │    ├── __init__.py
    │    ├── views.py
    │    └── transformations.py
    ├── segmentation
    │    ├── __init__.py
    │    ├── views.py
    │    └── transformations.py
    └── feature_store.yaml
├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
├── staging
│   └── feature_store.yaml
├── production
│   └── feature_store.yaml
└── driver_repo.py
feast -f staging/feature_store.yaml apply

Azure Synapse + Azure SQL (contrib)

Description

MsSQL data sources are Microsoft SQL Server table sources. These can be specified either by a table reference or a SQL query.

Disclaimer

The MsSQL data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a MsSQL source:

from feast.infra.offline_stores.contrib.mssql_offline_store.mssqlserver_source import (
    MsSqlServerSource,
)

driver_hourly_table = "driver_hourly"

driver_source = MsSqlServerSource(
    table_ref=driver_hourly_table,
    event_timestamp_column="datetime",
    created_timestamp_column="created",
)

Snowflake

Description

Snowflake data sources are Snowflake tables or views. These can be specified either by a table reference or a SQL query.

Examples

Using a table reference:

from feast import SnowflakeSource

my_snowflake_source = SnowflakeSource(
    database="FEAST",
    schema="PUBLIC",
    table="FEATURE_TABLE",
)

Using a query:

from feast import SnowflakeSource

my_snowflake_source = SnowflakeSource(
    query="""
    SELECT
        timestamp_column AS "ts",
        "created",
        "f1",
        "f2"
    FROM
        `FEAST.PUBLIC.FEATURE_TABLE`
      """,
)

Supported Types

File

Description

File data sources are files on disk or on S3. Currently only Parquet files are supported.

FileSource is meant for development purposes only and is not optimized for production use.

Example

from feast import FileSource
from feast.data_format import ParquetFormat

parquet_file_source = FileSource(
    file_format=ParquetFormat(),
    path="file:///feast/customer.parquet",
)

Supported Types

Overview

Functionality

In Feast, each batch data source is associated with a corresponding offline store. For example, a SnowflakeSource can only be processed by the Snowflake offline store. Otherwise, the primary difference between batch data sources is the set of supported types. Feast has an internal type system, and aims to support eight primitive types (bytes, string, int32, int64, float32, float64, bool, and timestamp) along with the corresponding array types. However, not every batch data source supports all of these types.
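For instance, an array-typed feature is declared by wrapping a primitive type; the field name below is made up, and not every source in the matrix that follows supports arrays.

```python
from feast import Field
from feast.types import Array, Float32

# Array (list) types wrap a primitive type; check the matrix below before using
# them, since some batch data sources do not support array types.
embedding = Field(name="trip_embedding", dtype=Array(Float32))
```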

Functionality Matrix

Below is a matrix indicating which data sources support which types.

| Type | File | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bytes | yes | yes | yes | yes | yes | yes | yes |
| string | yes | yes | yes | yes | yes | yes | yes |
| int32 | yes | yes | yes | yes | yes | yes | yes |
| int64 | yes | yes | yes | yes | yes | yes | yes |
| float32 | yes | yes | yes | yes | yes | yes | yes |
| float64 | yes | yes | yes | yes | yes | yes | yes |
| bool | yes | yes | yes | yes | yes | yes | yes |
| timestamp | yes | yes | yes | yes | yes | yes | yes |
| array types | yes | yes | no | no | yes | yes | no |

Push

Description

Push sources can be used by multiple feature views. When data is pushed to a push source, Feast propagates the feature values to all the consuming feature views.

Push sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for pushing data to a batch data source such as a data warehouse table. When using a push source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.

Stream sources

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

  1. Raw events come in (stream 1)

  2. Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)

  3. Write stream 2 values to an offline store as a historical log for training (optional)

  4. Write stream 2 values to an online store for low latency feature serving

  5. Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance

Feast allows users to push features previously registered in a feature view to the online store for fresher features. It also allows users to push batches of stream data to the offline store by specifying that the push be directed to the offline store. This will push the data to the offline store declared in the repository configuration used to initialize the feature store.

Example (basic)

Defining a push source

Note that the push schema needs to also include the entity.

from feast import Entity, PushSource, ValueType, BigQuerySource, FeatureView, Feature, Field
from feast.types import Int64

push_source = PushSource(
    name="push_source",
    batch_source=BigQuerySource(table="test.test"),
)

user = Entity(name="user", join_keys=["user_id"])

fv = FeatureView(
    name="feature view",
    entities=[user],
    schema=[Field(name="life_time_value", dtype=Int64)],
    source=push_source,
)

Pushing data

Note that the to parameter is optional and defaults to online but we can specify these options: PushMode.ONLINE, PushMode.OFFLINE, or PushMode.ONLINE_AND_OFFLINE.

from feast import FeatureStore
import pandas as pd
from feast.data_source import PushMode

fs = FeatureStore(...)
feature_data_frame = pd.DataFrame()
fs.push("push_source_name", feature_data_frame, to=PushMode.ONLINE_AND_OFFLINE)

Example (Spark Streaming)

The default option to write features from a stream is to add the Python SDK into your existing PySpark pipeline.

from feast import FeatureStore

store = FeatureStore(...)

spark = SparkSession.builder.getOrCreate()

streamingDF = spark.readStream.format(...).load()

def feast_writer(spark_df):
    pandas_df = spark_df.to_pandas()
    store.push("driver_hourly_stats", pandas_df)

streamingDF.writeStream.foreachBatch(feast_writer).start()

Redshift

Description

Redshift data sources are Redshift tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Examples

Using a table name:

from feast import RedshiftSource

my_redshift_source = RedshiftSource(
    table="redshift_table",
)

Using a query:

from feast import RedshiftSource

my_redshift_source = RedshiftSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM redshift_table",
)

Supported Types

BigQuery

Description

BigQuery data sources are BigQuery tables or views. These can be specified either by a table reference or a SQL query. However, no performance guarantees can be provided for SQL query-based sources, so table references are recommended.

Examples

Using a table reference:

from feast import BigQuerySource

my_bigquery_source = BigQuerySource(
    table_ref="gcp_project:bq_dataset.bq_table",
)

Using a query:

from feast import BigQuerySource

BigQuerySource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM `my_project.my_dataset.my_features`",
)

Supported Types

PostgreSQL (contrib)

Description

PostgreSQL data sources are PostgreSQL tables or views. These can be specified either by a table reference or a SQL query.

Disclaimer

The PostgreSQL data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a Postgres source:

from feast.infra.offline_stores.contrib.postgres_offline_store.postgres_source import (
    PostgreSQLSource,
)

driver_stats_source = PostgreSQLSource(
    name="feast_driver_hourly_stats",
    query="SELECT * FROM feast_driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

Supported Types

Kafka

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Description

Kafka sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kafka streams to a batch data source such as a data warehouse table. When using a Kafka source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.

Stream sources

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

  1. Raw events come in (stream 1)

  2. Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)

  3. Write stream 2 values to an offline store as a historical log for training (optional)

  4. Write stream 2 values to an online store for low latency feature serving

  5. Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance

Example

Defining a Kafka source

Note that the Kafka source has a batch source.

from datetime import timedelta

from feast import Field, FileSource, KafkaSource, stream_feature_view
from feast.data_format import JsonFormat
from feast.types import Float32

driver_stats_batch_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_stream_source = KafkaSource(
    name="driver_stats_stream",
    kafka_bootstrap_servers="localhost:9092",
    topic="drivers",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch_source,
    message_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double, acc_rate double, created timestamp"
    ),
    watermark_delay_threshold=timedelta(minutes=5),
)

Using the Kafka source in a stream feature view

The Kafka source can be used in a stream feature view.

@stream_feature_view(
    entities=[driver],
    ttl=timedelta(seconds=8640000000),
    mode="spark",
    schema=[
        Field(name="conv_percentage", dtype=Float32),
        Field(name="acc_percentage", dtype=Float32),
    ],
    timestamp_field="event_timestamp",
    online=True,
    source=driver_stats_stream_source,
)
def driver_hourly_stats_stream(df: DataFrame):
    from pyspark.sql.functions import col

    return (
        df.withColumn("conv_percentage", col("conv_rate") * 100.0)
        .withColumn("acc_percentage", col("acc_rate") * 100.0)
        .drop("conv_rate", "acc_rate")
    )

Ingesting data

Trino (contrib)

Description

Trino data sources are Trino tables or views. These can be specified either by a table reference or a SQL query.

Disclaimer

The Trino data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Defining a Trino source:

from feast.infra.offline_stores.contrib.trino_offline_store.trino_source import (
    TrinoSource,
)

driver_hourly_stats = TrinoSource(
    event_timestamp_column="event_timestamp",
    table_ref="feast.driver_stats",
    created_timestamp_column="created",
)

Supported Types

Kinesis

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Description

Kinesis sources must have a batch source specified. The batch source will be used for retrieving historical features. Thus users are also responsible for writing data from their Kinesis streams to a batch data source such as a data warehouse table. When using a Kinesis source as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the feature view definition explicitly.

Stream sources

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

  1. Raw events come in (stream 1)

  2. Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)

  3. Write stream 2 values to an offline store as a historical log for training (optional)

  4. Write stream 2 values to an online store for low latency feature serving

  5. Periodically materialize feature values from the offline store into the online store for decreased training-serving skew and improved model performance

Example

Defining a Kinesis source

Note that the Kinesis source has a batch source.

from datetime import timedelta

from feast import Field, FileSource, KinesisSource, stream_feature_view
from feast.data_format import JsonFormat
from feast.types import Float32

driver_stats_batch_source = FileSource(
    name="driver_stats_source",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_stream_source = KinesisSource(
    name="driver_stats_stream",
    stream_name="drivers",
    timestamp_field="event_timestamp",
    batch_source=driver_stats_batch_source,
    record_format=JsonFormat(
        schema_json="driver_id integer, event_timestamp timestamp, conv_rate double, acc_rate double, created timestamp"
    ),
    watermark_delay_threshold=timedelta(minutes=5),
)

Using the Kinesis source in a stream feature view

The Kinesis source can be used in a stream feature view.

@stream_feature_view(
    entities=[driver],
    ttl=timedelta(seconds=8640000000),
    mode="spark",
    schema=[
        Field(name="conv_percentage", dtype=Float32),
        Field(name="acc_percentage", dtype=Float32),
    ],
    timestamp_field="event_timestamp",
    online=True,
    source=driver_stats_stream_source,
)
def driver_hourly_stats_stream(df: DataFrame):
    from pyspark.sql.functions import col

    return (
        df.withColumn("conv_percentage", col("conv_rate") * 100.0)
        .withColumn("acc_percentage", col("acc_rate") * 100.0)
        .drop("conv_rate", "acc_rate")
    )

Ingesting data

Spark (contrib)

Description

Spark data sources are tables or files that can be loaded from some Spark store (e.g. Hive or in-memory). They can also be specified by a SQL query.

Disclaimer

The Spark data source does not achieve full test coverage. Please do not assume complete stability.

Examples

Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    table="FEATURE_TABLE",
)

Using a query:

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM spark_table",
)

Using a file reference:

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

Supported Types

Offline stores

Redshift

Description

  • All joins happen within Redshift.

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Redshift temporarily in order to complete join operations.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[aws]'. You can get started by then running feast init -t aws.

Example

Functionality Matrix

Below is a matrix indicating which functionality is supported by RedshiftRetrievalJob.

Permissions

Feast requires the following permissions in order to execute commands for Redshift offline store:

The following inline policy can be used to grant Feast the necessary permissions:

The following inline policy can be used to grant Redshift necessary permissions to access S3:

While the following trust relationship is necessary to make sure that Redshift, and only Redshift can assume this role:

Overview

Functionality

Here are the methods exposed by the OfflineStore interface, along with the core functionality supported by the method:

  • get_historical_features: point-in-time correct join to retrieve historical features

  • pull_latest_from_table_or_query: retrieve latest feature values for materialization into the online store

  • pull_all_from_table_or_query: retrieve a saved dataset

  • offline_write_batch: persist dataframes to the offline store, primarily for push sources

  • write_logged_features: persist logged features to the offline store, for feature logging

The first three of these methods all return a RetrievalJob specific to an offline store, such as a SnowflakeRetrievalJob. Here is a list of the functionality supported by RetrievalJobs (a short usage sketch follows the list):

  • export to dataframe

  • export to arrow table

  • export to arrow batches (to handle large datasets in memory)

  • export to SQL

  • export to data lake (S3, GCS, etc.)

  • export to data warehouse

  • export as Spark dataframe

  • local execution of Python-based on-demand transforms

  • remote execution of Python-based on-demand transforms

  • persist results in the offline store

  • preview the query plan before execution (RetrievalJobs are lazily executed)

  • read partitioned data
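A minimal sketch of the most common RetrievalJob calls, assuming an entity dataframe and registered feature views (names are placeholders):

```python
import pandas as pd

from feast import FeatureStore


def build_training_data(store: FeatureStore, entity_df: pd.DataFrame):
    # RetrievalJobs are lazily executed: nothing runs until an export method is called.
    job = store.get_historical_features(
        entity_df=entity_df,
        features=["driver_hourly_stats:conv_rate"],
    )

    training_df = job.to_df()      # export to a Pandas dataframe
    arrow_table = job.to_arrow()   # export to an Arrow table

    if job.supports_remote_storage_export():
        # Export to a data lake (e.g. S3) when the offline store supports it.
        file_uris = job.to_remote_storage()

    return training_df
```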

Functionality Matrix

Below is a matrix indicating which offline stores support which methods.

Below is a matrix indicating which RetrievalJobs support what functionality.

Spark (contrib)

Description

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be converted to a Spark dataframe and processed as a temporary view.

Disclaimer

The Spark offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[spark]'. You can get started by then running feast init -t spark.

Example

Functionality Matrix

Below is a matrix indicating which functionality is supported by SparkRetrievalJob.

PostgreSQL (contrib)

Description

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Postgres as a table in order to complete join operations.

Disclaimer

The PostgreSQL offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[postgres]'. You can get started by then running feast init -t postgres.

Example

Functionality Matrix

Below is a matrix indicating which functionality is supported by PostgreSQLRetrievalJob.

BigQuery

Description

  • All joins happen within BigQuery.

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to BigQuery as a table (marked for expiration) in order to complete join operations.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[gcp]'. You can get started by then running feast init -t gcp.

Example

Functionality Matrix

Below is a matrix indicating which functionality is supported by BigQueryRetrievalJob.

Trino (contrib)

Description

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Trino as a table in order to complete join operations.

Disclaimer

The Trino offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[trino]'. You can then run feast init, then swap out feature_store.yaml with the below example to connect to Trino.

Example

Functionality Matrix

Below is a matrix indicating which functionality is supported by TrinoRetrievalJob.

Azure Synapse + Azure SQL (contrib)

Description

  • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe.

Getting started

Disclaimer

The MsSQL offline store does not achieve full test coverage. Please do not assume complete stability.

Example

Functionality Matrix

Below is a matrix indicating which functionality is supported by MsSqlServerRetrievalJob.

Run in Google Colab
View Source on Github
View Source on Github

Be careful about how Snowflake handles table and column name conventions. In particular, you can read more about quote identifiers here.

The full set of configuration options is available here.

Snowflake data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.

The full set of configuration options is available here.

File data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

For more details on the Feast type system, see here.

There are currently four core batch data source implementations: FileSource, BigQuerySource, SnowflakeSource, and RedshiftSource. There are several additional implementations contributed by the Feast community (PostgreSQLSource, SparkSource, and TrinoSource), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific data source can be found here.

Push sources allow feature values to be pushed to the online store and offline store in real time. This allows fresh feature values to be made available to applications. Push sources supersede the FeatureStore.write_to_online_store method.

See also the Python feature server for instructions on how to push data to a deployed feature server.
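As a rough sketch, pushing to a locally running feature server could look like the snippet below; the payload layout here is an assumption and should be checked against the feature server documentation.

```python
import requests

# Assumes a Python feature server started locally with `feast serve` (default port 6566).
# The push source name, columns, and payload schema below are illustrative only.
payload = {
    "push_source_name": "push_source_name",
    "df": {
        "driver_id": [1001],
        "event_timestamp": ["2022-05-13 10:59:42"],
        "conv_rate": [1.0],
    },
    "to": "online",
}
requests.post("http://localhost:6566/push", json=payload, timeout=10)
```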

This can also be used under the hood by a contrib stream processor (see Tutorial: Building streaming features).

The full set of configuration options is available here.

Redshift data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.

The full set of configuration options is available here.

BigQuery data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

The full set of configuration options is available here.

PostgreSQL data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Kafka sources allow users to register Kafka streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kafka. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store through FeatureStore.write_to_online_store. An example of how to launch such a job with Spark can be found here. Feast also provides functionality to write to the offline store using the write_to_offline_store functionality.

See here for an example of how to ingest data from a Kafka source into Feast.
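For reference, a minimal sketch of what such an ingestion job can do inside a Spark foreachBatch callback, writing each micro-batch to the online store for an already-registered feature view (the feature view name is a placeholder):

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")


def write_batch_to_online_store(spark_df, epoch_id):
    # foreachBatch hands each micro-batch to this function; convert it to Pandas
    # and write the rows to the online store for a registered feature view.
    pandas_df = spark_df.toPandas()
    store.write_to_online_store(feature_view_name="driver_hourly_stats", df=pandas_df)


# `streaming_df` is assumed to be a Spark structured streaming DataFrame reading
# from Kafka, as in the Spark Streaming example shown earlier on this page:
# streaming_df.writeStream.foreachBatch(write_batch_to_online_store).start()
```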

The full set of configuration options is available here.

Trino data sources support all eight primitive types, but currently do not support array types. For a comparison against other batch data sources, please see here.

Kinesis sources allow users to register Kinesis streams as data sources. Feast currently does not launch or monitor jobs to ingest data from Kinesis. Users are responsible for launching and monitoring their own ingestion jobs, which should write feature values to the online store through FeatureStore.write_to_online_store. An example of how to launch such a job with Spark to ingest from Kafka can be found here; by using a different plugin, the example can be adapted to Kinesis. Feast also provides functionality to write to the offline store using the write_to_offline_store functionality.

See here for an example of how to ingest data from a Kafka source into Feast. The approach used in the tutorial can be easily adapted to work for Kinesis as well.

The full set of configuration options is available here.

Spark data sources support all eight primitive types and their corresponding array types. For a comparison against other batch data sources, please see here.

Please see here for a conceptual explanation of offline stores.

The Redshift offline store provides support for reading RedshiftSources.

The full set of configuration options is available in RedshiftOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redshift offline store.


To compare this set of functionality against other offline stores, please see the full functionality matrix.

In addition to this, the Redshift offline store requires an IAM role that will be used by Redshift itself to interact with S3. More concretely, Redshift has to use this IAM role to run UNLOAD and COPY commands. Once created, this IAM role needs to be configured in the feature_store.yaml file as offline_store: iam_role.

There are currently four core offline store implementations: FileOfflineStore, BigQueryOfflineStore, SnowflakeOfflineStore, and RedshiftOfflineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOfflineStore, SparkOfflineStore, and TrinoOfflineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific offline store, such as how to configure it in a feature_store.yaml, can be found here.


The Spark offline store provides support for reading SparkSources.

The full set of configuration options is available in SparkOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Spark offline store.


To compare this set of functionality against other offline stores, please see the full functionality matrix.

The PostgreSQL offline store provides support for reading PostgreSQLSources.

Note that sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional parameters. The full set of configuration options is available in PostgreSQLOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the PostgreSQL offline store.


To compare this set of functionality against other offline stores, please see the full functionality matrix.

The BigQuery offline store provides support for reading BigQuerySources.

The full set of configuration options is available in BigQueryOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the BigQuery offline store.


*See here for details on proposed solutions for enabling the BigQuery offline store to understand tables that use _PARTITIONTIME as the partition column.

To compare this set of functionality against other offline stores, please see the full functionality matrix.

The Trino offline store provides support for reading TrinoSources.

The full set of configuration options is available in TrinoOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Trino offline store.


To compare this set of functionality against other offline stores, please see the full functionality matrix.

The MsSQL offline store provides support for reading MsSqlServerSources. Specifically, it is developed to read from Azure Synapse and Azure SQL databases on Microsoft Azure.

In order to use this offline store, you'll need to run pip install 'feast[azure]'. You can get started by then following this guide.

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the MsSQL offline store.


To compare this set of functionality against other offline stores, please see the full functionality matrix.

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: feast-cluster
  database: feast-database
  user: redshift-user
  s3_staging_location: s3://feast-bucket/redshift
  iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role

| | Redshift |
| --- | --- |
| get_historical_features (point-in-time correct join) | yes |
| pull_latest_from_table_or_query (retrieve latest feature values) | yes |
| pull_all_from_table_or_query (retrieve a saved dataset) | yes |
| offline_write_batch (persist dataframes to offline store) | yes |
| write_logged_features (persist logged features to offline store) | yes |

| | Redshift |
| --- | --- |
| export to dataframe | yes |
| export to arrow table | yes |
| export to arrow batches | yes |
| export to SQL | yes |
| export to data lake (S3, GCS, etc.) | no |
| export to data warehouse | yes |
| export as Spark dataframe | no |
| local execution of Python-based on-demand transforms | yes |
| remote execution of Python-based on-demand transforms | no |
| persist results in the offline store | yes |
| preview the query plan before execution | yes |
| read partitioned data | yes |

| Command | Permissions | Resources |
| --- | --- | --- |
| Apply | redshift-data:DescribeTable, redshift:GetClusterCredentials | arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
| Materialize | redshift-data:ExecuteStatement | arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
| Materialize | redshift-data:DescribeStatement | * |
| Materialize | s3:ListBucket, s3:GetObject, s3:DeleteObject | arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/* |
| Get Historical Features | redshift-data:ExecuteStatement, redshift:GetClusterCredentials | arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
| Get Historical Features | redshift-data:DescribeStatement | * |
| Get Historical Features | s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject | arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/* |

{
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<bucket_name>/*",
                "arn:aws:s3:::<bucket_name>"
            ]
        },
        {
            "Action": [
                "redshift-data:DescribeTable",
                "redshift:GetClusterCredentials",
                "redshift-data:ExecuteStatement"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>",
                "arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>",
                "arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>"
            ]
        },
        {
            "Action": [
                "redshift-data:DescribeStatement"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}
{
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::feast-integration-tests",
                "arn:aws:s3:::feast-integration-tests/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

| | File | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino |
| --- | --- | --- | --- | --- | --- | --- | --- |
| get_historical_features | yes | yes | yes | yes | yes | yes | yes |
| pull_latest_from_table_or_query | yes | yes | yes | yes | yes | yes | yes |
| pull_all_from_table_or_query | yes | yes | yes | yes | yes | yes | yes |
| offline_write_batch | yes | yes | yes | yes | no | no | no |
| write_logged_features | yes | yes | yes | yes | no | no | no |

| | File | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino |
| --- | --- | --- | --- | --- | --- | --- | --- |
| export to dataframe | yes | yes | yes | yes | yes | yes | yes |
| export to arrow table | yes | yes | yes | yes | yes | yes | yes |
| export to arrow batches | no | no | no | yes | no | no | no |
| export to SQL | no | yes | no | yes | yes | no | yes |
| export to data lake (S3, GCS, etc.) | no | no | yes | no | yes | no | no |
| export to data warehouse | no | yes | yes | yes | yes | no | no |
| export as Spark dataframe | no | no | no | no | no | yes | no |
| local execution of Python-based on-demand transforms | yes | yes | yes | yes | yes | no | yes |
| remote execution of Python-based on-demand transforms | no | no | no | no | no | no | no |
| persist results in the offline store | yes | yes | yes | yes | yes | yes | no |
| preview the query plan before execution | yes | yes | yes | yes | yes | yes | yes |
| read partitioned data | yes | yes | yes | yes | yes | yes | yes |

feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
        spark.ui.enabled: "false"
        spark.eventLog.enabled: "false"
        spark.sql.catalogImplementation: "hive"
        spark.sql.parser.quotedRegexColumnNames: "true"
        spark.sql.session.timeZone: "UTC"
online_store:
    path: data/online_store.db

get_historical_features (point-in-time correct join)

yes

pull_latest_from_table_or_query (retrieve latest feature values)

yes

pull_all_from_table_or_query (retrieve a saved dataset)

yes

offline_write_batch (persist dataframes to offline store)

no

write_logged_features (persist logged features to offline store)

no

export to dataframe

yes

export to arrow table

yes

export to arrow batches

no

export to SQL

no

export to data lake (S3, GCS, etc.)

no

export to data warehouse

no

export as Spark dataframe

yes

local execution of Python-based on-demand transforms

no

remote execution of Python-based on-demand transforms

no

persist results in the offline store

yes

preview the query plan before execution

yes

read partitioned data

yes

feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: postgres
  host: DB_HOST
  port: DB_PORT
  database: DB_NAME
  db_schema: DB_SCHEMA
  user: DB_USERNAME
  password: DB_PASSWORD
  sslmode: verify-ca
  sslkey_path: /path/to/client-key.pem
  sslcert_path: /path/to/client-cert.pem
  sslrootcert_path: /path/to/server-ca.pem
online_store:
    path: data/online_store.db

get_historical_features (point-in-time correct join)

yes

pull_latest_from_table_or_query (retrieve latest feature values)

yes

pull_all_from_table_or_query (retrieve a saved dataset)

yes

offline_write_batch (persist dataframes to offline store)

no

write_logged_features (persist logged features to offline store)

no

export to dataframe

yes

export to arrow table

yes

export to arrow batches

no

export to SQL

yes

export to data lake (S3, GCS, etc.)

yes

export to data warehouse

yes

export as Spark dataframe

no

local execution of Python-based on-demand transforms

yes

remote execution of Python-based on-demand transforms

no

persist results in the offline store

yes

preview the query plan before execution

yes

read partitioned data

yes

feature_store.yaml
project: my_feature_repo
registry: gs://my-bucket/data/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_bq_dataset

get_historical_features (point-in-time correct join)

yes

pull_latest_from_table_or_query (retrieve latest feature values)

yes

pull_all_from_table_or_query (retrieve a saved dataset)

yes

offline_write_batch (persist dataframes to offline store)

yes

write_logged_features (persist logged features to offline store)

yes

export to dataframe

yes

export to arrow table

yes

export to arrow batches

no

export to SQL

yes

export to data lake (S3, GCS, etc.)

no

export to data warehouse

yes

export as Spark dataframe

no

local execution of Python-based on-demand transforms

yes

remote execution of Python-based on-demand transforms

no

persist results in the offline store

yes

preview the query plan before execution

yes

read partitioned data*

partial

feature_store.yaml
project: feature_repo
registry: data/registry.db
provider: local
offline_store:
    type: feast_trino.trino.TrinoOfflineStore
    host: localhost
    port: 8080
    catalog: memory
    connector:
        type: memory
online_store:
    path: data/online_store.db

get_historical_features (point-in-time correct join)

yes

pull_latest_from_table_or_query (retrieve latest feature values)

yes

pull_all_from_table_or_query (retrieve a saved dataset)

yes

offline_write_batch (persist dataframes to offline store)

no

write_logged_features (persist logged features to offline store)

no

export to dataframe

yes

export to arrow table

yes

export to arrow batches

no

export to SQL

yes

export to data lake (S3, GCS, etc.)

no

export to data warehouse

no

export as Spark dataframe

no

local execution of Python-based on-demand transforms

yes

remote execution of Python-based on-demand transforms

no

persist results in the offline store

no

preview the query plan before execution

yes

read partitioned data

yes

feature_store.yaml
registry:
  registry_store_type: AzureRegistryStore
  path: ${REGISTRY_PATH} # Environment Variable
project: production
provider: azure
online_store:
    type: redis
    connection_string: ${REDIS_CONN} # Environment Variable
offline_store:
    type: mssql
    connection_string: ${SQL_CONN}  # Environment Variable

get_historical_features (point-in-time correct join)

yes

pull_latest_from_table_or_query (retrieve latest feature values)

yes

pull_all_from_table_or_query (retrieve a saved dataset)

yes

offline_write_batch (persist dataframes to offline store)

no

write_logged_features (persist logged features to offline store)

no

export to dataframe

yes

export to arrow table

yes

export to arrow batches

no

export to SQL

no

export to data lake (S3, GCS, etc.)

no

export to data warehouse

no

local execution of Python-based on-demand transforms

no

remote execution of Python-based on-demand transforms

no

persist results in the offline store

yes

Google Cloud Platform

Description

  • Offline Store: Uses the BigQuery offline store by default. Also supports File as the offline store.

  • Online Store: Uses the Datastore online store by default. Also supports Sqlite as an online store.

Getting started

In order to use this provider, you'll need to run pip install 'feast[gcp]'. You can then get started by running feast init -t gcp.

Example

feature_store.yaml
project: my_feature_repo
registry: gs://my-bucket/data/registry.db
provider: gcp

Permissions

Apply (BigQuery source): bigquery.jobs.create, bigquery.readsessions.create, bigquery.readsessions.getData. Recommended role: roles/bigquery.user

Apply (Datastore destination): datastore.entities.allocateIds, datastore.entities.create, datastore.entities.delete, datastore.entities.get, datastore.entities.list, datastore.entities.update. Recommended role: roles/datastore.owner

Materialize (BigQuery source): bigquery.jobs.create. Recommended role: roles/bigquery.user

Materialize (Datastore destination): datastore.entities.allocateIds, datastore.entities.create, datastore.entities.delete, datastore.entities.get, datastore.entities.list, datastore.entities.update, datastore.databases.get. Recommended role: roles/datastore.owner

Get Online Features (Datastore): datastore.entities.get. Recommended role: roles/datastore.user

Get Historical Features (BigQuery source): bigquery.datasets.get, bigquery.tables.get, bigquery.tables.create, bigquery.tables.updateData, bigquery.tables.update, bigquery.tables.delete, bigquery.tables.getData. Recommended role: roles/bigquery.dataEditor

Local

Description

  • Offline Store: Uses the File offline store by default. Also supports BigQuery as the offline store.

  • Online Store: Uses the Sqlite online store by default. Also supports Redis and Datastore as online stores.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local

Amazon Web Services

Description

  • Offline Store: Uses the Redshift offline store by default. Also supports File as the offline store.

  • Online Store: Uses the DynamoDB online store by default. Also supports Sqlite as an online store.

Getting started

In order to use this provider, you'll need to run pip install 'feast[aws, snowflake]' (for Snowflake) or pip install 'feast[aws]' (for Redshift).

You can get started by then running feast init -t snowflake or feast init -t aws.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: feast-cluster
  database: feast-database
  user: redshift-user
  s3_staging_location: s3://feast-bucket/redshift
  iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role

Overview

Functionality

Here are the methods exposed by the OnlineStore interface, along with the core functionality supported by each method (a simplified, illustrative sketch follows the list below):

  • online_write_batch: write feature values to the online store

  • online_read: read feature values from the online store

  • update: update infrastructure (e.g. tables) in the online store

  • teardown: teardown infrastructure (e.g. tables) in the online store

  • plan: generate a plan of infrastructure changes based on feature repo changes
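
To make these responsibilities concrete, here is a minimal, illustrative in-memory sketch. It is not the real OnlineStore base class from the Feast SDK, whose methods additionally take the repo config, feature view objects, and protobuf-encoded values; only the method names and their roles mirror the list above.

from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

class InMemoryOnlineStoreSketch:
    """Illustrative only: mirrors the interface responsibilities listed above."""

    def __init__(self) -> None:
        # table name -> entity key -> (event timestamp, feature name -> value)
        self._data: Dict[str, Dict[str, Tuple[datetime, Dict[str, Any]]]] = {}

    def update(self, tables_to_keep: List[str], tables_to_delete: List[str]) -> None:
        # Update infrastructure: create missing "tables", drop removed ones.
        for table in tables_to_keep:
            self._data.setdefault(table, {})
        for table in tables_to_delete:
            self._data.pop(table, None)

    def online_write_batch(
        self, table: str, rows: List[Tuple[str, Dict[str, Any], datetime]]
    ) -> None:
        # Write feature values, keeping only the latest value per entity key.
        target = self._data.setdefault(table, {})
        for entity_key, features, event_ts in rows:
            current = target.get(entity_key)
            if current is None or current[0] <= event_ts:
                target[entity_key] = (event_ts, features)

    def online_read(
        self, table: str, entity_keys: List[str]
    ) -> List[Optional[Dict[str, Any]]]:
        # Read the latest feature values for the requested entity keys.
        target = self._data.get(table, {})
        return [target[k][1] if k in target else None for k in entity_keys]

    def teardown(self, tables: List[str]) -> None:
        # Tear down all infrastructure for the given tables.
        for table in tables:
            self._data.pop(table, None)

store = InMemoryOnlineStoreSketch()
store.update(tables_to_keep=["driver_hourly_stats"], tables_to_delete=[])
store.online_write_batch("driver_hourly_stats", [("1001", {"conv_rate": 0.5}, datetime.utcnow())])
print(store.online_read("driver_hourly_stats", ["1001", "1002"]))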

There is also additional functionality not properly captured by these interface methods:

  • support for on-demand transforms

  • readable by Python SDK

  • readable by Java

  • readable by Go

  • support for entityless feature views

  • support for concurrent writing to the same key

  • support for ttl (time to live) at retrieval

  • support for deleting expired data

Finally, there are multiple data models for storing the features in the online store. For example, features could be:

  • collocated by feature view

  • collocated by feature service

  • collocated by entity key

Functionality Matrix

Below is a matrix indicating which online stores support what functionality.

(columns: Sqlite / Redis / DynamoDB / Snowflake / Datastore / Postgres / Hbase / Cassandra)

write feature values to the online store: yes / yes / yes / yes / yes / yes / yes / yes

read feature values from the online store: yes / yes / yes / yes / yes / yes / yes / yes

update infrastructure (e.g. tables) in the online store: yes / yes / yes / yes / yes / yes / yes / yes

teardown infrastructure (e.g. tables) in the online store: yes / yes / yes / yes / yes / yes / yes / yes

generate a plan of infrastructure changes: yes / no / no / no / no / no / no / yes

support for on-demand transforms: yes / yes / yes / yes / yes / yes / yes / yes

readable by Python SDK: yes / yes / yes / yes / yes / yes / yes / yes

readable by Java: no / yes / no / no / no / no / no / no

readable by Go: yes / yes / no / no / no / no / no / no

support for entityless feature views: yes / yes / yes / yes / yes / yes / yes / yes

support for concurrent writing to the same key: no / yes / no / no / no / no / no / no

support for ttl (time to live) at retrieval: no / yes / no / no / no / no / no / no

support for deleting expired data: no / yes / no / no / no / no / no / no

collocated by feature view: yes / no / yes / yes / yes / yes / yes / yes

collocated by feature service: no / no / no / no / no / no / no / no

collocated by entity key: no / yes / no / no / no / no / no / no

Online stores

Redis

Description

  • Both Redis and Redis Cluster are supported.

Getting started

In order to use this online store, you'll need to install the redis extra (along with the dependency needed for the offline store of choice). E.g.

  • pip install 'feast[gcp, redis]'

  • pip install 'feast[snowflake, redis]'

  • pip install 'feast[aws, redis]'

  • pip install 'feast[azure, redis]'

You can get started by using any of the other templates (e.g. feast init -t gcp or feast init -t snowflake or feast init -t aws), and then swapping in Redis as the online store as seen below in the examples.

Examples

Connecting to a single Redis instance:

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"

Connecting to a Redis Cluster with SSL enabled and password authentication:

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  redis_type: redis_cluster
  connection_string: "redis1:6379,redis2:6379,ssl=true,password=my_password"

Functionality Matrix

Redis

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

no

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

yes

readable by Go

yes

support for entityless feature views

yes

support for concurrent writing to the same key

yes

support for ttl (time to live) at retrieval

yes

support for deleting expired data

yes

collocated by feature view

no

collocated by feature service

no

collocated by entity key

yes

SQLite

Description

  • All feature values are stored in an on-disk SQLite database

  • Only the latest feature values are persisted

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: sqlite
  path: data/online_store.db

Functionality Matrix

Sqlite

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

yes

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

no

readable by Go

yes

support for entityless feature views

yes

support for concurrent writing to the same key

no

support for ttl (time to live) at retrieval

no

support for deleting expired data

no

collocated by feature view

yes

collocated by feature service

no

collocated by entity key

no

Datastore

Description

Getting started

In order to use this online store, you'll need to run pip install 'feast[gcp]'. You can then get started with the command feast init REPO_NAME -t gcp.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: gcp
online_store:
  type: datastore
  project_id: my_gcp_project
  namespace: my_datastore_namespace

Functionality Matrix

Datastore

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

no

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

no

readable by Go

no

support for entityless feature views

yes

support for concurrent writing to the same key

no

support for ttl (time to live) at retrieval

no

support for deleting expired data

no

collocated by feature view

yes

collocated by feature service

no

collocated by entity key

no

Snowflake

Description

  • Only the latest feature values are persisted

The data model for using a Snowflake Transient Table as an online store follows a tall format (one row per feature):

  • "entity_feature_key" (BINARY) -- unique key used when reading specific feature_view x entity combination

  • "entity_key" (BINARY) -- repeated key currently unused for reading entity_combination

  • "feature_name" (VARCHAR)

  • "value" (BINARY)

  • "event_ts" (TIMESTAMP)

  • "created_ts" (TIMESTAMP)

(This model may be subject to change when Snowflake Hybrid Tables are released)

Getting started

In order to use this online store, you'll need to run pip install 'feast[snowflake]'. You can then get started with the command feast init REPO_NAME -t snowflake.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: snowflake.online
    account: SNOWFLAKE_DEPLOYMENT_URL
    user: SNOWFLAKE_USER
    password: SNOWFLAKE_PASSWORD
    role: SNOWFLAKE_ROLE
    warehouse: SNOWFLAKE_WAREHOUSE
    database: SNOWFLAKE_DATABASE

Tags KWARGs Actions:

"snowflake-online-store/online_path": Adding the "snowflake-online-store/online_path" key to a FeatureView tags parameter allows you to choose the online table path for the online serving table (ex. "{database}"."{schema}").

example_config.py
driver_stats_fv = FeatureView(
    ...
    tags={"snowflake-online-store/online_path": '"FEAST"."ONLINE"'},
)

Functionality Matrix

Snowflake

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

no

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

no

readable by Go

no

support for entityless feature views

yes

support for concurrent writing to the same key

no

support for ttl (time to live) at retrieval

no

support for deleting expired data

no

collocated by feature view

yes

collocated by feature service

no

collocated by entity key

no

DynamoDB

Description

Getting started

In order to use this online store, you'll need to run pip install 'feast[aws]'. You can then get started with the command feast init REPO_NAME -t aws.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2

Permissions

Feast requires the following permissions in order to execute commands for DynamoDB online store:

Apply: dynamodb:CreateTable, dynamodb:DescribeTable, dynamodb:DeleteTable on arn:aws:dynamodb:<region>:<account_id>:table/*

Materialize: dynamodb:BatchWriteItem on arn:aws:dynamodb:<region>:<account_id>:table/*

Get Online Features: dynamodb:BatchGetItem on arn:aws:dynamodb:<region>:<account_id>:table/*

The following inline policy can be used to grant Feast the necessary permissions:

{
    "Statement": [
        {
            "Action": [
                "dynamodb:CreateTable",
                "dynamodb:DescribeTable",
                "dynamodb:DeleteTable",
                "dynamodb:BatchWriteItem",
                "dynamodb:BatchGetItem"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:dynamodb:<region>:<account_id>:table/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Functionality Matrix

DynamoDB

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

no

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

no

readable by Go

no

support for entityless feature views

yes

support for concurrent writing to the same key

no

support for ttl (time to live) at retrieval

no

support for deleting expired data

no

collocated by feature view

yes

collocated by feature service

no

collocated by entity key

no

PostgreSQL (contrib)

Description

The PostgreSQL online store provides support for materializing feature values into a PostgreSQL database for serving online features.

  • Only the latest feature values are persisted

  • sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional

Getting started

In order to use this online store, you'll need to run pip install 'feast[postgres]'. You can get started by then running feast init -t postgres.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: postgres
    host: DB_HOST
    port: DB_PORT
    database: DB_NAME
    db_schema: DB_SCHEMA
    user: DB_USERNAME
    password: DB_PASSWORD
    sslmode: verify-ca
    sslkey_path: /path/to/client-key.pem
    sslcert_path: /path/to/client-cert.pem
    sslrootcert_path: /path/to/server-ca.pem

Functionality Matrix

Postgres

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

no

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

no

readable by Go

no

support for entityless feature views

yes

support for concurrent writing to the same key

no

support for ttl (time to live) at retrieval

no

support for deleting expired data

no

collocated by feature view

yes

collocated by feature service

no

collocated by entity key

no

Cassandra + Astra DB (contrib)

Description

The [Cassandra / Astra DB] online store provides support for materializing feature values into an Apache Cassandra / Astra DB database for online features.

  • The whole project is contained within a Cassandra keyspace

  • Each feature view is mapped one-to-one to a specific Cassandra table

  • This implementation inherits all strengths of Cassandra such as high availability, fault-tolerance, and data distribution

Getting started

In order to use this online store, you'll need to run pip install 'feast[cassandra]'. You can then get started with the command feast init REPO_NAME -t cassandra.

Example (Cassandra)

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: cassandra
    hosts:
        - 192.168.1.1
        - 192.168.1.2
        - 192.168.1.3
    keyspace: KeyspaceName
    port: 9042                                                              # optional
    username: user                                                          # optional
    password: secret                                                        # optional
    protocol_version: 5                                                     # optional
    load_balancing:                                                         # optional
        local_dc: 'datacenter1'                                             # optional
        load_balancing_policy: 'TokenAwarePolicy(DCAwareRoundRobinPolicy)'  # optional

Example (Astra DB)

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: cassandra
    secure_bundle_path: /path/to/secure/bundle.zip
    keyspace: KeyspaceName
    username: Client_ID
    password: Client_Secret
    protocol_version: 4                                                     # optional
    load_balancing:                                                         # optional
        local_dc: 'eu-central-1'                                            # optional
        load_balancing_policy: 'TokenAwarePolicy(DCAwareRoundRobinPolicy)'  # optional

Storage specifications can be found at docs/specs/online_store_format.md.

Functionality Matrix

Cassandra

write feature values to the online store

yes

read feature values from the online store

yes

update infrastructure (e.g. tables) in the online store

yes

teardown infrastructure (e.g. tables) in the online store

yes

generate a plan of infrastructure changes

yes

support for on-demand transforms

yes

readable by Python SDK

yes

readable by Java

no

readable by Go

no

support for entityless feature views

yes

support for concurrent writing to the same key

no

support for ttl (time to live) at retrieval

no

support for deleting expired data

no

collocated by feature view

yes

collocated by feature service

no

collocated by entity key

no

Providers

Azure

Description

  • Offline Store: Uses the MsSql offline store by default. Also supports File as the offline store.

  • Online Store: Uses the Redis online store by default. Also supports Sqlite as an online store.

Disclaimer

The Azure provider does not achieve full test coverage. Please do not assume complete stability.

Getting started

Example

feature_store.yaml
registry:
  registry_store_type: AzureRegistryStore
  path: ${REGISTRY_PATH} # Environment Variable
project: production
provider: azure
online_store:
    type: redis
    connection_string: ${REDIS_CONN} # Environment Variable

Snowflake

Description

The engine requires no additional configuration other than your standard Snowflake login and context details. The engine leverages custom Python UDFs (deployed automatically for you) to serialize your offline store data into your online serving tables.

When using all three options together (snowflake.offline, snowflake.engine, and snowflake.online), the entire workflow stays inside Snowflake, combining scale and performance with Snowflake's governance and data security.

Example

feature_store.yaml
...
offline_store:
  type: snowflake.offline
...
batch_engine:
  type: snowflake.engine
  account: snowflake_deployment.us-east-1
  user: user_login
  password: user_password
  role: sysadmin
  warehouse: demo_wh
  database: FEAST

Batch Materialization Engines

.feastignore

Overview

.feastignore
# Ignore virtual environment
venv

# Ignore a specific Python file
scripts/foo.py

# Ignore all Python files directly under scripts directory
scripts/*.py

# Ignore all "foo.py" anywhere under scripts directory
scripts/**/foo.py

The .feastignore file is optional. If the file cannot be found, every Python file in the feature repo directory will be parsed by feast apply.

Feast Ignore Patterns

venv (matches venv/foo.py, venv/a/foo.py): you can specify a path to a specific directory, and everything in that directory will be ignored.

scripts/foo.py (matches scripts/foo.py): you can specify a path to a specific file, and only that file will be ignored.

scripts/*.py (matches scripts/foo.py, scripts/bar.py): you can specify an asterisk (*) anywhere in the expression; an asterisk matches zero or more characters, except "/".

scripts/**/foo.py (matches scripts/foo.py, scripts/a/foo.py, scripts/a/b/foo.py): you can specify a double asterisk (**) anywhere in the expression; a double asterisk matches zero or more directories.

Bytewax

Description

Guide

Kubernetes Authentication

Resource Authentication

To configure secrets, first create them using kubectl:

kubectl create secret generic -n bytewax aws-credentials --from-literal=aws-access-key-id='<access key id>' --from-literal=aws-secret-access-key='<secret access key>'

Then configure them in the batch_engine section of feature_store.yaml:

batch_engine:
  type: bytewax
  namespace: bytewax
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-secret-access-key

Configuration

The Bytewax materialization engine is configured through the feature_store.yaml configuration file:

batch_engine:
  type: bytewax
  namespace: bytewax
  image: bytewax/bytewax-feast:latest

Building a custom Bytewax Docker image

The image configuration directive specifies which container image to use when running the materialization job. To create a custom image based on this container, run the following command:

DOCKER_BUILDKIT=1 docker build . -f ./sdk/python/feast/infra/materialization/contrib/bytewax/Dockerfile -t <image tag>

Once that image is built and pushed to a registry, it can be specified as a part of the batch engine configuration:

batch_engine:
  type: bytewax
  namespace: bytewax
  image: <image tag>

[Alpha] AWS Lambda feature server

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Overview

Deployment

The AWS Lambda feature server is only available to projects using the AwsProvider with registries on S3. It is disabled by default. To enable it, feature_store.yaml must be modified; specifically, the enabled flag must be set to True and an execution_role_name must be specified. For example, after running feast init -t aws, changing the registry to be on S3, and enabling the feature server, the contents of feature_store.yaml should look similar to the following:

project: dev
registry: s3://feast/registries/dev
provider: aws
online_store:
  region: us-west-2
offline_store:
  cluster_id: feast
  region: us-west-2
  user: admin
  database: feast
  s3_staging_location: s3://feast/redshift/tests/staging_location
  iam_role: arn:aws:iam::{aws_account}:role/redshift_s3_access_role
feature_server:
  enabled: True
  execution_role_name: arn:aws:iam::{aws_account}:role/lambda_execution_role

If enabled, the feature server will be deployed during feast apply. After it is deployed, the feast endpoint CLI command will indicate the server's endpoint.

Permissions

Feast requires the following permissions in order to deploy and teardown AWS Lambda feature server:

lambda:CreateFunction, lambda:GetFunction, lambda:DeleteFunction, lambda:AddPermission, lambda:UpdateFunctionConfiguration on arn:aws:lambda:<region>:<account_id>:function:feast-*

ecr:CreateRepository, ecr:DescribeRepositories, ecr:DeleteRepository, ecr:PutImage, ecr:DescribeImages, ecr:BatchDeleteImage, ecr:CompleteLayerUpload, ecr:UploadLayerPart, ecr:InitiateLayerUpload, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:GetRepositoryPolicy, ecr:SetRepositoryPolicy, ecr:GetAuthorizationToken on *

iam:PassRole on arn:aws:iam::<account_id>:role/<lambda-execution-role-name>

apigateway:* on arn:aws:apigateway:*::/apis/*/routes/*/routeresponses, arn:aws:apigateway:*::/apis/*/routes/*/routeresponses/*, arn:aws:apigateway:*::/apis/*/routes/*, arn:aws:apigateway:*::/apis/*/routes, arn:aws:apigateway:*::/apis/*/integrations, arn:aws:apigateway:*::/apis/*/stages/*/routesettings/*, arn:aws:apigateway:*::/apis/*, and arn:aws:apigateway:*::/apis

The following inline policy can be used to grant Feast the necessary permissions:

{
    "Statement": [
        {
            "Action": [
                "lambda:CreateFunction",
                "lambda:GetFunction",
                "lambda:DeleteFunction",
                "lambda:AddPermission",
                "lambda:UpdateFunctionConfiguration"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:lambda:<region>:<account_id>:function:feast-*"
        },
        {
            "Action": [
                "ecr:CreateRepository",
                "ecr:DescribeRepositories",
                "ecr:DeleteRepository",
                "ecr:PutImage",
                "ecr:DescribeImages",
                "ecr:BatchDeleteImage",
                "ecr:CompleteLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:SetRepositoryPolicy",
                "ecr:GetAuthorizationToken"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "iam:PassRole",
            "Effect": "Allow",
            "Resource": "arn:aws:iam::<account_id>:role/<lambda-execution-role-name>"
        },
        {
            "Effect": "Allow",
            "Action": "apigateway:*",
            "Resource": [
                "arn:aws:apigateway:*::/apis/*/routes/*/routeresponses",
                "arn:aws:apigateway:*::/apis/*/routes/*/routeresponses/*",
                "arn:aws:apigateway:*::/apis/*/routes/*",
                "arn:aws:apigateway:*::/apis/*/routes",
                "arn:aws:apigateway:*::/apis/*/integrations",
                "arn:aws:apigateway:*::/apis/*/stages/*/routesettings/*",
                "arn:aws:apigateway:*::/apis/*",
                "arn:aws:apigateway:*::/apis"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Example

After feature_store.yaml has been modified as described in the previous section, it can be deployed as follows:

$ feast apply
10/07/2021 03:57:26 PM INFO:Pulling remote image feastdev/feature-server-python-aws:aws:
10/07/2021 03:57:28 PM INFO:Creating remote ECR repository feast-python-server-key_shark-0_13_1_dev23_gb3c08320:
10/07/2021 03:57:29 PM INFO:Pushing local image to remote 402087665549.dkr.ecr.us-west-2.amazonaws.com/feast-python-server-key_shark-0_13_1_dev23_gb3c08320:0_13_1_dev23_gb3c08320:
10/07/2021 03:58:44 PM INFO:Deploying feature server...
10/07/2021 03:58:45 PM INFO:  Creating AWS Lambda...
10/07/2021 03:58:46 PM INFO:  Creating AWS API Gateway...
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats

$ feast endpoint
10/07/2021 03:59:01 PM INFO:Feature server endpoint: https://hkosgmz4m2.execute-api.us-west-2.amazonaws.com

$ feast materialize-incremental $(date +%Y-%m-%d)
Materializing 1 feature views to 2021-10-06 17:00:00-07:00 into the dynamodb online store.

driver_hourly_stats from 2020-10-08 23:01:34-07:00 to 2021-10-06 17:00:00-07:00:
100%|█████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 16.89it/s]

After the feature server starts, we can execute cURL commands against it:

$ curl -X POST \                                 
    "https://hkosgmz4m2.execute-api.us-west-2.amazonaws.com/get-online-features" \
    -H "Content-type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "features": [
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
            "driver_hourly_stats:avg_daily_trips"
        ],
        "entities": {
            "driver_id": [1001, 1002, 1003]
        },
        "full_feature_names": true
    }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1346  100  1055  100   291   3436    947 --:--:-- --:--:-- --:--:--  4370
{
  "field_values": [
    {
      "fields": {
        "driver_id": 1001,
        "driver_hourly_stats__conv_rate": 0.025330161675810814,
        "driver_hourly_stats__avg_daily_trips": 785,
        "driver_hourly_stats__acc_rate": 0.835975170135498
      },
      "statuses": {
        "driver_hourly_stats__avg_daily_trips": "PRESENT",
        "driver_id": "PRESENT",
        "driver_hourly_stats__conv_rate": "PRESENT",
        "driver_hourly_stats__acc_rate": "PRESENT"
      }
    },
    {
      "fields": {
        "driver_hourly_stats__conv_rate": 0.7595187425613403,
        "driver_hourly_stats__acc_rate": 0.1740121990442276,
        "driver_id": 1002,
        "driver_hourly_stats__avg_daily_trips": 875
      },
      "statuses": {
        "driver_hourly_stats__acc_rate": "PRESENT",
        "driver_id": "PRESENT",
        "driver_hourly_stats__avg_daily_trips": "PRESENT",
        "driver_hourly_stats__conv_rate": "PRESENT"
      }
    },
    {
      "fields": {
        "driver_hourly_stats__acc_rate": 0.7785481214523315,
        "driver_hourly_stats__conv_rate": 0.33832859992980957,
        "driver_hourly_stats__avg_daily_trips": 846,
        "driver_id": 1003
      },
      "statuses": {
        "driver_id": "PRESENT",
        "driver_hourly_stats__conv_rate": "PRESENT",
        "driver_hourly_stats__acc_rate": "PRESENT",
        "driver_hourly_stats__avg_daily_trips": "PRESENT"
      }
    }
  ]
}

Feature repository

Feast users use Feast to manage two important sets of configuration:

  • Configuration about how to run Feast on your infrastructure

  • Feature definitions

With Feast, the above configuration can be written declaratively and stored as code in a central location. This central location is called a feature repository. The feature repository is the declarative source of truth for what the desired state of a feature store should be.

The Feast CLI uses the feature repository to configure, deploy, and manage your feature store.

What is a feature repository?

A feature repository consists of:

  • A collection of Python files containing feature declarations.

  • A feature_store.yaml file containing infrastructural configuration.

  • A .feastignore file containing paths in the feature repository to ignore.

Typically, users store their feature repositories in a Git repository, especially when working in teams. However, using Git is not a requirement.

Structure of a feature repository

The structure of a feature repository is as follows:

  • The root of the repository should contain a feature_store.yaml file and may contain a .feastignore file.

  • The repository should contain Python files that contain feature definitions.

  • The repository can contain other files as well, including documentation and potentially data files.

An example structure of a feature repository is shown below:

$ tree -a
.
├── data
│   └── driver_stats.parquet
├── driver_features.py
├── feature_store.yaml
└── .feastignore

1 directory, 4 files

A couple of things to note about the feature repository:

  • Feast reads all Python files recursively when feast apply is run, including subdirectories, even if they don't contain feature definitions.

  • It's recommended to add .feastignore and add paths to all imperative scripts if you need to store them inside the feature repository.

The feature_store.yaml configuration file

The configuration for a feature store is stored in a file named feature_store.yaml , which must be located at the root of a feature repository. An example feature_store.yaml file is shown below:

feature_store.yaml
project: my_feature_repo_1
registry: data/metadata.db
provider: local
online_store:
    path: data/online_store.db

The .feastignore file

This file contains paths that should be ignored when running feast apply. An example .feastignore is shown below:

.feastignore
# Ignore virtual environment
venv

# Ignore a specific Python file
scripts/foo.py

# Ignore all Python files directly under scripts directory
scripts/*.py

# Ignore all "foo.py" anywhere under scripts directory
scripts/**/foo.py

Feature definitions

A feature repository can also contain one or more Python files that contain feature definitions. An example feature definition file is shown below:

driver_features.py
from datetime import timedelta

from feast import BigQuerySource, Entity, Feature, FeatureView, Field
from feast.types import Float32, Int64, String

driver_locations_source = BigQuerySource(
    table_ref="rh_prod.ride_hailing_co.drivers",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

driver = Entity(
    name="driver",
    description="driver id",
)

driver_locations = FeatureView(
    name="driver_locations",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="lat", dtype=Float32),
        Field(name="lon", dtype=String),
        Field(name="driver", dtype=Int64),
    ],
    source=driver_locations_source,
)

Next steps

Python feature server

Overview

The Python feature server is an HTTP endpoint that serves features with JSON I/O. This enables users to write and read features from the online store using any programming language that can make HTTP requests.

CLI

There is a CLI command that starts the server: feast serve. By default, Feast uses port 6566; the port can be overridden with a --port flag.

Deploying as a service

Example

Initializing a feature server

Here's an example of how to start the Python feature server with a local feature repo:

$ feast init feature_repo
Creating a new Feast repository in /home/tsotne/feast/feature_repo.

$ cd feature_repo

$ feast apply
Created entity driver
Created feature view driver_hourly_stats
Created feature service driver_activity

Created sqlite table feature_repo_driver_hourly_stats

$ feast materialize-incremental $(date +%Y-%m-%d)
Materializing 1 feature views to 2021-09-09 17:00:00-07:00 into the sqlite online store.

driver_hourly_stats from 2021-09-09 16:51:08-07:00 to 2021-09-09 17:00:00-07:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 295.24it/s]

$ feast serve
09/10/2021 10:42:11 AM INFO:Started server process [8889]
INFO:     Waiting for application startup.
09/10/2021 10:42:11 AM INFO:Waiting for application startup.
INFO:     Application startup complete.
09/10/2021 10:42:11 AM INFO:Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:6566 (Press CTRL+C to quit)
09/10/2021 10:42:11 AM INFO:Uvicorn running on http://127.0.0.1:6566 (Press CTRL+C to quit)

Retrieving features

After the server starts, we can execute cURL commands from another terminal tab:

$  curl -X POST \
  "http://localhost:6566/get-online-features" \
  -d '{
    "features": [
      "driver_hourly_stats:conv_rate",
      "driver_hourly_stats:acc_rate",
      "driver_hourly_stats:avg_daily_trips"
    ],
    "entities": {
      "driver_id": [1001, 1002, 1003]
    }
  }' | jq
{
  "metadata": {
    "feature_names": [
      "driver_id",
      "conv_rate",
      "avg_daily_trips",
      "acc_rate"
    ]
  },
  "results": [
    {
      "values": [
        1001,
        0.7037263512611389,
        308,
        0.8724706768989563
      ],
      "statuses": [
        "PRESENT",
        "PRESENT",
        "PRESENT",
        "PRESENT"
      ],
      "event_timestamps": [
        "1970-01-01T00:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z"
      ]
    },
    {
      "values": [
        1002,
        0.038169607520103455,
        332,
        0.48534533381462097
      ],
      "statuses": [
        "PRESENT",
        "PRESENT",
        "PRESENT",
        "PRESENT"
      ],
      "event_timestamps": [
        "1970-01-01T00:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z"
      ]
    },
    {
      "values": [
        1003,
        0.9665873050689697,
        779,
        0.7793770432472229
      ],
      "statuses": [
        "PRESENT",
        "PRESENT",
        "PRESENT",
        "PRESENT"
      ],
      "event_timestamps": [
        "1970-01-01T00:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z"
      ]
    }
  ]
}

It's also possible to specify a feature service name instead of the list of features:

curl -X POST \
  "http://localhost:6566/get-online-features" \
  -d '{
    "feature_service": <feature-service-name>,
    "entities": {
      "driver_id": [1001, 1002, 1003]
    }
  }' | jq

Pushing features to the online and offline stores

The push mode is controlled by the string parameter to, whose options are: ["online", "offline", "online_and_offline"]. Note that timestamps need to be strings.

curl -X POST "http://localhost:6566/push" -d '{
    "push_source_name": "driver_hourly_stats_push_source",
    "df": {
            "driver_id": [1001],
            "event_timestamp": ["2022-05-13 10:59:42"],
            "created": ["2022-05-13 10:59:42"],
            "conv_rate": [1.0],
            "acc_rate": [1.0],
            "avg_daily_trips": [1000]
    },
    "to": "online_and_offline",
  }' | jq

or equivalently from Python:

import json
import requests
import pandas as pd
from datetime import datetime

event_dict = {
    "driver_id": [1001],
    "event_timestamp": [str(datetime(2021, 5, 13, 10, 59, 42))],
    "created": [str(datetime(2021, 5, 13, 10, 59, 42))],
    "conv_rate": [1.0],
    "acc_rate": [1.0],
    "avg_daily_trips": [1000],
    "string_feature": "test2",
}
push_data = {
    "push_source_name":"driver_stats_push_source",
    "df":event_dict,
    "to":"online",
}
requests.post(
    "http://localhost:6566/push",
    data=json.dumps(push_data))

Feature servers

Feast users can choose to retrieve features from a feature server, as opposed to through the Python SDK.

[Alpha] Go feature server

Overview

CLI

By default, the Go feature server is turned off. To turn it on you can add go_feature_serving: True to your feature_store.yaml:

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
go_feature_serving: True

Then the feast serve CLI command will start the Go feature server. As with Python, the Go feature server uses port 6566 by default; the port can be overridden with a --port flag. Moreover, the server uses HTTP by default, but can be set to use gRPC with --type=grpc.

Alternatively, if you wish to experiment with the Go feature server instead of permanently turning it on, you can just run feast serve --go.

Installation

The Go component comes pre-compiled when you install Feast with Python versions 3.8-3.10 on macOS or Linux (on x86). In order to install the additional Python dependencies, you should install Feast with

pip install feast[go]

For macOS, run brew install apache-arrow. For Linux users, install libarrow-dev:

sudo apt update
sudo apt install -y -V ca-certificates lsb-release wget
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y -V libarrow-dev # For C++

For developers, if you want to build from source, run make compile-go-lib to build and compile the go server. In order to build the go binaries, you will need to install the apache-arrow c++ libraries.

Alpha features

Feature logging

The Go feature server can log all requested entities and served features to a configured destination inside an offline store. This allows users to create new datasets from features served online. Those datasets could be used for future training or for feature validation. To enable feature logging we need to edit feature_store.yaml:

project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
go_feature_serving: True
feature_server:
  feature_logging:
    enabled: True

Feature logging configuration in feature_store.yaml also allows you to tweak some low-level parameters to achieve the best performance:

feature_server:
  feature_logging:
    enabled: True
    flush_interval_secs: 300
    write_to_disk_interval_secs: 30
    emit_timeout_micro_secs: 10000
    queue_capacity: 10000

All these parameters are optional.

Python SDK retrieval

The logic for the Go feature server can also be used to retrieve features during a Python get_online_features call. To enable this behavior, you must add go_feature_retrieval: True to your feature_store.yaml. You must also have all the dependencies installed as detailed above.
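
For example, building on the Redis-backed configuration shown above, the flag is added at the top level of feature_store.yaml:

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
go_feature_retrieval: True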

Driver ranking

Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.

In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.

Try it and let us know what you think!

Data ingestion

Data source

A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features.

Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.

Below is an example data source with a single entity column (driver) and two feature columns (trips_today, and rating).
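
A minimal illustration of such a source (the values below are made up for the example):

event_timestamp          driver    trips_today    rating
2022-05-01 10:00:00      1001      5              4.8
2022-05-01 10:00:00      1002      2              4.5
2022-05-01 11:00:00      1001      6              4.8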

Feast supports primarily time-stamped tabular data as data sources. There are many kinds of possible data sources:

  • Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.

  • Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:

    • Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.

    • [Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.

Batch data ingestion

Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).

A key command to use in Feast is the materialize_incremental command, which fetches the latest values for all entities in the batch source and ingests these values into the online store.

Materialization can be called programmatically or through the CLI:

Code example: programmatic scheduled materialization

This snippet creates a feature store object which points to the registry (which knows of all defined features) and the online store (DynamoDB in this case), and then incrementally materializes the latest feature values into the online store.

import datetime

from airflow.operators.python import PythonOperator
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define Python callable
def materialize():
  repo_config = RepoConfig(
    registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
    project="feast_demo_aws",
    provider="aws",
    offline_store="file",
    online_store=DynamoDBOnlineStoreConfig(region="us-west-2")
  )
  store = FeatureStore(config=repo_config)
  store.materialize_incremental(datetime.datetime.now())

# (In production) Use Airflow PythonOperator
materialize_python = PythonOperator(
    task_id='materialize_python',
    python_callable=materialize,
)
Code example: CLI based materialization

How to run this in the CLI

CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME

How to run this on Airflow

import datetime

from airflow.operators.bash import BashOperator

# Use BashOperator
materialize_bash = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)

Batch data schema inference

If the schema parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply. The way it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box this inference is performed by inspecting the schema of the table in the cloud data warehouse, or if a query is provided to the source, by running the query with a LIMIT clause and inspecting the result.
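
As a minimal sketch, assuming a local Parquet file like the one in the example repository layout above, a source can be declared without a schema and Feast will infer the columns and their types during feast apply:

from feast import FileSource

# No schema is passed here; during `feast apply`, Feast inspects the file
# (or, for warehouse sources, the table or a LIMIT-ed query) to infer it.
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)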

Stream data ingestion

Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.
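
For the Push API path, a minimal sketch looks like this (it assumes a push source named driver_stats_push_source has already been defined and applied; the event values are illustrative):

import pandas as pd
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Event produced by a stream consumer; column names must match the push source's schema.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [datetime.utcnow()],
        "created": [datetime.utcnow()],
        "conv_rate": [1.0],
        "acc_rate": [1.0],
        "avg_daily_trips": [1000],
    }
)

# By default the push goes to the online store.
store.push("driver_stats_push_source", event_df)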

FAQ

Don't see your question?

Getting started

Do you have any examples of how Feast should be used?

Concepts

Do feature views have to include entities?

How does Feast handle model or feature versioning?

Feast expects that each version of a model corresponds to a different feature service.

Feature views, once they are used by a feature service, are intended to be immutable and not deleted (until a feature service is removed). In the future, feast plan and feast apply will throw errors if they detect this kind of behavior.

What is the difference between data sources and the offline store?

Is it possible to have offline and online stores from different providers?

Yes, this is possible. For example, you can use BigQuery as an offline store and Redis as an online store.
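
A sketch of such a feature_store.yaml, combining the BigQuery and Redis configurations shown elsewhere in these docs:

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_bq_dataset
online_store:
  type: redis
  connection_string: "localhost:6379"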

Functionality

How do I run get_historical_features without providing an entity dataframe?

Does Feast provide security or access control?

Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).

It is a good idea though to lock down the registry file so only the CI/CD pipeline can modify it. That way data scientists and other users cannot accidentally modify the registry and lose other teams' data.

Does Feast support streaming sources?

Does Feast support feature transformation?

There are several kinds of transformations:

  • On-demand transformations

    • These transformations are Pandas transformations run on batch data when you call get_historical_features and at online serving time when you call get_online_features.

    • Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well

  • Batch transformations (WIP)

    • These will include SQL + PySpark based transformations on batch data sources.

  • Streaming transformations (RFC in progress)

Does Feast have a Web UI?

Does Feast support composite keys?

A feature view can be defined with multiple entities. Since each entity has a unique join_key, using multiple entities will achieve the effect of a composite key.
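
As an illustrative sketch (the driver and customer entities, the feature, and the source path here are hypothetical):

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])
customer = Entity(name="customer", join_keys=["customer_id"])

# Feature values are keyed by the composite (driver_id, customer_id) pair.
driver_customer_stats = FeatureView(
    name="driver_customer_stats",
    entities=[driver, customer],
    ttl=timedelta(days=1),
    schema=[Field(name="trips_together", dtype=Float32)],
    source=FileSource(
        path="data/driver_customer_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)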

How does Feast compare with Tecton?

What are the performance/latency characteristics of Feast?

Does Feast support embeddings and list features?

Yes. Specifically:

  • Simple lists / dense embeddings:

    • BigQuery supports list types natively

    • Redshift does not support list types, so you'll need to serialize these features into strings (e.g. json or protocol buffers)

  • Sparse embeddings (e.g. one hot encodings)

Does Feast support X storage engine?

Does Feast support using different clouds for offline vs online stores?

Yes. Using a GCP or AWS provider in feature_store.yaml primarily sets default offline / online stores and configures where the remote registry file can live (Using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.

What is the difference between a data source and an offline store?

The data source and the offline store are closely tied, but separate concepts. The offline store controls how Feast talks to a data store for historical feature retrieval, and the data source points to a specific table (or query) within a data store. Offline stores are infrastructure-level connectors to data stores like Snowflake.

Additional differences:

  • Data sources may be specific to a project (e.g. feed ranking), but offline stores are agnostic and used across projects.

  • A feast project may define several data sources that power different feature views, but a feast project has a single offline store.

  • Feast users typically need to define data sources when using feast, but only need to use/configure existing offline stores without creating new ones.

How can I add a custom online store?

Can the same storage engine be used for both the offline and online store?

Yes. For example, the Postgres connector can be used as both an offline and online store (as well as the registry).
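
A sketch of a feature_store.yaml that uses PostgreSQL for all three roles, reusing the placeholder connection details from the PostgreSQL examples above:

feature_store.yaml
project: my_feature_repo
provider: local
registry:
  registry_type: sql
  path: postgresql://DB_USERNAME:DB_PASSWORD@DB_HOST:DB_PORT/DB_NAME
offline_store:
  type: postgres
  host: DB_HOST
  port: DB_PORT
  database: DB_NAME
  db_schema: DB_SCHEMA
  user: DB_USERNAME
  password: DB_PASSWORD
online_store:
  type: postgres
  host: DB_HOST
  port: DB_PORT
  database: DB_NAME
  db_schema: DB_SCHEMA
  user: DB_USERNAME
  password: DB_PASSWORD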

Does Feast support S3 as a data source?

Yes. There are two ways to use S3 in Feast:

  • Using the s3_endpoint_override in a FileSource data source. This endpoint is more suitable for quick proof of concepts that won't necessarily scale for production use cases.

Is Feast planning on supporting X functionality?

Project

How do I contribute to Feast?

Feast 0.9 (legacy)

What is the difference between Feast 0.9 and Feast 0.10+?

How do I migrate from Feast 0.9 to Feast 0.10+?

What are the plans for Feast Core, Feast Serving, and Feast Spark?

Using Scalable Registry

Tutorial on how to use the SQL registry for scalable registry updates

Overview

By default, Feast uses a file-based registry implementation, which stores the protobuf representation of the registry as a serialized file. This registry file can be stored in a local file system, or in cloud storage (in, say, S3 or GCS).

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently). The alternative is a SQL-based registry, which can be backed by the following databases:

  • PostgreSQL

  • MySQL

  • Sqlite

Feast can use the SQL Registry via a config change in the feature_store.yaml file. An example of how to configure this would be:
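
A minimal sketch, assuming a PostgreSQL backend reachable at the placeholder connection string below (any SQLAlchemy-compatible URL for the databases listed above follows the same pattern):

feature_store.yaml
project: my_project
provider: local
registry:
  registry_type: sql
  path: postgresql://postgres:mysecretpassword@127.0.0.1:55001/feast
  cache_ttl_seconds: 60
online_store:
  path: data/online_store.db
offline_store:
  type: file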

There are some things to note about how the SQL registry works:

  • Once instantiated, the Registry ensures the tables needed to store data exist, and creates them if they do not.

  • Upon tearing down the feast project, the registry ensures that the tables are dropped from the database.

  • The schema for how data is laid out in tables can be found . It is intentionally simple, storing the serialized protobuf versions of each Feast object keyed by its name.

Example Usage: Concurrent materialization

The SQL Registry should be used when materializing feature views concurrently to ensure correctness of data in the registry. This can be achieved by simply running feast materialize or feature_store.materialize multiple times using a correctly configured feature_store.yaml. This will make each materialization process talk to the registry database concurrently, and ensure the metadata updates are serialized.
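
For example, assuming feature_store.yaml is configured with the SQL registry as above, two backfills over different (illustrative) time ranges can safely be launched in parallel from the CLI:

feast materialize 2022-01-01T00:00:00 2022-03-31T00:00:00 &
feast materialize 2022-04-01T00:00:00 2022-06-30T00:00:00 &
wait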

Build a training dataset

Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.

Retrieving historical features

1. Register your feature views

Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.

2. Define feature references

Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature views. The only requirement is that the feature views that make up the feature references have the same entity (or composite entity), and that they are located in the same offline store.
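feature_refs = [
    "driver_trips:average_daily_rides",
    "driver_trips:maximum_daily_rides",
    "driver_trips:rating",
]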

3. Create an entity dataframe

An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature views onto it. All entities found in feature views that are being joined onto the entity dataframe must be present as columns on the entity dataframe.

It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.

Pandas:

In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.
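import pandas as pd
from datetime import datetime

entity_df = pd.DataFrame(
    {
        "event_timestamp": [pd.Timestamp(datetime.now(), tz="UTC")],
        "driver_id": [1001]
    }
)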

SQL (Alternative):

Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).
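entity_df = "SELECT event_timestamp, driver_id FROM my_gcp_project.table"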

4. Launch historical retrieval

Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().
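from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/your/feature/repo")

training_df = fs.get_historical_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate"
    ],
    entity_df=entity_df
).to_df()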

Deploy a feature store
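The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.

Here we'll be using the example repository we created in the previous guide, Create a feature store. You can re-create it by running feast init in a new directory.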

Deploying

To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:
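feast apply

# Processing example.py as example
# Done!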

Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.
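At this point, no data has been materialized to your online store. feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See Load data into the online store for more details.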

Cleaning up

If you need to clean up the infrastructure created by feast apply, use the teardown command.
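feast teardown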

Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!


See this for a discussion around the tradeoffs of each of these data models.

There are currently five core online store implementations: SqliteOnlineStore, RedisOnlineStore, DynamoDBOnlineStore, SnowflakeOnlineStore, and DatastoreOnlineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOnlineStore, HbaseOnlineStore, and CassandraOnlineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific online store, such as how to configure it in a feature_store.yaml, can be found .

Please see for an explanation of online stores.

The online store provides support for materializing feature values into Redis.

The data model used to store feature values in Redis is described in more detail .
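A minimal feature_store.yaml entry for Redis might look like the following sketch (the connection string assumes a local Redis instance):

online_store:
  type: redis
  connection_string: "localhost:6379"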

The full set of configuration options is available in RedisOnlineStoreConfig.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the Redis online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

The online store provides support for materializing feature values into an SQLite database for serving online features.

The full set of configuration options is available in SqliteOnlineStoreConfig.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the Sqlite online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

The online store provides support for materializing feature values into Cloud Datastore. The data model used to store feature values in Datastore is described in more detail .

The full set of configuration options is available in DatastoreOnlineStoreConfig.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the Datastore online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

The online store provides support for materializing feature values into a Snowflake Transient Table for serving online features.

The full set of configuration options is available in SnowflakeOnlineStoreConfig.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the Snowflake online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

The online store provides support for materializing feature values into AWS DynamoDB.
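A minimal feature_store.yaml entry for DynamoDB might look like the following sketch (the region is an assumption; use the region of your deployment):

online_store:
  type: dynamodb
  region: us-west-2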

The full set of configuration options is available in DynamoDBOnlineStoreConfig.

Lastly, this IAM role needs to be associated with the desired Redshift cluster. Please follow the official AWS guide for the necessary steps.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the DynamoDB online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

The full set of configuration options is available in PostgreSQLOnlineStoreConfig.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the Postgres online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

The full set of configuration options is available in CassandraOnlineStoreConfig. For a full explanation of configuration options please look at the file sdk/python/feast/infra/online_stores/contrib/cassandra_online_store/README.md.

The set of functionality supported by online stores is described in detail in the online store overview. Below is a matrix indicating which functionality is supported by the Cassandra online store.

To compare this set of functionality against other online stores, please see the full functionality matrix.

Please see for an explanation of providers.

In order to use this offline store, you'll need to run pip install 'feast[azure]'. You can get started by then following this tutorial.

The Snowflake batch materialization engine provides a highly scalable and parallel execution engine, using a Snowflake warehouse, for batch materialization operations (materialize and materialize-incremental) when using a SnowflakeSource.

Please see for an explanation of batch materialization engines.

.feastignore is a file that is placed at the root of the feature repository. This file contains paths that should be ignored when running feast apply. An example .feastignore is shown below:
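A sketch of such a file (the listed paths are purely illustrative):

# Ignore virtual environment
venv

# Ignore a specific Python file
scripts/foo.py

# Ignore all Python files directly under the scripts directory
scripts/*.py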

The Bytewax batch materialization engine provides an execution engine for batch materialization operations (materialize and materialize-incremental).

In order to use the Bytewax materialization engine, you will need a Kubernetes cluster running version 1.22.10 or greater.

The Bytewax materialization engine loads authentication and cluster information from the kubeconfig file. By default, kubectl looks for a file named config in the $HOME/.kube directory. You can specify other kubeconfig files by setting the KUBECONFIG environment variable.

Bytewax jobs can be configured to read Kubernetes secrets as environment variables in order to access online and offline stores during job runs.

The namespace configuration directive specifies the Kubernetes namespace in which jobs, services, and configuration maps will be created.

The AWS Lambda feature server is an HTTP endpoint that serves features with JSON I/O, deployed as a Docker image through AWS Lambda and AWS API Gateway. This enables users to get features from Feast using any programming language that can make HTTP requests. A local feature server is also available. A remote feature server on GCP Cloud Run is currently being developed.

The feature_store.yaml file configures how the feature store should run. See for more details.

See for more details.

To declare new feature definitions, just add code to the feature repository, either in existing files or in a new file. For more information on how to define features, see .

See to get started with an example feature repository.

See , , or for more information on the configuration files that live in a feature registry.

One can deploy a feature server by building a docker image that bundles in the project's feature_store.yaml. See this for an example on how to run Feast on Kubernetes.

A remote feature server on AWS Lambda is also available.

The Python feature server also exposes an endpoint for push sources. This endpoint allows you to push data to the online and/or offline store.
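As a sketch, pushing a row of features to a locally running feature server might look like this (the port, push source name, and feature values are assumptions for illustration):

import requests

payload = {
    # Name of a PushSource defined in your feature repository (hypothetical here)
    "push_source_name": "driver_stats_push_source",
    "df": {
        "driver_id": [1001],
        "event_timestamp": ["2022-05-13 10:59:42"],
        "created": ["2022-05-13 10:59:42"],
        "conv_rate": [1.0],
        "acc_rate": [0.5],
        "avg_daily_trips": [10],
    },
    # Push to the online store, the offline store, or both
    "to": "online_and_offline",
}
requests.post("http://localhost:6566/push", json=payload)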

The Go feature server is an HTTP/gRPC endpoint that serves features. It is written in Go, and is therefore significantly faster than the Python feature server. See this for more details on the comparison between Python and Go. In general, we recommend the Go feature server for all production use cases that require extremely low-latency feature serving. Currently only the Redis and SQLite online stores are supported.

You must also install the Apache Arrow C++ libraries. This is because the Go feature server uses the cgo memory allocator from the Apache Arrow C++ library for interoperability between Go and Python, to prevent memory from being accidentally garbage collected when executing on-demand feature views. You can read more about the usage of the cgo memory allocator in these .

This tutorial guides you on how to use Feast with Scikit-learn. You will learn how to:

Train a model locally (on your laptop) using data from BigQuery

Test the model for online inference using SQLite (for fast iteration)

Test the model for online inference using Firestore (for production use)

(Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on-demand feature views, which allow light-weight feature engineering and combining features across sources.

To push data into the offline or online stores: see push sources for details.

(experimental) To use a contrib Spark processor to ingest from a topic, see Tutorial: Building streaming features.

We encourage you to ask questions on Slack or GitHub. Even better, once you get an answer, add the answer to this FAQ via a pull request!

The quickstart is the easiest way to learn about Feast. For more detailed tutorials, please check out the tutorials page.

No, there are .

The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see and for more details.

Feast does not provide a way to do this right now. This is an area we're actively interested in contributions for. See

Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support push based ingestion. Feast also defines a stream processor that allows a deeper integration with stream sources.

On demand transformations (See )

Batch transformations (WIP, see )

Yes. See .

Please see a detailed comparison of Feast vs. Tecton . For another comparison, please see .

Feast is designed to work at scale and support low latency online serving. See our benchmark blog post for details.

Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types (see )

One way to do this efficiently is to have a protobuf or string representation of a sparse tensor (see https://www.tensorflow.org/guide/sparse_tensor).

The list of supported offline and online stores can be found in the reference documentation. The roadmap indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about customizing Feast in the how-to guides.

Please follow the instructions .


Please see the roadmap.

For more details on contributing to the Feast community, see and this .

Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this for more details.

Please see this . If you have any questions or suggestions, feel free to leave a comment on the document!

Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the .


Running Feast in production (e.g. on Kubernetes)

Overview

After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise (e.g. on Kubernetes).

A typical production architecture looks like:

Important note: Feast is highly customizable and modular.

Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.

For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.

In this guide we will show you how to:

  1. Deploy your feature store and keep your infrastructure in sync with your feature repository

  2. Keep the data in your online store up to date (from batch and stream sources)

  3. Use Feast for model training and serving

1. Automatically deploying changes to your feature definitions

1.1 Setting up a feature repository

The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running feast apply commits feature definitions to a registry, which users can then read elsewhere.

1.2 Setting up a database-backed registry

1.3 Setting up CI/CD to automatically update the registry

We typically recommend setting up CI/CD to automatically run feast plan and feast apply when pull requests are opened / merged.
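A minimal sketch of such a pipeline, written as a GitHub Actions workflow (the workflow name, branch, Python version, and repository layout are assumptions; adapt to your CI system):

name: feast-apply
on:
  push:
    branches: [main]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install feast
      # Assumes feature definitions live under feature_repo/ in this repository
      - run: feast --chdir feature_repo apply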

1.4 Setting up multiple environments

A common scenario when using Feast in production is to want to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set).

Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.

2. How to load data into your online store and keep it up to date

To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.
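For example, an incremental load up to the current time can be triggered from the command line (assuming the command is run from within your feature repository):

feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")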

2.1 Scalable Materialization

Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.

The Bytewax materialization engine can run materialization on an existing Kubernetes cluster. An example configuration of this in a feature_store.yaml is as follows:

batch_engine:
  type: bytewax
  namespace: bytewax
  image: bytewax/bytewax-feast:latest
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-secret-access-key

2.2 Scheduled materialization

It is up to you to orchestrate and schedule runs of materialization.

However, the amount of work can quickly outgrow the resources of a single machine. That happens because the materialization job needs to repackage all rows before writing them to an online store. That leads to high utilization of CPU and memory. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.

import datetime
from airflow.operators.python_operator import PythonOperator
from feast import RepoConfig, FeatureStore
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define Python callable
def materialize():
  repo_config = RepoConfig(
    registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
    project="feast_demo_aws",
    provider="aws",
    offline_store="file",
    online_store=DynamoDBOnlineStoreConfig(region="us-west-2")
  )
  store = FeatureStore(config=repo_config)
  # Option 1: materialize just one feature view
  # store.materialize_incremental(datetime.datetime.now(), feature_views=["my_fv_name"])
  # Option 2: materialize all feature views incrementally
  store.materialize_incremental(datetime.datetime.now())

# Use Airflow PythonOperator
materialize_python = PythonOperator(
  task_id='materialize_python',
  python_callable=materialize,
)

Important note: Airflow worker must have read and write permissions to the registry file on GCS / S3 since it pulls configuration and updates materialization history.

2.3 Stream feature ingestion

This supports pushing feature values into Feast, to both online and offline stores.

3. How to use Feast for model training

3.1. Generating training data

After we've defined our features and data sources in the repository, we can generate training datasets. We highly recommend you use a FeatureService to version the features that go into a specific model version.
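A minimal sketch of such a feature service (driver_hourly_stats_view stands in for a feature view already defined in your repository):

from feast import FeatureService

driver_activity_v1 = FeatureService(
    name="driver_activity_v1",
    features=[driver_hourly_stats_view],
)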

  1. The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.

    • One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.

      from feast import FeatureStore
      
      fs = FeatureStore(repo_path="production/")
  2. Then, you need to generate an entity dataframe. You have two options

    • Create an entity dataframe manually and pass it in

    • Use a SQL query to dynamically generate lists of entities (e.g. all entities within a time range) and timestamps to pass into Feast

  3. Then, training data can be retrieved as follows:

    training_retrieval_job = fs.get_historical_features(
        entity_df=entity_df_or_sql_string,
        features=fs.get_feature_service("driver_activity_v1"),
    )
    
    # Option 1: In memory model training
    model = ml.fit(training_retrieval_job.to_df())
    
    # Option 2: Unloading to blob storage. Further post-processing can occur before kicking off distributed training.
    training_retrieval_job.to_remote_storage()

3.2 Versioning features that power ML models

The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the feature service name and the model versions have some established convention.

For example, in MLflow:

import mlflow.pyfunc
from feast import FeatureStore

# Load model from MLflow
model_name = "my-model"
model_version = 1
model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/{model_version}"
)

fs = FeatureStore(repo_path="production/")

# Read online features using the same model name and model version
feature_vector = fs.get_online_features(
    features=fs.get_feature_service(f"{model_name}_v{model_version}"),
    entity_rows=[{"driver_id": 1001}]
).to_dict()

# Make a prediction
prediction = model.predict(feature_vector)

It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.

4. Retrieving online features for prediction

Once you have successfully loaded data from batch / streaming sources into the online store, you can start consuming features for model inference.

4.1. Use the Python SDK within an existing Python service

This approach is the most convenient to keep your infrastructure as minimalistic as possible and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.

To integrate online retrieval into your service use the following code:

import json

from feast import FeatureStore

with open('feature_refs.json', 'r') as f:
    feature_refs = json.load(f)

fs = FeatureStore(repo_path="production/")

# Read online features
feature_vector = fs.get_online_features(
    features=feature_refs,
    entity_rows=[{"driver_id": 1001}]
).to_dict()

4.2. Deploy Feast feature servers on Kubernetes

Basic steps

  1. Add the Feast Helm repository and download the latest charts:

helm repo add feast-charts https://feast-helm-charts.storage.googleapis.com
helm repo update
  2. Run Helm Install

helm install feast-release feast-charts/feast-feature-server \
    --set feature_store_yaml_base64=$(base64 feature_store.yaml)    

This will deploy a single service. The service must have read access to the registry file on cloud storage. It will keep a copy of the registry in memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.

5. Using environment variables in your yaml configuration

You might want to dynamically set parts of your configuration from your environment: for instance, to deploy Feast to production and development with the same configuration but a different server, or to inject secrets without exposing them in your git repo. To do this, it is possible to use the ${ENV_VAR} syntax in your feature_store.yaml file. For instance:

project: my_project
registry: data/registry.db
provider: local
online_store:
    type: redis
    connection_string: ${REDIS_CONNECTION_STRING}

It is possible to set a default value if the environment variable is not set, with ${ENV_VAR:"default"}. For instance:

project: my_project
registry: data/registry.db
provider: local
online_store:
    type: redis
    connection_string: ${REDIS_CONNECTION_STRING:"0.0.0.0:6379"}

Summary

In summary, the overall architecture in production may look like:

  • Feast SDK is triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast database-backed registry

  • Data ingestion

    • Batch data: Airflow manages materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine

      • If your offline and online workloads are in Snowflake, the Snowflake materialization engine is likely the best option.

      • If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax materialization engine is likely the best option.

      • If none of these engines suit your needs, you may continue using the in-process engine, or write a custom engine (e.g. with Spark or Ray).

    • Stream data: The Feast Push API is used within existing Spark / Beam pipelines to push feature values to offline / online stores

  • Online features are served via the Python feature server over HTTP, or consumed using the Feast Python SDK.

  • Feast Python SDK is called locally to generate a training dataset


Additionally, please check the how-to guide for some specific recommendations on how to scale Feast.

Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available .

Different options are presented in the .

This approach may not scale to large amounts of data, which users of Feast may be dealing with in production. In this case, we recommend using one of the more scalable materialization engines, such as the Bytewax Materialization Engine or the Snowflake Materialization Engine. Users may also need to write a custom materialization engine to work on their existing infrastructure.

See also for code snippets

Feast keeps the history of materialization in its registry, so the choice could be as simple as a unix cron util. A cron util should be sufficient when you have just a few materialization jobs (it's usually one materialization job per feature view) triggered infrequently.

If you are using Airflow as a scheduler, Feast can be invoked through a PythonOperator after the Feast Python SDK has been installed into a virtual environment and your feature repo has been synced:

See more details at , which shows how to ingest streaming features or 3rd party feature data via a push API.

For more details, see

To deploy a Feast feature server on Kubernetes, you can use the included helm chart (which also has detailed instructions and an example tutorial).

Install kubectl and helm 3.


Snowflake

Description

  • All joins happen within Snowflake.

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Snowflake as a temporary table in order to complete join operations.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[snowflake]'.

If you're using a file based registry, then you'll also need to install the relevant cloud extra (pip install 'feast[snowflake, CLOUD]' where CLOUD is one of aws, gcp, azure)

You can get started by then running feast init -t snowflake.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: snowflake.offline
  account: snowflake_deployment.us-east-1
  user: user_login
  password: user_password
  role: sysadmin
  warehouse: demo_wh
  database: FEAST

Functionality Matrix

  • get_historical_features (point-in-time correct join): yes

  • pull_latest_from_table_or_query (retrieve latest feature values): yes

  • pull_all_from_table_or_query (retrieve a saved dataset): yes

  • offline_write_batch (persist dataframes to offline store): yes

  • write_logged_features (persist logged features to offline store): yes

Below is a matrix indicating which functionality is supported by SnowflakeRetrievalJob.

  • export to dataframe: yes

  • export to arrow table: yes

  • export to arrow batches: no

  • export to SQL: yes

  • export to data lake (S3, GCS, etc.): yes

  • export to data warehouse: yes

  • export as Spark dataframe: no

  • local execution of Python-based on-demand transforms: yes

  • remote execution of Python-based on-demand transforms: no

  • persist results in the offline store: yes

  • preview the query plan before execution: yes

  • read partitioned data: yes

The Snowflake offline store provides support for reading SnowflakeSources.

The full set of configuration options is available in SnowflakeOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail in the offline store overview. Below is a matrix indicating which functionality is supported by the Snowflake offline store.

To compare this set of functionality against other offline stores, please see the full functionality matrix.


File

Description

All data is downloaded and joined using Python and therefore may not scale to production workloads.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: file

Functionality Matrix

  • get_historical_features (point-in-time correct join): yes

  • pull_latest_from_table_or_query (retrieve latest feature values): yes

  • pull_all_from_table_or_query (retrieve a saved dataset): yes

  • offline_write_batch (persist dataframes to offline store): yes

  • write_logged_features (persist logged features to offline store): yes

Below is a matrix indicating which functionality is supported by FileRetrievalJob.

  • export to dataframe: yes

  • export to arrow table: yes

  • export to arrow batches: no

  • export to SQL: no

  • export to data lake (S3, GCS, etc.): no

  • export to data warehouse: no

  • export as Spark dataframe: no

  • local execution of Python-based on-demand transforms: yes

  • remote execution of Python-based on-demand transforms: no

  • persist results in the offline store: yes

  • preview the query plan before execution: yes

  • read partitioned data: yes

The file offline store provides support for reading FileSources. It uses Dask as the compute engine.

The full set of configuration options is available in FileOfflineStoreConfig.

The set of functionality supported by offline stores is described in detail in the offline store overview. Below is a matrix indicating which functionality is supported by the file offline store.

To compare this set of functionality against other offline stores, please see the full functionality matrix.


feature_store.yaml

Overview
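feature_store.yaml is used to configure a feature store. The file must be located at the root of a feature repository. An example feature_store.yaml is shown below: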

feature_store.yaml
project: loyal_spider
registry: data/registry.db
provider: local
online_store:
    type: sqlite
    path: data/online_store.db

Options

The following top-level configuration options exist in the feature_store.yaml file.

  • provider — Configures the environment in which Feast will deploy and operate.

  • registry — Configures the location of the feature registry.

  • online_store — Configures the online store.

  • offline_store — Configures the offline store.

  • project — Defines a namespace for the entire feature store. Can be used to isolate multiple deployments in a single installation of Feast. Should only contain letters, numbers, and underscores.

  • engine - Configures the batch materialization engine.

Please see the RepoConfig API reference for the full list of configuration options.