Feature view

from feast import BigQuerySource, FeatureView, Field
from feast.types import Float32, Int64

driver_stats_fv = FeatureView(
    name="driver_activity",
    entities=["driver"],
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="rating", dtype=Float32),
    ],
    source=BigQuerySource(
        table="feast-oss.demo_data.driver_activity"
    )
)

Feature views are used during

  • The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.

  • Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.

Feast does not generate feature values. It acts as the ingestion and serving system. The data sources described within feature views should reference feature values in their already computed form.

Feature views without entities

If a feature view contains features that are not related to a specific entity, the feature view can be defined without entities (only event timestamps are needed for this feature view).

from feast import BigQuerySource, FeatureView, Field
from feast.types import Int64

global_stats_fv = FeatureView(
    name="global_stats",
    entities=[],
    schema=[
        Field(name="total_trips_today_by_all_drivers", dtype=Int64),
    ],
    source=BigQuerySource(
        table="feast-oss.demo_data.global_stats"
    )
)

Feature inferencing

If the features parameter is not specified in the feature view creation, Feast will infer the features during feast apply by creating a feature for each column in the underlying data source except the columns corresponding to the entities of the feature view or the columns corresponding to the timestamp columns of the feature view's data source. The names and value types of the inferred features will use the names and data types of the columns from which the features were inferred.
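For example, the following sketch (reusing the BigQuery table from the example above) omits the schema entirely, so feast apply would create one feature per source column other than the driver entity column and the timestamp columns:

from feast import BigQuerySource, FeatureView

# No schema is given: features are inferred from the table's columns during `feast apply`,
# excluding the entity join key and the timestamp columns.
driver_stats_inferred_fv = FeatureView(
    name="driver_activity_inferred",
    entities=["driver"],
    source=BigQuerySource(
        table="feast-oss.demo_data.driver_activity"
    ),
)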

Entity aliasing

"Entity aliases" can be specified to join entity_dataframe columns that do not match the column names in the source table of a FeatureView.

This could be used if a user has no control over these column names, or if multiple entities are sub-classes of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity, as shown below.

It is suggested that you dynamically specify the new FeatureView name using .with_name and override the join key mapping using .with_join_key_map, instead of needing to register each new copy.

from feast import BigQuerySource, Entity, FeatureView, Field, ValueType
from feast.types import Int32

location = Entity(name="location", join_keys=["location_id"], value_type=ValueType.INT64)

location_stats_fv = FeatureView(
    name="location_stats",
    entities=["location"],
    schema=[
        Field(name="temperature", dtype=Int32)
    ],
    source=BigQuerySource(
        table="feast-oss.demo_data.location_stats"
    ),
)
# In a separate file, alias the feature view once per use via a feature service
from feast import FeatureService
from location_stats_feature_view import location_stats_fv

temperatures_fs = FeatureService(
    name="temperatures",
    features=[
        location_stats_fv
            .with_name("origin_stats")
            .with_join_key_map(
                {"location_id": "origin_id"}
            ),
        location_stats_fv
            .with_name("destination_stats")
            .with_join_key_map(
                {"location_id": "destination_id"}
            ),
    ],
)

Feature

A feature is an individual measurable property. It is typically a property observed on a specific entity, but does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made in an average month, while a feature that is not observed on a specific entity could be the total number of posts made by all users in the last month.

Features are defined as part of feature views. Since Feast does not transform data, a feature is essentially a schema that only contains a name and a type:

from feast import Field
from feast.types import Float32

trips_today = Field(
    name="trips_today",
    dtype=Float32
)

[Alpha] On demand feature views

On demand feature views allow users to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths:

import pandas as pd

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# Define a request data source which encodes features / information only
# available at request time (e.g. part of the user initiated HTTP request)
input_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name="val_to_add", dtype=Int64),
        Field(name="val_to_add_2", dtype=Int64),
    ]
)

# Use the input data and feature view features to create new features
@on_demand_feature_view(
   sources=[
       driver_hourly_stats_view,
       input_request
   ],
   schema=[
     Field(name='conv_rate_plus_val1', dtype=Float64),
     Field(name='conv_rate_plus_val2', dtype=Float64)
   ]
)
def transformed_conv_rate(features_df: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df['conv_rate_plus_val1'] = (features_df['conv_rate'] + features_df['val_to_add'])
    df['conv_rate_plus_val2'] = (features_df['conv_rate'] + features_df['val_to_add_2'])
    return df

A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Feature views consist of zero or more entities, one or more features, and a data source. Feature views allow Feast to model your existing feature data in a consistent way in both an offline (training) and online (serving) environment. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. If the features are not related to a specific object, the feature view might not have entities; see the section on feature views without entities.

Feature views are also used during the loading of feature values into an online store. Feature views determine the storage schema in the online store, and feature values can be loaded from batch sources or from stream sources.

Together with data sources, feature views indicate to Feast where to find your feature values, e.g., in a specific parquet file or BigQuery table. Feature definitions are also used when reading features from the feature store, using feature references.

Feature names must be unique within a feature view.


Roadmap

The list below contains the functionality that contributors are planning to develop for Feast.

  • Items below that are in development (or planned for development) will be indicated in parentheses.

  • We welcome contributions to all items in the roadmap!

  • Data Sources
    • Snowflake source
    • Redshift source
    • BigQuery source
    • Parquet file source
    • Synapse source (community plugin)
    • Hive (community plugin)
    • Postgres (contrib plugin)
    • Spark (contrib plugin)
    • Kafka / Kinesis sources (via push support into the online store)

  • Offline Stores
    • Snowflake
    • Redshift
    • BigQuery
    • Synapse (community plugin)
    • Hive (community plugin)
    • Postgres (contrib plugin)
    • Trino (contrib plugin)
    • Spark (contrib plugin)
    • In-memory / Pandas
    • Custom offline store support

  • Online Stores
    • DynamoDB
    • Redis
    • Datastore
    • SQLite
    • Azure Cache for Redis (community plugin)
    • Postgres (contrib plugin)
    • Custom online store support

  • Streaming
    • Custom streaming ingestion job support
    • Push based streaming data ingestion

  • Feature Engineering
    • On-demand Transformations (Alpha release. See RFC)
    • Batch transformation (In progress. See RFC)

  • Deployments
    • AWS Lambda (Alpha release. See RFC)
    • Kubernetes (See guide)

  • Feature Serving
    • REST Feature Server (Python) (Alpha release. See RFC)
    • gRPC Feature Server (Java) (See #1497)

  • Feature Discovery and Governance
    • Data Quality Management (See RFC)
    • Amundsen integration (see Feast extractor)
    • Feast Web UI (Alpha release. See documentation)

Want to influence our roadmap and prioritization? Submit your feedback to this form.

Want to speak to a Feast contributor? We are more than happy to jump on a call. Please schedule a time using Calendly.

Quickstart

In this tutorial we will:

  1. Deploy a local feature store with a Parquet file offline store and SQLite online store.

  2. Build a training dataset using our time series features from our Parquet files.

  3. Materialize feature values from the offline store into the online store.

  4. Read the latest features from the online store for inference.

You can run this tutorial in Google Colab or run it on your localhost, following the guided steps below.

Overview

In this tutorial, we use feature stores to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:

  1. Training-serving skew and complex data joins: Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.

    • Feast joins these tables with battle-tested logic that ensures point-in-time correctness so future feature values do not leak to models.

    • Feast alerts users to offline / online skew with data quality monitoring

  2. Online feature availability: At inference time, models often need access to features that aren't readily available and need to be precomputed from other datasources.

    • Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently available and freshly computed at inference time.

  3. Feature reusability and model versioning: Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.

    • Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via feature services).

    • Feast enables feature transformation so users can re-use transformation logic across online / offline use cases and across models.

Step 1: Install Feast

Install the Feast SDK and CLI using pip:

pip install feast

Step 2: Create a feature repository

Bootstrap a new feature repository using feast init from the command line.

feast init feature_repo
cd feature_repo
Creating a new Feast repository in /home/Jovyan/feature_repo.

Let's take a look at the resulting demo repo itself. It breaks down into

  • data/ contains raw demo parquet data

  • example.py contains demo feature definitions

  • feature_store.yaml contains a demo setup configuring where data sources are

project: my_project
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
# This is an example feature definition file

from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    path="/content/feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
# Entity has a name used for later reference (in a feature view, eg)
# and join_key to identify physical field name used in storages
driver = Entity(name="driver", value_type=ValueType.INT64, join_keys=["driver_id"], description="driver id",)

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature columns. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver"],  # reference entity by name
    ttl=timedelta(seconds=86400 * 1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

driver_stats_fs = FeatureService(
    name="driver_activity",
    features=[driver_hourly_stats_view]
)

The key line defining the overall architecture of the feature store is the provider. This defines where the raw data exists (for generating training data & feature values for serving), and where to materialize feature values to in the online store (for serving).

Valid values for provider in feature_store.yaml are:

  • local: use file source with SQLite/Redis

  • gcp: use BigQuery/Snowflake with Google Cloud Datastore/Redis

  • aws: use Redshift/Snowflake with DynamoDB/Redis

Inspecting the raw data

The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.

import pandas as pd
pd.read_parquet("data/driver_stats.parquet")

Step 3: Register feature definitions and deploy your feature store

The apply command scans python files in the current directory for feature view/entity definitions, registers the objects, and deploys infrastructure. In this example, it reads example.py (shown again below for convenience) and sets up SQLite online store tables. Note that we had specified SQLite as the default online store by using the local provider in feature_store.yaml.

feast apply
# This is an example feature definition file

from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    path="/content/feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
# Entity has a name used for later reference (in a feature view, eg)
# and join_key to identify physical field name used in storages
driver = Entity(name="driver", value_type=ValueType.INT64, join_keys=["driver_id"], description="driver id",)

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature columns. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver"],  # reference entity by name
    ttl=timedelta(seconds=86400 * 1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

driver_stats_fs = FeatureService(
    name="driver_activity",
    features=[driver_hourly_stats_view]
)
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats

Step 4: Generating training data

To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values).

The user can query that table of labels with timestamps and pass that into Feast as an entity dataframe for training data generation. In many cases, Feast will also intelligently join relevant tables to create the relevant feature vectors.

  • Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.

from datetime import datetime, timedelta
import pandas as pd

from feast import FeatureStore

# The entity dataframe is the dataframe we want to enrich with feature values
entity_df = pd.DataFrame.from_dict(
    {
        # entity's join key -> entity values
        "driver_id": [1001, 1002, 1003],

        # label name -> label values
        "label_driver_reported_satisfaction": [1, 5, 3],

        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime.now() - timedelta(minutes=11),
            datetime.now() - timedelta(minutes=36),
            datetime.now() - timedelta(minutes=73),
        ],
    }
)

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())
----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   event_timestamp                     3 non-null      datetime64[ns, UTC]
 1   driver_id                           3 non-null      int64
 2   label_driver_reported_satisfaction  3 non-null      int64
 3   conv_rate                           3 non-null      float32
 4   acc_rate                            3 non-null      float32
 5   avg_daily_trips                     3 non-null      int32
dtypes: datetime64[ns, UTC](1), float32(2), int32(1), int64(2)
memory usage: 132.0 bytes
None

----- Example features -----

                   event_timestamp  driver_id  ...  acc_rate  avg_daily_trips
0 2021-08-23 15:12:55.489091+00:00       1003  ...  0.120588              938
1 2021-08-23 15:49:55.489089+00:00       1002  ...  0.504881              635
2 2021-08-23 16:14:55.489075+00:00       1001  ...  0.138416              606

[3 rows x 6 columns]

Step 5: Load features into your online store

We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).

CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
Materializing 1 feature views to 2021-08-23 16:25:46+00:00 into the sqlite online
store.

driver_hourly_stats from 2021-08-22 16:25:47+00:00 to 2021-08-23 16:25:46+00:00:
100%|████████████████████████████████████████████| 5/5 [00:00<00:00, 592.05it/s]

Step 6: Fetching feature vectors for inference

At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using get_online_features(). These feature vectors can then be fed to the model.

from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()

pprint(feature_vector)
{
 'acc_rate': [0.5732735991477966, 0.7828438878059387],
 'avg_daily_trips': [33, 984],
 'conv_rate': [0.15498852729797363, 0.6263588070869446],
 'driver_id': [1004, 1005]
}

Step 7: Using a feature service to fetch online features instead

from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity")
features = feature_store.get_online_features(
    features=feature_service,
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()
{
 'acc_rate': [0.5732735991477966, 0.7828438878059387],
 'avg_daily_trips': [33, 984],
 'conv_rate': [0.15498852729797363, 0.6263588070869446],
 'driver_id': [1004, 1005]
}

Step 8: Browse your features with the Web UI (experimental)

View all registered features, data sources, entities, and feature services with the Web UI.

One of the ways to view this is with the feast ui command.
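For example, run the following from the feature repository directory to launch the experimental Web UI locally:

feast ui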

Next steps

Architecture

Community

Links & Resources

    • Design proposals in the form of Request for Comments (RFC).

    • User surveys and meeting minutes.

    • Slide decks of conferences our contributors have spoken at.

How can I get help?

  • Slack: Need to speak to a human? Come ask a question in our Slack channel (link above).

Community Calls

We have a user and contributor community call every two weeks (Asia & US friendly).

Please join the above Feast user groups in order to see calendar invites to the community calls.

Frequency (every 2 weeks)

  • Tuesday 10:00 am to 10:30 am PST

Links

Entity

An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.

from feast import Entity, ValueType

driver = Entity(name='driver', value_type=ValueType.STRING, join_keys=['driver_id'])

Entities should be reused across feature views.

Entity key

A related concept is an entity key. These are one or more entity values that uniquely describe a feature view record. In the case of an entity (like a driver) that only has a single entity field, the entity is an entity key. However, it is also possible for an entity key to consist of multiple entity values. For example, a feature view with the composite entity of (customer, country) might have an entity key of (1001, 5).

Entity keys act as primary keys. They are used during the lookup of features from the online store, and they are also used to match feature rows across feature views during point-in-time joins.
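As an illustrative sketch (the entity names, join keys, and source path below are hypothetical), a composite entity key is modeled by listing multiple entities on a feature view:

from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32

customer = Entity(name="customer", join_keys=["customer_id"], value_type=ValueType.INT64)
country = Entity(name="country", join_keys=["country_id"], value_type=ValueType.INT64)

# Rows in this feature view are identified by the entity key
# (customer_id, country_id), e.g. (1001, 5).
customer_country_fv = FeatureView(
    name="customer_country_stats",
    entities=["customer", "country"],
    schema=[
        Field(name="spend_last_30d", dtype=Float32),
    ],
    source=FileSource(
        path="customer_country_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)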

Point-in-time joins

Feature values in Feast are modeled as time-series records. As an example, consider a driver feature view with two feature columns (trips_today and earnings_today): each row contains a driver entity key, an event timestamp, and the feature values observed at that time.

Such a table can be registered with Feast through the following feature view:
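from feast import FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=["driver"],
    schema=[
        Field(name="trips_today", dtype=Int64),
        Field(name="earnings_today", dtype=Float32),
    ],
    ttl=timedelta(hours=2),
    source=FileSource(
        path="driver_hourly_stats.parquet"
    )
)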

Feast is able to join features from one or more feature views onto an entity dataframe in a point-in-time correct way. This means Feast is able to reproduce the state of features at a specific point in the past.

Now imagine a user has an entity dataframe of driver_id values, event timestamps, and a trip_success label column, and would like to join the above driver_hourly_stats feature view onto it while preserving the trip_success column:

The timestamps within the entity dataframe above are the events at which we want to reproduce the state of the world (i.e., what the feature values were at those specific points in time). In order to do a point-in-time join, a user would load the entity dataframe and run historical retrieval:
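import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Read in entity dataframe
entity_df = pd.read_csv("entity_df.csv")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        'driver_hourly_stats:trips_today',
        'driver_hourly_stats:earnings_today'
    ],
).to_df()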

For each row within the entity dataframe, Feast will query and join the selected features from the appropriate feature view data source. Feast will scan backward in time from the entity dataframe timestamp up to a maximum of the TTL time.

Please note that the TTL time is relative to each timestamp within the entity dataframe. TTL is not relative to the current point in time (when you run the query).

The resulting joined training dataframe contains both the original entity rows and the joined feature values.

Three feature rows were successfully joined to the entity dataframe rows. The first row in the entity dataframe was older than the earliest feature rows in the feature view and could not be joined. The last row in the entity dataframe was outside of the TTL window (the event happened 11 hours after the feature row) and also couldn't be joined.
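A minimal sketch of these semantics using plain pandas (illustrative data only, not Feast internals): merge_asof scans backward from each entity timestamp and drops matches older than the TTL.

import pandas as pd

ttl = pd.Timedelta(hours=2)

feature_rows = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2021-04-12 08:00:00", "2021-04-12 10:00:00"]),
    "trips_today": [5, 8],
})

entity_df = pd.DataFrame({
    "driver_id": [1001, 1001, 1001],
    "event_timestamp": pd.to_datetime(
        ["2021-04-12 07:00:00",   # before any feature row -> no match
         "2021-04-12 10:30:00",   # matches the 10:00 row (within the TTL)
         "2021-04-12 21:00:00"]   # 10:00 row is older than the TTL -> no match
    ),
})

# merge_asof scans backward in time from each entity timestamp,
# up to a maximum of the TTL (the `tolerance`).
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_rows.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
    tolerance=ttl,
)
print(joined)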

Registry

The Feast feature registry is a central catalog of all the feature definitions and their related metadata. It allows data scientists to search, discover, and collaborate on new features.

Each Feast deployment has a single feature registry. Feast only supports file-based registries today, but supports three different backends

  • Local: Used as a local backend for storing the registry during development

  • S3: Used as a centralized backend for storing the registry on AWS

  • GCS: Used as a centralized backend for storing the registry on GCP

The feature registry is updated during different operations when using Feast. More specifically, objects within the registry (entities, feature views, feature services) are updated when running apply from the Feast CLI, but metadata about objects can also be updated during operations like materialization.

Users interact with a feature registry through the Feast SDK. Listing all feature views:
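from feast import FeatureStore

fs = FeatureStore("my_feature_repo/")
print(fs.list_feature_views())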

Or retrieving a specific feature view:
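from feast import FeatureStore

fs = FeatureStore("my_feature_repo/")
fv = fs.get_feature_view("my_fv1")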

Offline store

Feast uses offline stores as storage and compute systems. Offline stores store historic time-series feature values. Feast does not generate these features, but instead uses the offline store as the interface for querying existing features in your organization.

Offline stores are used primarily for two reasons

  1. Building training datasets from time-series features.

  2. Materializing (loading) features from the offline store into an online store in order to serve those features at low latency for prediction.

It is not possible to query all data sources from all offline stores, and only a single offline store can be used at a time. For example, it is not possible to query a BigQuery table from a File offline store, nor is it possible for a BigQuery offline store to query files from your local file system.
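A minimal sketch of the second use, assuming the quickstart's local repository layout: materialization can also be triggered from the Python SDK, equivalent to the feast materialize CLI command.

from datetime import datetime, timedelta

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Load feature values for the last day from the offline store into the online store.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)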

In this tutorial, we focus on a local deployment. For a more in-depth guide on how to use Feast with Snowflake / GCP / AWS deployments, see Running Feast with Snowflake/GCP/AWS.

Note that there are many other sources Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See Third party integrations for all supported data sources.

A custom setup can also be made by following the guide on adding a custom provider.

You can also use feature services to manage multiple features, and decouple feature view definitions from the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same API. More information can be found in the feature retrieval documentation.

Read the Concepts page to understand the Feast data model.

Read the Architecture page.

Check out our Tutorials section for more examples on how to use Feast.

Follow our Running Feast with Snowflake/GCP/AWS guide for a more in-depth tutorial on using Feast.

Join other Feast users and contributors in Slack and become part of the community!

Speak to us: Have a question, feature request, idea, or just looking to speak to a real person? Set up a meeting with a Feast maintainer over here!

Slack: Feel free to ask questions or say hello!

Mailing list: We have both a user and developer mailing list.

Feast users should join the feast-discuss@googlegroups.com group by clicking here.

Feast developers should join the feast-dev@googlegroups.com group by clicking here.

People interested in the Feast community newsletter should join feast-announce by clicking here.

Community Calendar: Includes community calls and design meetings.

Google Folder: This folder is used as a central repository for all Feast resources (for example, the RFCs, surveys, and slide decks listed above).

Feast GitHub Repository: Find the complete Feast codebase on GitHub.

Feast Linux Foundation Wiki: Our LFAI wiki page contains links to resources for contributors and maintainers.

GitHub Issues: Found a bug or need a feature? Create an issue on GitHub.

StackOverflow: Need to ask a question on how to use Feast? We also monitor and respond to StackOverflow.

Zoom: https://zoom.us/j/6325193230

Meeting notes (incl. recordings): https://bit.ly/feast-notes

Entities are typically defined as part of feature views. The entity name is used to reference the entity from a feature view definition, and the join key is used to identify the physical primary key on which feature values should be stored and retrieved. These keys are used during the lookup of feature values from the online store and the join process in point-in-time joins. It is possible to define composite entities (more than one entity object) in a feature view. It is also possible for feature views to have zero entities. See feature views without entities for more details.

The feature registry is a Protobuf representation of Feast metadata. This Protobuf file can be read programmatically from other programming languages, but no compatibility guarantees are made on the internal structure of the registry.

Offline stores are configured through feature_store.yaml. When building training datasets or materializing features into an online store, Feast will use the configured offline store along with the data sources you have defined as part of feature views to execute the necessary data operations.

Please see the Offline Stores reference for more details on configuring offline stores.


Dataset

Dataset can be created from:

  1. Results of historical retrieval

  2. [planned] Logging features during writing to online store (from batch source or stream)

Creating Saved Dataset from Historical Retrieval

To create a saved dataset from historical features for later retrieval or analysis, a user needs to call the get_historical_features method first and then pass the returned retrieval job to the create_saved_dataset method. create_saved_dataset will trigger the provided retrieval job (by calling .persist() on it) to store the data using the specified storage. The storage type must be the same as the globally configured offline store (e.g., it is impossible to persist data to Redshift with a BigQuery source). create_saved_dataset will also create a SavedDataset object with all related metadata and will write it to the registry.

from feast import FeatureStore
from feast.infra.offline_stores.bigquery_source import SavedDatasetBigQueryStorage

store = FeatureStore()

historical_job = store.get_historical_features(
    features=["driver:avg_trip"],
    entity_df=...,
)

dataset = store.create_saved_dataset(
    from_=historical_job,
    name='my_training_dataset',
    storage=SavedDatasetBigQueryStorage(table_ref='<gcp-project>.<gcp-dataset>.my_training_dataset'),
    tags={'author': 'oleksii'}
)

dataset.to_df()

Saved dataset can be later retrieved using get_saved_dataset method:

dataset = store.get_saved_dataset('my_training_dataset')
dataset.to_df()

FAQ

Don't see your question?

Getting started

Do you have any examples of how Feast should be used?

Concepts

Do feature views have to include entities?

How does Feast handle model or feature versioning?

Feast expects that each version of a model corresponds to a different feature service.

Feature views, once they are used by a feature service, are intended to be immutable and not deleted (until the feature service is removed). In the future, feast plan and feast apply will throw errors if they see this kind of behavior.
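As an illustrative sketch (names are hypothetical, and it assumes the driver_hourly_stats_view from the quickstart plus Feast's feature view subsetting syntax), each model version can pin its own feature service:

from feast import FeatureService

# Each model version gets its own, effectively immutable, set of features.
driver_model_v1 = FeatureService(
    name="driver_model_v1",
    features=[driver_hourly_stats_view[["conv_rate", "acc_rate"]]],
)

driver_model_v2 = FeatureService(
    name="driver_model_v2",
    features=[driver_hourly_stats_view],  # v2 also uses avg_daily_trips
)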

What is the difference between data sources and the offline store?

Is it possible to have offline and online stores from different providers?

Yes, this is possible. For example, you can use BigQuery as an offline store and Redis as an online store.

Functionality

How do I run get_historical_features without providing an entity dataframe?

Does Feast provide security or access control?

Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).

It is a good idea, though, to lock down the registry file so only the CI/CD pipeline can modify it. That way data scientists and other users cannot accidentally modify the registry and lose other teams' data.

Does Feast support streaming sources?

Does Feast support feature transformation?

There are several kinds of transformations:

    • These transformations are Pandas transformations run on batch data when you call get_historical_features and at online serving time when you call get_online_features.

    • Note that if you use push sources to ingest streaming features, these transformations will execute on the fly as well.

    • These will include SQL + PySpark based transformations on batch data sources.

  • Streaming transformations (RFC in progress)

Does Feast have a Web UI?

Does Feast support composite keys?

A feature view can be defined with multiple entities. Since each entity has a unique join_key, using multiple entities will achieve the effect of a composite key.

How does Feast compare with Tecton?

What are the performance/latency characteristics of Feast?

Does Feast support embeddings and list features?

Yes. Specifically:

  • Simple lists / dense embeddings:

    • BigQuery supports list types natively

    • Redshift does not support list types, so you'll need to serialize these features into strings (e.g. json or protocol buffers)

  • Sparse embeddings (e.g. one hot encodings)

Does Feast support X storage engine?

Does Feast support using different clouds for offline vs online stores?

Yes. Using a GCP or AWS provider in feature_store.yaml primarily sets default offline / online stores and configures where the remote registry file can live (Using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.

How can I add a custom online store?

Can the same storage engine be used for both the offline and online store?

Yes. For example, the Postgres connector can be used as both an offline and online store (as well as the registry).

Does Feast support S3 as a data source?

Yes. There are two ways to use S3 in Feast:

  • Using the s3_endpoint_override in a FileSource data source. This endpoint is more suitable for quick proof of concepts that won't necessarily scale for production use cases.
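A minimal sketch of this approach (the bucket, key, and endpoint below are hypothetical):

from feast import FileSource

driver_stats = FileSource(
    path="s3://my-bucket/driver_stats.parquet",      # hypothetical S3 path
    s3_endpoint_override="http://localhost:9000",    # e.g. a MinIO or other custom S3 endpoint
    timestamp_field="event_timestamp",
)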

How can I use Spark with Feast?

Is Feast planning on supporting X functionality?

Project

How do I contribute to Feast?

Feast 0.9 (legacy)

What is the difference between Feast 0.9 and Feast 0.10+?

How do I migrate from Feast 0.9 to Feast 0.10+?

What are the plans for Feast Core, Feast Serving, and Feast Spark?

Third party integrations

We integrate with a wide set of tools and technologies so you can make Feast work in your existing stack. Many of these integrations are maintained as plugins to the main Feast repo.

Don't see your offline store or online store of choice here? Check out our guides on adding a new offline store and adding a new online store to make a custom one!

Integrations

Data Sources

Offline Stores

Online Stores

Deployments

Standards

In order for a plugin integration to be highlighted on this page, it must meet the following requirements:

  1. The plugin must have some basic documentation on how it should be used.

  2. The author must work with a maintainer to pass a basic code review (e.g. to ensure that the implementation roughly matches the core Feast implementations).

In order for a plugin integration to be merged into the main Feast repo, it must meet the following requirements:

  1. The PR must pass all integration tests. The universal tests (tests specifically designed for custom integrations) must be updated to test the integration.

  2. There is documentation and a tutorial on how to use the integration.

  3. The author (or someone else) agrees to take ownership of all the files, and maintain those files going forward.

  4. If the plugin is being contributed by an organization, and not an individual, the organization should provide the infrastructure (or credits) for integration tests.

Online store

The Feast online store is used for low-latency online feature value lookups. Feature values are loaded into the online store from data sources in feature views using the materialize command.

The storage schema of features within the online store mirrors that of the data source used to populate the online store. One key difference between the online store and data sources is that only the latest feature values are stored per entity key. No historical values are stored.

Example batch data source

Once a batch data source such as the one sketched below is materialized into Feast (using feast materialize), the feature values will be stored as described above: only the latest values per entity key.
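A minimal illustration with pandas (the values and driver IDs are made up): the batch source keeps every timestamped row, while the online store keeps only the latest row per entity key.

import pandas as pd

# Example batch data source: multiple timestamped rows per driver.
batch_source = pd.DataFrame({
    "driver_id": [1001, 1001, 1002],
    "event_timestamp": pd.to_datetime(
        ["2021-04-12 08:00:00", "2021-04-12 10:00:00", "2021-04-12 08:00:00"]
    ),
    "conv_rate": [0.80, 0.40, 0.60],
})

# After materialization, the online store holds only the latest row per entity key.
online_rows = (
    batch_source.sort_values("event_timestamp")
    .groupby("driver_id", as_index=False)
    .last()
)
print(online_rows)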

Driver ranking

Making a prediction using a linear regression model is a common use case in ML. This model predicts if a driver will complete a trip based on features ingested into Feast.

In this example, you'll learn how to use some of the key functionality in Feast. The tutorial runs in both local mode and on the Google Cloud Platform (GCP). For GCP, you must have access to a GCP project already, including read and write permissions to BigQuery.

Try it and let us know what you think!

Overview

Project

Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev, staging, prod).

Projects are currently being supported for backward compatibility reasons. Projects may change in the future as we simplify the Feast API.


Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. Data Quality Monitoring was the primary motivation for creating the dataset concept.

Dataset's metadata is stored in the Feast registry and raw data (features, entities, additional input keys and timestamp) is stored in the offline store.

[planned] Logging request (including input for on demand transformations) and response during feature serving

Check out our tutorial on validating historical features to see how this concept can be applied in a real-world use case.

We encourage you to ask questions on Slack or GitHub. Even better, once you get an answer, add the answer to this FAQ via a pull request!

The quickstart is the easiest way to learn about Feast. For more detailed tutorials, please check out the tutorials page.

No, there are feature views without entities.

The data source itself defines the underlying data warehouse table in which the features are stored. The offline store interface defines the APIs required to make an arbitrary compute layer work for Feast (e.g. pulling features given a set of feature views from their sources, exporting the data set results to different formats). Please see data sources and offline store for more details.

Feast does not provide a way to do this right now. This is an area we're actively interested in contributions for. See this GitHub issue.

Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support push based ingestion. Streaming transformations are actively being worked on.

On demand transformations (See docs)

Batch transformations (WIP, see RFC)

Yes. See documentation.

Please see a detailed comparison of Feast vs. Tecton here. For another comparison, please see here.

Feast is designed to work at scale and support low latency online serving. See our benchmark blog post for details.

Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types (see reference).

One way to do this efficiently is to have a protobuf or string representation of sparse tensors (see https://www.tensorflow.org/guide/sparse_tensor).

The list of supported offline and online stores can be found here and here, respectively. The roadmap indicates the stores for which we are planning to add support. Finally, our Provider abstraction is built to be extensible, so you can plug in your own implementations of offline and online stores. Please see more details about custom providers here.

Please follow the instructions here.

Using Redshift as a data source via Spectrum (AWS tutorial), and then continuing with the Running Feast with Snowflake/GCP/AWS guide. See a presentation we did on this at our apply() meetup.

Feast supports ingestion via Spark (see the Spark contrib data source and offline store), but does not support Spark natively. However, you can create a custom provider that will support Spark, which can help with more scalable materialization and ingestion.

Please see the roadmap.

For more details on contributing to the Feast community, see here and this guide.

Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this document for more details.

Please see this document. If you have any questions or suggestions, feel free to leave a comment on the document!

Feast Core and Feast Serving were both part of Feast Java. We plan to support Feast Serving. We will not support Feast Core; instead we will support our object store based registry. We will not support Feast Spark. For more details on what we plan on supporting, please see the roadmap.

Kafka / Kinesis sources (via push support into the online store)

AWS Lambda (Alpha release. See RFC and guide)

Kubernetes (See guide)

The plugin must have tests. Ideally it would use the Feast universal tests (see this guide for an example), but custom tests are fine.

Features can also be written to the online store via push sources.

This tutorial guides you on how to use Feast with Scikit-learn. You will learn how to:

Train a model locally (on your laptop) using data from BigQuery

Test the model for online inference using SQLite (for fast iteration)

Test the model for online inference using Firestore (for production use)

The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.


Introduction

What is Feast?

Feast (Feature Store) is an operational data system for managing and serving machine learning features to models in production. Feast is able to serve feature data to models from a low-latency online store (for real-time prediction) or from an offline store (for scale-out batch scoring or model training).

Problems Feast Solves

Models need consistent access to data: Machine Learning (ML) systems built on traditional data infrastructure are often coupled to databases, object stores, streams, and files. A result of this coupling, however, is that any change in data infrastructure may break dependent ML systems. Another challenge is that dual implementations of data retrieval for training and serving can lead to inconsistencies in data, which in turn can lead to training-serving skew.

Feast decouples your models from your data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval. Feast also provides a consistent means of referencing feature data for retrieval, and therefore ensures that models remain portable when moving from training to serving.

Deploying new features into production is difficult: Many ML teams consist of members with different objectives. Data scientists, for example, aim to deploy features into production as soon as possible, while engineers want to ensure that production systems remain stable. These differing objectives can create an organizational friction that slows time-to-market for new features.

Feast addresses this friction by providing both a centralized registry to which data scientists can publish features and a battle-hardened serving layer. Together, these enable non-engineering teams to ship features into production with minimal oversight.

Models need point-in-time correct data: ML models in production require a view of data consistent with the one on which they are trained, otherwise the accuracy of these models could be compromised. Despite this need, many data science projects suffer from inconsistencies introduced by future feature values being leaked to models during training.

Feast solves the challenge of data leakage by providing point-in-time correct feature retrieval when exporting feature datasets for model training.

Features aren't reused across projects: Different teams within an organization are often unable to reuse features across projects. The siloed nature of development and the monolithic design of end-to-end ML systems contribute to duplication of feature creation and usage across teams and projects.

Feast addresses this problem by introducing feature reuse through a centralized registry. This registry enables multiple teams working on different projects not only to contribute features, but also to reuse these same features. With Feast, data scientists can start new ML projects by selecting previously engineered features from a centralized registry, and are no longer required to develop new features for each project.

Problems Feast does not yet solve

Feature engineering: We aim for Feast to support light-weight feature engineering as part of our API.

Feature discovery: We also aim for Feast to include a first-class user interface for exploring and discovering entities and features.

Feature validation: We additionally aim for Feast to improve support for statistics generation of feature data and subsequent validation of these statistics. Current support is limited.

What Feast is not

Data warehouse: Feast is not a replacement for your data warehouse or the source of truth for all transformed data in your organization. Rather, Feast is a light-weight downstream layer that can serve data from an existing data warehouse (or other data sources) to models in production.

Data catalog: Feast is not a general purpose data catalog for your organization. Feast is purely focused on cataloging features for use in ML pipelines or systems, and only to the extent of facilitating the reuse of features.

How can I get started?

To get started with Feast, explore the Quickstart, Concepts, and Tutorials pages of this documentation.

Validating historical features with Great Expectations

In this tutorial, we will use the public dataset of Chicago taxi trips to present data validation capabilities of Feast.

  • The original dataset is stored in BigQuery and consists of raw data for each taxi trip (one row per trip) since 2013.

  • We will generate several training datasets (aka historical features in Feast) for different periods and evaluate expectations made on one dataset against another.

Types of features we're ingesting and generating:

  • Features that aggregate raw data with daily intervals (eg, trips per day, average fare or speed for a specific day, etc.).

  • Features using SQL while pulling data from BigQuery (like total trips time or total miles travelled).

  • Features calculated on the fly when requested using Feast's on-demand transformations

Our plan:

  1. Prepare environment

  2. Pull data from BigQuery (optional)

  3. Declare & apply features and feature views in Feast

  4. Generate reference dataset

  5. Develop & test profiler function

  6. Run validation on different dataset using reference dataset & profiler

0. Setup

Install Feast Python SDK and great expectations:

!pip install 'feast[ge]'

1. Dataset preparation (Optional)

You can skip this step if you don't have a GCP account. Please use the parquet files that come with this tutorial instead.

!pip install google-cloud-bigquery
import pyarrow.parquet

from google.cloud.bigquery import Client
bq_client = Client(project='kf-feast')

Running some basic aggregations while pulling data from BigQuery. Grouping by taxi_id and day:

data_query = """SELECT
    taxi_id,
    TIMESTAMP_TRUNC(trip_start_timestamp, DAY) as day,
    SUM(trip_miles) as total_miles_travelled,
    SUM(trip_seconds) as total_trip_seconds,
    SUM(fare) as total_earned,
    COUNT(*) as trip_count
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
    trip_miles > 0 AND trip_seconds > 60 AND
    trip_start_timestamp BETWEEN '2019-01-01' and '2020-12-31' AND
    trip_total < 1000
GROUP BY taxi_id, TIMESTAMP_TRUNC(trip_start_timestamp, DAY)"""
driver_stats_table = bq_client.query(data_query).to_arrow()

# Storing resulting dataset into parquet file
pyarrow.parquet.write_table(driver_stats_table, "trips_stats.parquet")
def entities_query(year):
    return f"""SELECT
    distinct taxi_id
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
    trip_miles > 0 AND trip_seconds > 0 AND
    trip_start_timestamp BETWEEN '{year}-01-01' and '{year}-12-31'
"""
entities_2019_table = bq_client.query(entities_query(2019)).to_arrow()

# Storing entities (taxi ids) into parquet file
pyarrow.parquet.write_table(entities_2019_table, "entities.parquet")

2. Declaring features

import pyarrow.parquet
import pandas as pd

from feast import FeatureView, Entity, FeatureStore, Field, BatchFeatureView
from feast.types import Float64, Int64
from feast.value_type import ValueType
from feast.data_format import ParquetFormat
from feast.on_demand_feature_view import on_demand_feature_view
from feast.infra.offline_stores.file_source import FileSource
from feast.infra.offline_stores.file import SavedDatasetFileStorage
from datetime import timedelta
batch_source = FileSource(
    timestamp_field="day",
    path="trips_stats.parquet",  # using parquet file that we created on previous step
    file_format=ParquetFormat()
)
taxi_entity = Entity(name='taxi', join_keys=['taxi_id'])
trips_stats_fv = BatchFeatureView(
    name='trip_stats',
    entities=['taxi'],
    features=[
        Field(name="total_miles_travelled", dtype=Float64),
        Field(name="total_trip_seconds", dtype=Float64),
        Field(name="total_earned", dtype=Float64),
        Field(name="trip_count", dtype=Int64),

    ],
    ttl=timedelta(seconds=86400),
    source=batch_source,
)
@on_demand_feature_view(
    schema=[
        Field("avg_fare", Float64),
        Field("avg_speed", Float64),
        Field("avg_trip_seconds", Float64),
        Field("earned_per_hour", Float64),
    ],
    sources=[
      trips_stats_fv,
    ]
)
def on_demand_stats(inp):
    out = pd.DataFrame()
    out["avg_fare"] = inp["total_earned"] / inp["trip_count"]
    out["avg_speed"] = 3600 * inp["total_miles_travelled"] / inp["total_trip_seconds"]
    out["avg_trip_seconds"] = inp["total_trip_seconds"] / inp["trip_count"]
    out["earned_per_hour"] = 3600 * inp["total_earned"] / inp["total_trip_seconds"]
    return out
store = FeatureStore(".")  # using feature_store.yaml that stored in the same directory
store.apply([taxi_entity, trips_stats_fv, on_demand_stats])  # writing to the registry

3. Generating training (reference) dataset

taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

Generating range of timestamps with daily frequency:

timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-06-01", "2019-07-01", freq='D')

Cross merge (aka relation multiplication) produces entity dataframe with each taxi_id repeated for each timestamp:

entity_df = pd.merge(taxi_ids, timestamps, how='cross')
entity_df
       | taxi_id                                           | event_timestamp
0      | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-01
1      | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-02
2      | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-03
3      | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-04
4      | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-05
...    | ...                                               | ...
156979 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-27
156980 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-28
156981 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-29
156982 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-30
156983 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-07-01

156984 rows × 2 columns

Retrieving historical features for resulting entity dataframe and persisting output as a saved dataset:

job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "trip_stats:total_miles_travelled",
        "trip_stats:total_trip_seconds",
        "trip_stats:total_earned",
        "trip_stats:trip_count",
        "on_demand_stats:avg_fare",
        "on_demand_stats:avg_trip_seconds",
        "on_demand_stats:avg_speed",
        "on_demand_stats:earned_per_hour",
    ]
)

store.create_saved_dataset(
    from_=job,
    name='my_training_ds',
    storage=SavedDatasetFileStorage(path='my_training_ds.parquet')
)
<SavedDataset(name = my_training_ds, features = ['trip_stats:total_miles_travelled', 'trip_stats:total_trip_seconds', 'trip_stats:total_earned', 'trip_stats:trip_count', 'on_demand_stats:avg_fare', 'on_demand_stats:avg_trip_seconds', 'on_demand_stats:avg_speed', 'on_demand_stats:earned_per_hour'], join_keys = ['taxi_id'], storage = <feast.infra.offline_stores.file_source.SavedDatasetFileStorage object at 0x1276e7950>, full_feature_names = False, tags = {}, _retrieval_job = <feast.infra.offline_stores.file.FileRetrievalJob object at 0x12716fed0>, min_event_timestamp = 2019-06-01 00:00:00, max_event_timestamp = 2019-07-01 00:00:00)>

4. Developing dataset profiler

A dataset profiler is a function that accepts a dataset and generates a set of its characteristics. These characteristics will then be used to evaluate (validate) subsequent datasets.

Important: datasets are not compared to each other! Feast uses a reference dataset and a profiler function to generate a reference profile. This profile will then be used during validation of the tested dataset.

import numpy as np

from feast.dqm.profilers.ge_profiler import ge_profiler

from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import PandasDataset

Loading the saved dataset first and exploring the data:

ds = store.get_saved_dataset('my_training_ds')
ds.to_df()
|  | total_earned | avg_trip_seconds | taxi_id | total_miles_travelled | trip_count | earned_per_hour | event_timestamp | total_trip_seconds | avg_fare | avg_speed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 68.25 | 2270.000000 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 24.70 | 2.0 | 54.118943 | 2019-06-01 00:00:00+00:00 | 4540.0 | 34.125000 | 19.585903 |
| 1 | 221.00 | 560.500000 | 7a4a6162eaf27805aef407d25d5cb21fe779cd962922cb... | 54.18 | 24.0 | 59.143622 | 2019-06-01 00:00:00+00:00 | 13452.0 | 9.208333 | 14.499554 |
| 2 | 160.50 | 1010.769231 | f4c9d05b215d7cbd08eca76252dae51cdb7aca9651d4ef... | 41.30 | 13.0 | 43.972603 | 2019-06-01 00:00:00+00:00 | 13140.0 | 12.346154 | 11.315068 |
| 3 | 183.75 | 697.550000 | c1f533318f8480a59173a9728ea0248c0d3eb187f4b897... | 37.30 | 20.0 | 47.415956 | 2019-06-01 00:00:00+00:00 | 13951.0 | 9.187500 | 9.625116 |
| 4 | 217.75 | 1054.076923 | 455b6b5cae6ca5a17cddd251485f2266d13d6a2c92f07c... | 69.69 | 13.0 | 57.206451 | 2019-06-01 00:00:00+00:00 | 13703.0 | 16.750000 | 18.308692 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 156979 | 38.00 | 1980.000000 | 0cccf0ec1f46d1e0beefcfdeaf5188d67e170cdff92618... | 14.90 | 1.0 | 69.090909 | 2019-07-01 00:00:00+00:00 | 1980.0 | 38.000000 | 27.090909 |
| 156980 | 135.00 | 551.250000 | beefd3462e3f5a8e854942a2796876f6db73ebbd25b435... | 28.40 | 16.0 | 55.102041 | 2019-07-01 00:00:00+00:00 | 8820.0 | 8.437500 | 11.591837 |
| 156981 | NaN | NaN | 9a3c52aa112f46cf0d129fafbd42051b0fb9b0ff8dcb0e... | NaN | NaN | NaN | 2019-07-01 00:00:00+00:00 | NaN | NaN | NaN |
| 156982 | 63.00 | 815.000000 | 08308c31cd99f495dea73ca276d19a6258d7b4c9c88e43... | 19.96 | 4.0 | 69.570552 | 2019-07-01 00:00:00+00:00 | 3260.0 | 15.750000 | 22.041718 |
| 156983 | NaN | NaN | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | NaN | NaN | NaN | 2019-07-01 00:00:00+00:00 | NaN | NaN | NaN |

156984 rows × 10 columns

DELTA = 0.1  # controls the allowed window as a fraction of the observed value, on a scale of [0, 1]

@ge_profiler
def stats_profiler(ds: PandasDataset) -> ExpectationSuite:
    # simple checks on data consistency
    ds.expect_column_values_to_be_between(
        "avg_speed",
        min_value=0,
        max_value=60,
        mostly=0.99  # allow some outliers
    )

    ds.expect_column_values_to_be_between(
        "total_miles_travelled",
        min_value=0,
        max_value=500,
        mostly=0.99  # allow some outliers
    )

    # expectation of means based on observed values
    observed_mean = ds.trip_count.mean()
    ds.expect_column_mean_to_be_between("trip_count",
                                        min_value=observed_mean * (1 - DELTA),
                                        max_value=observed_mean * (1 + DELTA))

    observed_mean = ds.earned_per_hour.mean()
    ds.expect_column_mean_to_be_between("earned_per_hour",
                                        min_value=observed_mean * (1 - DELTA),
                                        max_value=observed_mean * (1 + DELTA))


    # expectation of quantiles
    qs = [0.5, 0.75, 0.9, 0.95]
    observed_quantiles = ds.avg_fare.quantile(qs)

    ds.expect_column_quantile_values_to_be_between(
        "avg_fare",
        quantile_ranges={
            "quantiles": qs,
            "value_ranges": [[None, max_value] for max_value in observed_quantiles]
        })

    return ds.get_expectation_suite()

Testing our profiler function:

ds.get_profile(profiler=stats_profiler)
02/02/2022 02:43:47 PM INFO:	5 expectation(s) included in expectation_suite. result_format settings filtered.
<GEProfile with expectations: [
  {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "avg_speed",
      "min_value": 0,
      "max_value": 60,
      "mostly": 0.99
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "total_miles_travelled",
      "min_value": 0,
      "max_value": 500,
      "mostly": 0.99
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "trip_count",
      "min_value": 10.387244591346153,
      "max_value": 12.695521167200855
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "earned_per_hour",
      "min_value": 52.320624975640214,
      "max_value": 63.94743052578249
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_quantile_values_to_be_between",
    "kwargs": {
      "column": "avg_fare",
      "quantile_ranges": {
        "quantiles": [
          0.5,
          0.75,
          0.9,
          0.95
        ],
        "value_ranges": [
          [
            null,
            16.4
          ],
          [
            null,
            26.229166666666668
          ],
          [
            null,
            36.4375
          ],
          [
            null,
            42.0
          ]
        ]
      }
    },
    "meta": {}
  }
]>

Verify that all expectations that we coded in our profiler are present here. If some expectations are missing, it means that they failed to pass on the reference dataset (failing silently is the default behavior of Great Expectations).

Now we can create a validation reference from the dataset and the profiler function:

validation_reference = ds.as_reference(profiler=stats_profiler)

and test it against our existing retrieval job:

_ = job.to_df(validation_reference=validation_reference)
02/02/2022 02:43:52 PM INFO: 5 expectation(s) included in expectation_suite. result_format settings filtered.
02/02/2022 02:43:53 PM INFO: Validating data_asset_name None with expectation_suite_name default

Validation successfully passed as no exceptions were raised.

5. Validating new historical retrieval

Creating new timestamps for Dec 2020:

from feast.dqm.errors import ValidationFailed
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2020-12-01", "2020-12-07", freq='D')
entity_df = pd.merge(taxi_ids, timestamps, how='cross')
entity_df
|  | taxi_id | event_timestamp |
| --- | --- | --- |
| 0 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-01 |
| 1 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-02 |
| 2 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-03 |
| 3 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-04 |
| 4 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-05 |
| ... | ... | ... |
| 35443 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-03 |
| 35444 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-04 |
| 35445 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-05 |
| 35446 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-06 |
| 35447 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-07 |

35448 rows × 2 columns

job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "trip_stats:total_miles_travelled",
        "trip_stats:total_trip_seconds",
        "trip_stats:total_earned",
        "trip_stats:trip_count",
        "on_demand_stats:avg_fare",
        "on_demand_stats:avg_trip_seconds",
        "on_demand_stats:avg_speed",
        "on_demand_stats:earned_per_hour",
    ]
)

Execute retrieval job with validation reference:

try:
    df = job.to_df(validation_reference=validation_reference)
except ValidationFailed as exc:
    print(exc.validation_report)
02/02/2022 02:43:58 PM INFO: 5 expectation(s) included in expectation_suite. result_format settings filtered.
02/02/2022 02:43:59 PM INFO: Validating data_asset_name None with expectation_suite_name default

[
  {
    "expectation_config": {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "trip_count",
        "min_value": 10.387244591346153,
        "max_value": 12.695521167200855,
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "meta": {},
    "result": {
      "observed_value": 6.692920555429092,
      "element_count": 35448,
      "missing_count": 31055,
      "missing_percent": 87.6071992778154
    },
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    },
    "success": false
  },
  {
    "expectation_config": {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "earned_per_hour",
        "min_value": 52.320624975640214,
        "max_value": 63.94743052578249,
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "meta": {},
    "result": {
      "observed_value": 68.99268345164135,
      "element_count": 35448,
      "missing_count": 31055,
      "missing_percent": 87.6071992778154
    },
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    },
    "success": false
  },
  {
    "expectation_config": {
      "expectation_type": "expect_column_quantile_values_to_be_between",
      "kwargs": {
        "column": "avg_fare",
        "quantile_ranges": {
          "quantiles": [
            0.5,
            0.75,
            0.9,
            0.95
          ],
          "value_ranges": [
            [
              null,
              16.4
            ],
            [
              null,
              26.229166666666668
            ],
            [
              null,
              36.4375
            ],
            [
              null,
              42.0
            ]
          ]
        },
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "meta": {},
    "result": {
      "observed_value": {
        "quantiles": [
          0.5,
          0.75,
          0.9,
          0.95
        ],
        "values": [
          19.5,
          28.1,
          38.0,
          44.125
        ]
      },
      "element_count": 35448,
      "missing_count": 31055,
      "missing_percent": 87.6071992778154,
      "details": {
        "success_details": [
          false,
          false,
          false,
          false
        ]
      }
    },
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    },
    "success": false
  }
]

Validation failed since several expectations didn't pass:

  • Trip count (mean) decreased by more than 10% (which is expected when comparing Dec 2020 vs June 2019)

  • Average fare increased: all quantiles are higher than expected

  • Earned per hour (mean) increased by more than 10% (most probably due to the increased fare)

Data source

The data source refers to raw underlying data (e.g. a table in BigQuery).

Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store.

Below is an example data source with a single entity (driver) and two features (trips_today and rating).
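For illustration, a table laid out this way and stored as a Parquet file could be registered roughly as in the sketch below. The path and column names are assumptions, and depending on your Feast version the timestamp parameter may be called event_timestamp_column instead of timestamp_field.

from feast import FileSource

# The file is assumed to contain the columns: driver, event_timestamp, trips_today, rating
driver_activity_source = FileSource(
    path="data/driver_activity.parquet",   # assumed location of the raw feature data
    timestamp_field="event_timestamp",     # column holding the time each row was observed
)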

Feature retrieval

Dataset

A dataset is a collection of rows that is produced by a historical retrieval from Feast in order to train a model. A dataset is produced by a join from one or more feature views onto an entity dataframe. Therefore, a dataset may consist of features from multiple feature views.

Dataset vs Feature View: Feature views contain the schema of data and a reference to where data can be found (through its data source). Datasets are the actual data manifestation of querying those data sources.

Dataset vs Data Source: Datasets are the output of historical retrieval, whereas data sources are the inputs. One or more data sources can be used in the creation of a dataset.

Feature Services

from feast import FeatureService

from driver_ratings_feature_view import driver_ratings_fv
from driver_trips_feature_view import driver_stats_fv

driver_stats_fs = FeatureService(
    name="driver_activity",
    features=[driver_stats_fv, driver_ratings_fv[["lifetime_rating"]]]
)

Feature services are used during

  • The generation of training datasets when querying feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.

  • Retrieval of features for batch scoring from the offline store (e.g. with an entity dataframe where all timestamps are now())

  • Retrieval of features from the online store for online inference (with smaller batch sizes). The features retrieved from the online store may also belong to multiple feature views.

Applying a feature service does not result in an actual service being deployed.

Feature services enable referencing all or some features from a feature view.

Retrieving from the online store with a feature service

from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity")
features = feature_store.get_online_features(
    features=feature_service, entity_rows=[entity_dict]
)

Retrieving from the offline store with a feature service

from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity")
feature_store.get_historical_features(features=feature_service, entity_df=entity_df)

Feature References

This mechanism of retrieving features is only recommended while you are experimenting. Once you want to launch experiments or serve models, feature services are recommended.

Feature references uniquely identify feature values in Feast. The structure of a feature reference in string form is as follows: <feature_view>:<feature>

Feature references are used for the retrieval of features from Feast:

online_features = fs.get_online_features(
    features=[
        'driver_locations:lon',
        'drivers_activity:trips_today'
    ],
    entity_rows=[
        # {join_key: entity_value}
        {'driver': 'driver_1001'}
    ]
)

It is possible to retrieve features from multiple feature views with a single request, and Feast is able to join features from multiple tables in order to build a training dataset. However, it is not possible to reference (or retrieve) features from multiple projects at the same time.

Event timestamp

The timestamp on which an event occurred, as found in a feature view's data source. The event timestamp describes the event time at which a feature was observed or generated.

Event timestamps are used during point-in-time joins to ensure that the latest feature values are joined from feature views onto entity rows. Event timestamps are also used to ensure that old feature values aren't served to models during online serving.

Provider

A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).

Running Feast with Snowflake/GCP/AWS

Real-time credit scoring on AWS

Credit scoring models are used to approve or reject loan applications. In this tutorial we will build a real-time credit scoring system on AWS.

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is often made through a statistical model. This model uses information about a customer to determine the likelihood that they will repay or default on a loan, in a process called credit scoring.

In this example, we will demonstrate how a real-time credit scoring system can be built using Feast and Scikit-Learn on AWS, using feature data from S3.

This real-time system accepts a loan request from a customer and responds within 100ms with a decision on whether their loan has been approved or rejected.

This end-to-end tutorial will take you through the following steps:

  • Deploying Redshift as the interface Feast uses to build training datasets

  • Registering your features with Feast and configuring DynamoDB for online serving

  • Building a training dataset with Feast to train your credit scoring model

  • Loading feature values from S3 into DynamoDB

  • Making online predictions with your credit scoring model using features from DynamoDB

Overview

Functionality

  • Create Batch Features: ELT/ETL systems like Spark and SQL are used to transform data in the batch store.

  • Feast Apply: The user (or CI) publishes version-controlled feature definitions using feast apply. This CLI command updates infrastructure and persists definitions in the object store registry.

  • Feast Materialize: The user (or scheduler) executes feast materialize which loads features from the offline store into the online store.

  • Model Training: A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset and trains a model.

  • Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.

  • Deploy Model: The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.

  • Prediction: A backend system makes a request for a prediction from the model serving service.

  • Get Online Features: The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.

Components

A complete Feast deployment contains the following components:

  • Feast Registry: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.

  • Feast Python SDK/CLI: The primary user facing SDK. Used to:

    • Manage version controlled feature definitions.

    • Materialize (load) feature values into the online store.

    • Build and retrieve training datasets from the offline store.

    • Retrieve online features.

  • Offline Store: The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. Feast does not manage the offline store directly, but runs queries against it.

Java and Go Clients are also available for online feature retrieval.

Feature repository

Feast users use Feast to manage two important sets of configuration:

  • Configuration about how to run Feast on your infrastructure

  • Feature definitions

With Feast, the above configuration can be written declaratively and stored as code in a central location. This central location is called a feature repository. The feature repository is the declarative source of truth for what the desired state of a feature store should be.

The Feast CLI uses the feature repository to configure, deploy, and manage your feature store.

An example structure of a feature repository is shown below:

An ETL or ELT system: Feast is not (and does not plan to become) a general purpose data transformation or pipelining system. Feast plans to include a light-weight feature engineering toolkit, but we encourage teams to integrate Feast with upstream ETL/ELT systems that are specialized in transformation.

The best way to learn Feast is to use it. Head over to our Quickstart and try it out!

Quickstart is the fastest way to get started with Feast.

Concepts describes all important Feast API concepts.

Architecture describes Feast's overall architecture.

Tutorials shows full examples of using Feast in machine learning applications.

Running Feast with Snowflake/GCP/AWS provides a more in-depth guide to using Feast.

Reference contains detailed API and design documents.

Contributing contains resources for anyone who wants to contribute to Feast.

The original notebook and datasets for this tutorial can be found on GitHub.

Read more about feature views in the Feast docs.

Read more about on demand feature views here.

Feast uses Great Expectations as a validation engine and ExpectationSuite as a dataset's profile. Hence, we need to develop a function that will generate an ExpectationSuite. This function will receive an instance of PandasDataset (a wrapper around pandas.DataFrame), so we can utilize both the Pandas DataFrame API and some helper functions from PandasDataset during profiling.

A feature service is an object that represents a logical group of features from one or more feature views. Feature services allow features from within a feature view to be used as needed by an ML model. Users can expect to create one feature service per model version, allowing for tracking of the features used by models.

Note: if you're using feature views without entities, then those features can be added here without additional entity values in the entity_rows.

Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the gcp provider supports BigQuery as an offline store and Datastore as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (local, gcp, and aws) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the gcp provider but use Redis as the online store instead of Datastore.
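For example, a feature_store.yaml that keeps the gcp provider but overrides the online store to use Redis could look roughly like the sketch below (the project name, registry path, and connection string are placeholders):

project: my_project
registry: gs://my-bucket/registry.db
provider: gcp
online_store:
    type: redis
    connection_string: "localhost:6379"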

If the built-in providers are not sufficient, you can create your own custom provider. Please see this guide for more details.

Please see feature_store.yaml for configuring providers.

Deploying S3 with Parquet as your primary data source, containing both loan features and zip code features

Online Store: The online store is a database that stores only the latest feature values for each entity. The online store is populated by materialization jobs and from stream ingestion.

For more details, see the Feature repository reference.

$ tree -a
.
├── data
│   └── driver_stats.parquet
├── driver_features.py
├── feature_store.yaml
└── .feastignore

1 directory, 4 files

Deploy a feature store

Deploying

To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:

feast apply

# Processing example.py as example
# Done!

Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.

Cleaning up

If you need to clean up the infrastructure created by feast apply, use the teardown command.

Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!

feast teardown


Overview

These Feast tutorials showcase how to use Feast to simplify end to end model training / serving.

Install Feast

Install Feast using pip:

pip install feast

Install Feast with Snowflake dependencies (required when using Snowflake):

pip install 'feast[snowflake]'

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

pip install 'feast[gcp]'

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

pip install 'feast[aws]'

Install Feast with Redis dependencies (required when using Redis, either through AWS Elasticache or independently):

pip install 'feast[redis]'

Fraud detection on GCP

Fraud detection is a common use case in machine learning. This tutorial builds an end-to-end, production-ready fraud prediction system that predicts in real-time whether a transaction made by a user is fraudulent.

Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system. A prediction is made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our end-to-end example will perform the following workflows:

  • Computing and backfilling feature data from raw data

  • Building point-in-time correct training datasets from feature data and training a model

  • Making online predictions from feature data

Here's a high-level picture of our system architecture on Google Cloud Platform (GCP):

Driver stats on Snowflake

Initial demonstration of Snowflake as an offline store with Feast, using the Snowflake demo template.

In the steps below, we will set up a sample Feast project that leverages Snowflake as an offline store.

Starting with data in a Snowflake table, we will register that table to the feature store and define features associated with the columns in that table. From there, we will generate historical training data based on those feature definitions and then materialize the latest feature values into the online store. Lastly, we will retrieve the materialized feature values.

Our template will generate new data containing driver statistics. From there, we will show you code snippets that call the offline store to generate training datasets, and then code that calls the online store to serve the latest feature values to models in production.

Snowflake Offline Store Example

Install feast-snowflake

Get a Snowflake Trial Account (Optional)

Create a feature repository

The following files will automatically be created in your project folder:

  • feature_store.yaml -- This is your main configuration file

  • driver_repo.py -- This is your main feature definition file

  • test.py -- This is a file to test your feature store configuration

Inspect feature_store.yaml

Here you will see the information that you entered. This template will use Snowflake as an offline store and SQLite as the online store. The main thing to remember is that, by default, Snowflake objects have ALL CAPS names unless lower case was specified.

Run our test python script test.py

What we did in test.py

Initialize our Feature Store

Create a dummy training dataframe, then call our offline store to add additional columns

Materialize the latest feature values into our online store

Retrieve the latest values from our online store based on our entity key


The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.

Here we'll be using the example repository we created in the previous guide, Create a feature repository. You can re-create it by running feast init in a new directory.

At this point, no data has been materialized to your online store. Feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See Load data into the online store for more details.


Fraud detection on GCP
Driver ranking
Real-time credit scoring on AWS
Driver stats on Snowflake
Validating historical features with Great Expectations
pip install 'feast[snowflake]'
feast init -t snowflake {feature_repo_name}
Snowflake Deployment URL (exclude .snowflakecomputing.com):
Snowflake User Name::
Snowflake Password::
Snowflake Role Name (Case Sensitive)::
Snowflake Warehouse Name (Case Sensitive)::
Snowflake Database Name (Case Sensitive)::
Should I upload example data to Snowflake (overwrite table)? [Y/n]: Y
cd {feature_repo_name}
feature_store.yaml
project: ...
registry: ...
provider: local
offline_store:
    type: snowflake.offline
    account: SNOWFLAKE_DEPLOYMENT_URL #drop .snowflakecomputing.com
    user: USERNAME
    password: PASSWORD
    role: ROLE_NAME #case sensitive
    warehouse: WAREHOUSE_NAME #case sensitive
    database: DATABASE_NAME #case cap sensitive
python test.py
test.py
from datetime import datetime, timedelta

import pandas as pd
from driver_repo import driver, driver_stats_fv

from feast import FeatureStore

fs = FeatureStore(repo_path=".")

fs.apply([driver, driver_stats_fv])
test.py
entity_df = pd.DataFrame(
    {
        "event_timestamp": [
            pd.Timestamp(dt, unit="ms", tz="UTC").round("ms")
            for dt in pd.date_range(
                start=datetime.now() - timedelta(days=3),
                end=datetime.now(),
                periods=3,
            )
        ],
        "driver_id": [1001, 1002, 1003],
    }
)

features = ["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"]

training_df = fs.get_historical_features(
    features=features, entity_df=entity_df
).to_df()
test.py
fs.materialize_incremental(end_date=datetime.now())
test.py
online_features = fs.get_online_features(
    features=features,
    entity_rows=[
      # {join_key: entity_value}
      {"driver_id": 1001},
      {"driver_id": 1002}
    ],
).to_dict()

Learning by example

This workshop aims to teach users about Feast.

We explain concepts & best practices by example, and also showcase how to address common use cases.

Pre-requisites

This workshop assumes you have the following installed:

  • A local development environment that supports running Jupyter notebooks (e.g. VSCode with Jupyter plugin)

  • Python 3.7+

  • Java 11 (for Spark, e.g. brew install java11)

  • pip

  • Docker & Docker Compose (e.g. brew install docker docker-compose)

  • AWS CLI

  • Terraform (docs)

  • An AWS account setup with credentials via aws configure (e.g. see the AWS credentials quickstart)

Since we'll be learning how to leverage Feast in CI/CD, you'll also need to fork this workshop repository.

Caveats

  • M1 Macbook development is untested with this flow. See also How to run / develop for Feast on M1 Macs.

  • Windows development has only been tested with WSL. You will need to follow this guide to have Docker play nicely.

Modules

These are meant mostly to be done in order, with examples building on previous concepts.

| Time (min) | Description | Module |
| --- | --- | --- |
| 30-45 | Setting up Feast projects & CI/CD + powering batch predictions | Module 0 |
| 15-20 | Streaming ingestion & online feature retrieval with Kafka, Spark, Redis | Module 1 |
| 10-15 | Real-time feature engineering with on demand transformations | Module 2 |
| TBD | Feature server deployment (embed, as a service, AWS Lambda) | TBD |
| TBD | Versioning features / models in Feast | TBD |
| TBD | Data quality monitoring in Feast | TBD |
| TBD | Batch transformations | TBD |
| TBD | Stream transformations | TBD |

Adding a custom provider

Overview

All Feast operations execute through a provider: operations like materializing data from the offline to the online store, updating infrastructure like databases, launching streaming ingestion jobs, building training datasets, and reading features from the online store.

Custom providers allow Feast users to extend Feast to execute any custom logic. Examples include:

  • Launching custom streaming ingestion jobs (Spark, Beam)

  • Launching custom batch ingestion (materialization) jobs (Spark, Beam)

  • Adding custom validation to feature repositories during feast apply

  • Adding custom infrastructure setup logic which runs during feast apply

  • Extending Feast commands with in-house metrics, logging, or tracing

Guide

The fastest way to add custom logic to Feast is to extend an existing provider. The most generic provider is the LocalProvider which contains no cloud-specific logic. The guide that follows will extend the LocalProvider with operations that print text to the console. It is up to you as a developer to add your custom code to the provider methods, but the guide below will provide the necessary scaffolding to get you started.

Step 1: Define a Provider class

The first step is to define a custom provider class. We've created the MyCustomProvider below.

Notice how in the above provider we have only overwritten two of the methods on the LocalProvider, namely update_infra and materialize_single_feature_view. These two methods are convenient to replace if you are planning to launch custom batch or streaming jobs. update_infra can be used for launching idempotent streaming jobs, and materialize_single_feature_view can be used for launching batch ingestion jobs.

Step 2: Configuring Feast to use the provider

Notice how the provider field above points to the module and class where your provider can be found.

Step 3: Using the provider

Now you should be able to use your provider by running a Feast command:

It may also be necessary to add the module root path to your PYTHONPATH as follows:

That's it. You should now have a fully functional custom provider!

Next steps



See also: Feast quickstart, Feast x Great Expectations tutorial

Feast comes with built-in providers, e.g., LocalProvider, GcpProvider, and AwsProvider. However, users can develop their own providers by creating a class that implements the contract in the Provider class.

This guide also comes with a fully functional custom provider demo repository. Please have a look at the repository for a representative example of what a custom provider looks like, or fork the repository when creating your own provider.

It is possible to overwrite all the methods on the provider class. In fact, it isn't even necessary to subclass an existing provider like LocalProvider. The only requirement for the provider class is that it follows the Provider contract.

Configure your feature_store.yaml file to point to your new provider class:

Have a look at the custom provider demo repository for a fully functional example of a custom provider. Feel free to fork it when creating your own custom provider!

from datetime import datetime
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

from feast.entity import Entity
from feast.feature_table import FeatureTable
from feast.feature_view import FeatureView
from feast.infra.local import LocalProvider
from feast.infra.offline_stores.offline_store import RetrievalJob
from feast.protos.feast.types.EntityKey_pb2 import EntityKey as EntityKeyProto
from feast.protos.feast.types.Value_pb2 import Value as ValueProto
from feast.registry import Registry
from feast.repo_config import RepoConfig
from tqdm import tqdm


class MyCustomProvider(LocalProvider):
    def __init__(self, config: RepoConfig, repo_path):
        super().__init__(config)
        # Add your custom init code here. This code runs on every Feast operation.

    def update_infra(
        self,
        project: str,
        tables_to_delete: Sequence[Union[FeatureTable, FeatureView]],
        tables_to_keep: Sequence[Union[FeatureTable, FeatureView]],
        entities_to_delete: Sequence[Entity],
        entities_to_keep: Sequence[Entity],
        partial: bool,
    ):
        super().update_infra(
            project,
            tables_to_delete,
            tables_to_keep,
            entities_to_delete,
            entities_to_keep,
            partial,
        )
        print("Launching custom streaming jobs is pretty easy...")

    def materialize_single_feature_view(
        self,
        config: RepoConfig,
        feature_view: FeatureView,
        start_date: datetime,
        end_date: datetime,
        registry: Registry,
        project: str,
        tqdm_builder: Callable[[int], tqdm],
    ) -> None:
        super().materialize_single_feature_view(
            config, feature_view, start_date, end_date, registry, project, tqdm_builder
        )
        print("Launching custom batch jobs is pretty easy...")
project: repo
registry: registry.db
provider: feast_custom_provider.custom_provider.MyCustomProvider
online_store:
    type: sqlite
    path: online_store.db
offline_store:
    type: file
feast apply
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
Launching custom streaming jobs is pretty easy...
PYTHONPATH=$PYTHONPATH:/home/my_user/my_custom_provider feast apply

Deploying a Java feature server on Kubernetes

This tutorial guides you on how to:

  • Define features and data sources in Feast using the Feast CLI

  • Materialize features to a Redis cluster deployed on Kubernetes.

  • Deploy a Feast Java feature server into a Kubernetes cluster using the Feast helm charts

  • Retrieve features using the gRPC API exposed by the Feast Java server

Try it and let us know what you think!

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Online stores only maintain the current state of features, i.e., the latest feature values. No historical data is stored or served.

Retrieving online features

1. Ensure that feature values have been loaded into the online store

Please ensure that you have materialized (loaded) your feature values into the online store before starting

2. Define feature references

Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.

features = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate"
]

3. Read online features

Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/feature/repo")
online_features = fs.get_online_features(
    features=features,
    entity_rows=[
        # {join_key: entity_value, ...}
        {"driver_id": 1001},
        {"driver_id": 1002}]
).to_dict()
{
   "driver_hourly_stats__acc_rate":[
      0.2897740304470062,
      0.6447265148162842
   ],
   "driver_hourly_stats__conv_rate":[
      0.6508077383041382,
      0.14802511036396027
   ],
   "driver_id":[
      1001,
      1002
   ]
}

Create a feature repository

The easiest way to create a new feature repository is to use the feast init command:

The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:

Enter the directory:

You can now use this feature repository for development. You can try the following:

  • Run feast apply to apply these definitions to Feast.

  • Edit the example feature definitions in example.py and run feast apply again to change feature definitions.

  • Initialize a git repository in the same directory and check the feature repository into version control.


A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See Feature Repository for a detailed explanation of feature repositories.

Load data into the online store
feast init

Creating a new Feast repository in /<...>/tiny_pika.
feast init -t snowflake
Snowflake Deployment URL: ...
Snowflake User Name: ...
Snowflake Password: ...
Snowflake Role Name: ...
Snowflake Warehouse Name: ...
Snowflake Database Name: ...

Creating a new Feast repository in /<...>/tiny_pika.
feast init -t gcp

Creating a new Feast repository in /<...>/tiny_pika.
feast init -t aws
AWS Region (e.g. us-west-2): ...
Redshift Cluster ID: ...
Redshift Database Name: ...
Redshift User Name: ...
Redshift S3 Staging Location (s3://*): ...
Redshift IAM Role for S3 (arn:aws:iam::*:role/*): ...
Should I upload example data to Redshift (overwriting 'feast_driver_hourly_stats' table)? (Y/n):

Creating a new Feast repository in /<...>/tiny_pika.
$ tree
.
└── tiny_pika
    ├── data
    │   └── driver_stats.parquet
    ├── example.py
    └── feature_store.yaml

1 directory, 3 files
# Replace "tiny_pika" with your auto-generated dir name
cd tiny_pika

Load data into the online store

Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.

Materializing features

1. Register feature views

Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.

2.a Materialize

The materialize command allows users to materialize features over a specific historical time range into the online store.

feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00

The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.

It is also possible to materialize for specific feature views by using the -v / --views argument.

feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00 \
--views driver_hourly_stats

The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.

2.b Materialize Incremental (Alternative)

For simplicity, Feast also provides a materialize-incremental command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.

The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).

feast materialize-incremental 2021-04-08T00:00:00

The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.

Unlike materialize, materialize-incremental automatically determines the start time from which to load features from batch sources of each feature view. The first time materialize-incremental is executed it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.

Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e, different feature views may have different periods that have been materialized into the online store.

Deploy a feature store

Build a training dataset

Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.

Retrieving historical features

1. Register your feature views

Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.

2. Define feature references

Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature tables. The only requirement is that the feature tables that make up the feature references have the same entity (or composite entity), and that they are located in the same offline store.

feature_refs = [
    "driver_trips:average_daily_rides",
    "driver_trips:maximum_daily_rides",
    "driver_trips:rating",
    "driver_trips:rating:trip_completed",
]

3. Create an entity dataframe

An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature tables onto. All entities found in feature views that are being joined onto the entity dataframe must be found as columns on the entity dataframe.

It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.

Pandas:

In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.

import pandas as pd
from datetime import datetime

entity_df = pd.DataFrame(
    {
        "event_timestamp": [pd.Timestamp(datetime.now(), tz="UTC")],
        "driver_id": [1001]
    }
)

SQL (Alternative):

Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).

entity_df = "SELECT event_timestamp, driver_id FROM my_gcp_project.table"

4. Launch historical retrieval

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/your/feature/repo")

training_df = fs.get_historical_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate"
    ],
    entity_df=entity_df
).to_df()

Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().

Deploy a feature store

Running Feast in production

Overview

After learning about Feast concepts and playing with Feast locally, you're now ready to use Feast in production. This guide aims to help with the transition from a sandbox project to production-grade deployment in the cloud or on-premise.

An overview of a typical production configuration is given below:

Important note: We're trying to keep Feast modular. With the exception of the core, most of the Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration. For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features.

Furthermore, there's no single "true" approach. As you will see in this guide, Feast usually provides several options for each problem. It's totally up to you to pick a path that's better suited to your needs.

In this guide we will show you how to:

  1. Deploy your feature store and keep your infrastructure in sync with your feature repository

  2. Keep the data in your online store up to date

  3. Use Feast for model training and serving

  4. Ingest features from a stream source

  5. Monitor your production deployment

1. Automatically deploying changes to your feature definitions

The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits.

The contents of this repository are shown below:

├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
│
├── staging
│   ├── driver_repo.py
│   └── feature_store.yaml
│
└── production
    ├── driver_repo.py
    └── feature_store.yaml

The repository contains three sub-folders:

  • staging/: This folder contains the staging feature_store.yaml and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.

  • production/: This folder contains the production feature_store.yaml and Feast objects. Typically users would first test changes in staging before copying the feature definitions into the production folder, before committing the changes.

  • .github: This folder is an example of a CI system that applies the changes in either the staging or production repositories using feast apply. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.

The feature_store.yaml contains the following:

project: staging
registry: gs://feast-ci-demo-registry/staging/registry.db
provider: gcp

Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply are tracked in the registry.db. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.

It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.
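As a sketch, the CI job for the staging environment essentially boils down to installing Feast and running feast apply from the corresponding folder (cloud credentials for the CI runner are assumed to be configured separately):

pip install 'feast[gcp]'
cd staging
feast apply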

If your organization consists of many independent data science teams or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment:

└── production
    ├── common
    │    ├── __init__.py
    │    ├── sources.py
    │    └── entities.py
    ├── ranking
    │    ├── __init__.py
    │    ├── views.py
    │    └── transformations.py
    ├── segmentation
    │    ├── __init__.py
    │    ├── views.py
    │    └── transformations.py
    └── feature_store.yaml

In summary, once you have set up a Git based repository with CI that runs feast apply on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.

2. How to load data into your online store and keep it up to date

To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.

2.1. Manual materializations

The simplest way to schedule materialization is to run an incremental materialization using the Feast CLI:

feast materialize-incremental 2022-01-01T00:00:00

The above command will load all feature values from all feature view sources into the online store up to the time 2022-01-01T00:00:00.

A timestamp is required to set the end date for materialization. If your source is fully up to date then the end date would be the current time. However, if you are querying a source where data is not yet available, then you do not want to set the timestamp to the current time. You would want to use a timestamp that ends at a date for which data is available. The next time materialize-incremental is run, Feast will load data that starts from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not been made available. This is commonly the case when your source is an ETL pipeline that is scheduled on a daily basis.

An alternative approach to incremental materialization (where Feast tracks the intervals of data that need to be ingested), is to call Feast directly from your scheduler like Airflow. In this case, Airflow is the system that tracks the intervals that have been ingested.

feast materialize -v driver_hourly_stats 2020-01-01T00:00:00 2020-01-02T00:00:00

In the above example we are materializing the source data from the driver_hourly_stats feature view over a day. This command can be scheduled as the final operation in your Airflow ETL, which runs after you have computed your features and stored them in the source location. Feast will then load your feature data into your online store.

The timestamps above should match the interval of data that has been computed by the data transformation system.

2.2. Automate periodic materializations

Incremental materialization is typically scheduled with an orchestrator such as Airflow, for example with a BashOperator:

materialize = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)

Important note: The Airflow worker must have read and write permissions to the registry file on GCS / S3, since it pulls configuration and updates materialization history.

3. How to use Feast for model training

After we've defined our features and data sources in the repository, we can generate training datasets.

The first thing we need to do in our training code is to create a FeatureStore object with a path to the registry.

One way to ensure your production clients have access to the feature store is to provide a copy of the feature_store.yaml to those pipelines. This feature_store.yaml file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.

fs = FeatureStore(repo_path="production/")

Then, training data can be retrieved as follows:

feature_refs = [
    'driver_hourly_stats:conv_rate',
    'driver_hourly_stats:acc_rate',
    'driver_hourly_stats:avg_daily_trips'
]

training_df = fs.get_historical_features(
    entity_df=entity_df, 
    features=feature_refs,
).to_df()

model = ml.fit(training_df)

The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the list of feature references also be saved alongside the model. This ensures that models and the features they are trained on are paired together when being shipped into production:

# Save model
model.save('my_model.bin')

# Save features
with open('feature_refs.json', 'w') as f:
    json.dump(feature_refs, f)

To test your model locally, you can simply create a FeatureStore object, fetch online features, and then make a prediction:

# Load model
model = ml.load('my_model.bin')

# Load feature references
with open('feature_refs.json', 'r') as f:
    feature_refs = json.load(f)

# Create feature store object
fs = FeatureStore(repo_path="production/")

# Read online features
feature_vector = fs.get_online_features(
    features=feature_refs,
    entity_rows=[{"driver_id": 1001}]
).to_dict()

# Make a prediction
prediction = model.predict(feature_vector)

It is important to note that both the training pipeline and model serving service need only read access to the feature registry and associated infrastructure. This prevents clients from accidentally making changes to the feature store.

4. Retrieving online features for prediction

Once you have successfully loaded (or in Feast terminology materialized) your data from batch sources into the online store, you can start consuming features for model inference. There are three approaches for that purpose sorted from the most simple one (in an operational sense) to the most performant (benchmarks to be published soon):

4.1. Use the Python SDK within an existing Python service

This approach is the most convenient to keep your infrastructure as minimalistic as possible and avoid deploying extra services. The Feast Python SDK will connect directly to the online store (Redis, Datastore, etc), pull the feature data, and run transformations locally (if required). The obvious drawback is that your service must be written in Python to use the Feast Python SDK. A benefit of using a Python stack is that you can enjoy production-grade services with integrations with many existing data science tools.

To integrate online retrieval into your service use the following code:

import json

from feast import FeatureStore

with open('feature_refs.json', 'r') as f:
    feature_refs = json.load(f)

fs = FeatureStore(repo_path="production/")

# Read online features
feature_vector = fs.get_online_features(
    features=feature_refs,
    entity_rows=[{"driver_id": 1001}]
).to_dict()

4.2. Consume features via HTTP API from Serverless Feature Server

If you don't want to add the Feast Python SDK as a dependency, or your feature retrieval service is written in a non-Python language, Feast can deploy a simple feature server on serverless infrastructure (e.g., AWS Lambda, Google Cloud Run) for you. This service will provide an HTTP API with JSON I/O, which can be easily used with any programming language.
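For illustration, a request to such a feature server could look like the sketch below. It assumes a feature server reachable at http://localhost:6566 (the default port of feast serve); the URL will differ for a serverless deployment, and the feature names are placeholders.

import json

import requests

response = requests.post(
    "http://localhost:6566/get-online-features",   # assumed feature server address
    data=json.dumps(
        {
            "features": [
                "driver_hourly_stats:conv_rate",
                "driver_hourly_stats:acc_rate",
            ],
            "entities": {"driver_id": [1001, 1002]},
        }
    ),
)
print(response.json())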

4.3. Java based Feature Server deployed on Kubernetes

For users with very latency-sensitive and high QPS use-cases, Feast offers a high-performance Java feature server. Besides the benefits of running on the JVM, this implementation also provides a gRPC API, which guarantees good connection utilization and a small request / response body size (compared to JSON). You will need the Feast Java SDK to retrieve features from this service. This SDK wraps all the gRPC logic for you and provides more convenient APIs.

The Java based feature server can be deployed to a Kubernetes cluster via Helm charts in a few simple steps:

  1. Add the Feast Helm repository and download the latest charts:

helm repo add feast-charts https://feast-helm-charts.storage.googleapis.com
helm repo update
  2. Run Helm install

helm install feast-release feast-charts/feast \
    --set global.registry.path=s3://feast/registries/prod \
    --set global.project=<project name>

This chart will deploy two services: feature-server and transformation-service. Both must have read access to the registry file on cloud storage. Both will keep a copy of the registry in their memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.

Load balancing

5. Ingesting features from a stream source

5.1. Using Python SDK in your Apache Spark / Beam pipeline

The default option to write features from a stream is to add the Python SDK into your existing PySpark / Beam pipeline. The Feast SDK provides a writer implementation that can be called from the foreachBatch stream writer in PySpark like this:

from feast import FeatureStore

store = FeatureStore(...)

# foreachBatch passes each micro-batch DataFrame along with its batch id
def feast_writer(spark_df, batch_id):
    pandas_df = spark_df.toPandas()
    store.push("driver_hourly_stats", pandas_df)

streamingDF.writeStream.foreachBatch(feast_writer).start()

5.2. Push service (still under development)

Alternatively, if you want to ingest features directly from a broker (eg, Kafka or Kinesis), you can use the "push service", which will write to an online store. This service exposes an HTTP API, and when deployed on serverless platforms like AWS Lambda or Google Cloud Run it can be connected directly to Kinesis or PubSub.

6. Monitoring

Feast services can report their metrics to a StatsD-compatible collector. To activate this functionality, you'll need to provide a StatsD IP address and port when deploying the Helm chart (in the future, this will be added to feature_store.yaml).


Summary

To summarize, here are several architecture options that are most frequently used in production:

Option #1 (currently preferred)

  • The Feast SDK is triggered by CI (eg, GitHub Actions). It applies the latest changes from the feature repo to the Feast registry

  • Airflow manages materialization jobs to ingest data from DWH to the online store periodically

  • For stream ingestion, the Feast Python SDK is used in the existing Spark / Beam pipeline

  • Online features are served via either a Python feature server or a high performance Java feature server

    • Both the Java feature server and the transformation server are deployed on a Kubernetes cluster (via Helm charts)

  • Feast Python SDK is called locally to generate a training dataset

Option #2 (still in development)

Same as Option #1, except:

  • The push service is deployed as AWS Lambda / Google Cloud Run and is configured as a sink for Kinesis or PubSub to ingest features directly from a stream broker. Lambda / Cloud Run is managed by the Feast SDK (from the CI environment)

  • Materialization jobs are managed inside Kubernetes via Kubernetes Job (currently not managed by Helm)

Option #3 (still in development)

Same as Option #2, except:

  • Push service is deployed on Kubernetes cluster and exposes an HTTP API that can be used as a sink for Kafka (via kafka-http connector) or accessed directly.

Redshift

Description

Redshift data sources allow for the retrieval of historical feature values from Redshift for building training datasets as well as materializing features into an online store.

  • Either a table name or a SQL query can be provided.

  • No performance guarantees can be provided over SQL query-based sources. Please use table references where possible.

Examples

Using a table name

Using a query

Adding or reusing tests

Overview

This guide will go over:

  1. how Feast tests are set up

  2. how to extend the test suite to test new functionality

  3. how to use the existing test suite to test a new custom offline / online store.

Test suite overview

Let's inspect the test setup in sdk/python/tests/integration:

feature_repos has setup files for most tests in the test suite, as well as pytest fixtures for other tests. These fixtures parametrize over different offline stores, online stores, etc. and thus abstract away store-specific implementations, so tests don't need to re-implement store-specific setup such as uploading dataframes to a particular store.

Understanding an example test

Let's look at a sample test using the universal repo:

The key fixtures are the environment and universal_data_sources fixtures, which are defined in the feature_repos directories. By default, this pulls in a standard dataset with driver and customer entities, certain feature views, and feature values. By including environment as a parameter, the test is automatically parametrized across the configured offline / online store combinations.

Writing a new test or reusing existing tests

To add a new test to an existing test file

  • Use the same function signatures as an existing test (e.g. use environment as an argument) to include the relevant test fixtures.

  • If possible, expand an individual test instead of writing a new test, due to the cost of standing up offline / online stores.

To test a new offline / online store from a plugin repo

  • Install Feast in editable mode with pip install -e .

  • The core tests for offline / online store behavior are parametrized by the FULL_REPO_CONFIGS variable defined in feature_repos/repo_configuration.py. To overwrite this variable without modifying the Feast repo, create your own module that contains a FULL_REPO_CONFIGS variable (which will require adding a new IntegrationTestRepoConfig or two; see the sketch below) and set the environment variable FULL_REPO_CONFIGS_MODULE to point to that module. Then the core offline / online store tests can be run with make test-python-universal.
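
For illustration, such a module might look like the sketch below. The import path and the IntegrationTestRepoConfig arguments are assumptions based on the layout of feature_repos/; check repo_configuration.py in your Feast version for the exact names, and MyDataSourceCreator stands in for your own DataSourceCreator subclass:

# my_plugin/feast_tests.py -- hypothetical module referenced by FULL_REPO_CONFIGS_MODULE
from tests.integration.feature_repos.integration_test_repo_config import (
    IntegrationTestRepoConfig,  # assumed import path; see the Feast test suite
)

from my_plugin.creator import MyDataSourceCreator  # your DataSourceCreator subclass

FULL_REPO_CONFIGS = [
    # Exercise the custom offline store against the default online store.
    IntegrationTestRepoConfig(
        provider="local",
        offline_store_creator=MyDataSourceCreator,
    ),
]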

To include a new offline / online store in the main Feast repo

  • Extend data_source_creator.py for your offline store.

  • In repo_configuration.py, add a new IntegrationTestRepoConfig or two (depending on how many online stores you want to test).

  • Run the full test suite with make test-python-integration.

To include a new offline / online store in the main Feast repo from an external plugin (with community maintainers)

  • The feast/infra/offline_stores/contrib/ folder is for plugins that are officially maintained with community owners. Place your plugin's APIs there.

  • Extend data_source_creator.py for your offline store and implement the required APIs.

  • In contrib_repo_configuration.py add a new IntegrationTestRepoConfig (depending on how many online stores you want to test).

  • Run the test suite on the contrib test suite with make test-python-contrib-universal.

To include a new online store

  • In repo_configuration.py, add a new config that maps to a serialized version of the configuration you need in feature_store.yaml to set up the online store.

  • In repo_configuration.py, add a new IntegrationTestRepoConfig for the offline stores you want to test.

  • Run the full test suite with make test-python-integration

To use custom data in a new test

  • Check test_universal_types.py for an example of how to do this.

Running your own redis cluster for testing

  • Install redis on your computer. If you are a mac user, you should be able to brew install redis.

    • Running redis-server --help and redis-cli --help should show corresponding help menus.

  • cd into scripts/create-cluster and run ./create-cluster start, then ./create-cluster create, to start the server. You should see output that looks like this:

  • You should be able to run the integration tests and have the redis cluster tests pass.

  • If you would like to run your own redis cluster, you can run the above commands with your own specified ports and connect to the newly configured cluster.

  • To stop the cluster, run ./create-cluster stop and then ./create-cluster clean.

Adding a new offline store

Overview

In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative of adding support for any new offline store.

The process for using a custom offline store consists of 6 steps:

  1. Defining an OfflineStore class.

  2. Defining an OfflineStoreConfig class.

  3. Defining a RetrievalJob class for this offline store.

  4. Defining a DataSource class for the offline store

  5. Referencing the OfflineStore in a feature repo's feature_store.yaml file.

  6. Testing the OfflineStore class.

1. Defining an OfflineStore class

OfflineStore class names must end with the OfflineStore suffix!

The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.

There are two methods that deal with reading data from the offline store: get_historical_features and pull_latest_from_table_or_query.

  • pull_latest_from_table_or_query is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method). This method pulls data from the offline store, and the FeatureStore class takes care of writing this data into the online store.

  • get_historical_features is invoked when reading values from the offline store using the FeatureStore.get_historical_features() method. Typically, this method is used to retrieve features when training ML models.

  • pull_all_from_table_or_query is a method that pulls all the data from an offline store from a specified start date to a specified end date.
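
To make the mapping concrete, here is a rough sketch of the user-facing calls that end up invoking these methods; the feature reference and entity dataframe below are purely illustrative:

from datetime import datetime, timedelta

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Materialization calls pull_latest_from_table_or_query on the offline store
# and then writes the results into the online store.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)

# Historical retrieval calls get_historical_features on the offline store and
# returns a lazy RetrievalJob.
entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": [datetime.utcnow()]}
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],  # illustrative feature reference
).to_df()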

2. Defining an OfflineStoreConfig class

Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs configuration information like the connection information for the Redshift instance, credentials for connecting to the database, etc.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OfflineStore class.

Additionally, the name of the config class must be the same as the OfflineStore class, with the Config suffix.

An example of the config class for the custom file offline store:

This configuration can be specified in the feature_store.yaml as follows:

This configuration information is available to the methods of the OfflineStore via the config: RepoConfig parameter, which is passed into the methods of the OfflineStore interface; the offline store configuration is specifically available at the config.offline_store field of the config parameter.

3. Defining a RetrievalJob class

The offline store methods aren't expected to perform their read operations eagerly. Instead, they are expected to execute lazily, and they do so by returning a RetrievalJob instance, which represents the execution of the actual query against the underlying store.

Custom offline stores may need to implement their own instances of the RetrievalJob interface.

The RetrievalJob interface exposes two methods - to_df and to_arrow. The expectation is for the retrieval job to be able to return the rows read from the offline store as a pandas DataFrame or as an Arrow table, respectively.

4. Defining a DataSource class for the offline store

The data source class should implement two methods - from_proto, and to_proto.

For custom offline stores that are not being implemented in the main feature repo, the custom_options field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the to_proto method and reading the value back from bytes in the from_proto method.

5. Using the custom offline store

After implementing these classes, the custom offline store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the offline_store field. The value specified should be the fully qualified class name of the OfflineStore.

As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

To use our custom file offline store, we can use the following feature_store.yaml:

If additional configuration for the offline store is not required, then we can omit the other fields and only specify the type of the offline store class as the value for the offline_store.

Finally, the custom data source class can be used in the feature repo to define a data source, and referred to in a feature view definition.

6. Testing the OfflineStore class

Even if you have created the OfflineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. In the Feast submodule, we can run all the unit tests with:

The universal tests, which are integration tests specifically intended to test offline and online stores, can be run with:

The unit tests should succeed, but the universal tests will likely fail. The tests are parametrized based on the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py. To overwrite these configurations, you can simply create your own file that contains a FULL_REPO_CONFIGS variable, and point Feast to it by setting the environment variable FULL_REPO_CONFIGS_MODULE. The main challenge there will be to write a DataSourceCreator for the offline store. In this repo, the file that overwrites FULL_REPO_CONFIGS is feast_custom_offline_store/feast_tests.py, so you would run

to test the offline store against the Feast universal tests. You should notice that some of the tests actually fail; this indicates that there is a mistake in the implementation of this offline store!

Snowflake

Description

Snowflake data sources allow for the retrieval of historical feature values from Snowflake for building training datasets as well as materializing features into an online store.

  • Either a table reference or a SQL query can be provided.

Examples

Using a table reference

Using a query

Data sources

Most teams will need to have a feature store deployed to more than one environment. We have created an example repository (Feast Repository Example) which contains two Feast projects, one per environment.

It is up to you which orchestrator/scheduler to use to periodically run $ feast materialize. Feast keeps the history of materialization in its registry, so the choice could be as simple as a unix cron util. A cron util should be sufficient when you have just a few materialization jobs (it's usually one materialization job per feature view) triggered infrequently. However, the amount of work can quickly outgrow the resources of a single machine, because the materialization job needs to repackage all rows before writing them to the online store, which leads to high CPU and memory utilization. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.

If you are using Airflow as a scheduler, Feast can be invoked through the BashOperator after the Python SDK has been installed into a virtual environment and your feature repo has been synced:
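
A minimal sketch of such a DAG is shown below; the DAG id, schedule, virtual environment path, and feature repo path are placeholders to adapt to your setup:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that runs incremental materialization once per hour.
with DAG(
    dag_id="feast_materialize",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    materialize = BashOperator(
        task_id="materialize_incremental",
        # Assumes Feast is installed in /opt/feast-venv and the feature repo
        # is synced to /opt/feature_repo.
        bash_command=(
            "cd /opt/feature_repo && "
            "/opt/feast-venv/bin/feast materialize-incremental "
            "$(date -u +%Y-%m-%dT%H:%M:%S)"
        ),
    )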

Install kubectl and helm 3

The next step would be to install an L7 Load Balancer (eg, Envoy) in front of the Java feature server. For seamless integration with Kubernetes (including services created by the Feast Helm chart) we recommend using Istio as Envoy's orchestrator.

Recently Feast added functionality for stream ingestion. Please note that this is still in an early phase and new incompatible changes may be introduced.

If you are using Kafka, an HTTP Sink could be utilized as middleware. In this case, the "push service" can be deployed on Kubernetes or as a Serverless function.

We use an InfluxDB-style extension of the StatsD format to be able to send tags along with metrics. Keep that in mind while selecting the collector (telegraph will work for sure).

We chose StatsD since it's a de-facto standard with various implementations, and metrics can be easily exported to Prometheus, InfluxDB, AWS CloudWatch, etc.

Configuration options are available .

See the custom offline store demo and the custom online store demo for examples.

Feast makes adding support for a new offline store (database) easy. Developers can simply implement the OfflineStore interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and BigQuery).

The full working code for this guide can be found at feast-dev/feast-custom-offline-store-demo.

To facilitate configuration, all OfflineStore implementations are required to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined .

The FeastConfigBaseModel is a pydantic class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.
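
As a small illustration, a config class could declare a pydantic validator like the sketch below; the class name, host, and port fields are made up for the example:

from pydantic import StrictStr, validator
from pydantic.typing import Literal

from feast.repo_config import FeastConfigBaseModel


class MyOfflineStoreConfig(FeastConfigBaseModel):
    """Hypothetical config class for a custom offline store."""

    type: Literal["my_plugin.MyOfflineStore"] = "my_plugin.MyOfflineStore"
    host: StrictStr = "localhost"
    port: int = 5439

    @validator("port")
    def port_must_be_valid(cls, v):
        # Reject obviously invalid port numbers when the yaml is parsed.
        if not 0 < v < 65536:
            raise ValueError(f"invalid port: {v}")
        return v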

Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the DataSource needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.

One thing to remember is how Snowflake handles table and column name conventions. You can read more about quoted identifiers in the Snowflake documentation.

Configuration options are available .

Please see Data Source for an explanation of data sources.

Feast Repository Example
unix cron util
BashOperator
Python SDK
Read more about this feature
kubectl
helm 3
Envoy
Istio
stream ingestion
HTTP Sink
InfluxDB-style extension
telegraph
from feast import RedshiftSource

my_redshift_source = RedshiftSource(
    table="redshift_table",
)
from feast import RedshiftSource

my_redshift_source = RedshiftSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM redshift_table",
)
$ tree

.
├── e2e
│   └── test_universal_e2e.py
├── feature_repos
│   ├── repo_configuration.py
│   └── universal
│       ├── data_source_creator.py
│       ├── data_sources
│       │   ├── bigquery.py
│       │   ├── file.py
│       │   └── redshift.py
│       ├── entities.py
│       └── feature_views.py
├── offline_store
│   ├── test_s3_custom_endpoint.py
│   └── test_universal_historical_retrieval.py
├── online_store
│   ├── test_e2e_local.py
│   ├── test_feature_service_read.py
│   ├── test_online_retrieval.py
│   └── test_universal_online.py
├── registration
│   ├── test_cli.py
│   ├── test_cli_apply_duplicated_featureview_names.py
│   ├── test_cli_chdir.py
│   ├── test_feature_service_apply.py
│   ├── test_feature_store.py
│   ├── test_inference.py
│   ├── test_registry.py
│   ├── test_universal_odfv_feature_inference.py
│   └── test_universal_types.py
└── scaffolding
    ├── test_init.py
    ├── test_partial_apply.py
    ├── test_repo_config.py
    └── test_repo_operations.py

8 directories, 27 files
@pytest.mark.integration
@pytest.mark.parametrize("full_feature_names", [True, False], ids=lambda v: str(v))
def test_historical_features(environment, universal_data_sources, full_feature_names):
    store = environment.feature_store

    (entities, datasets, data_sources) = universal_data_sources
    feature_views = construct_universal_feature_views(data_sources)

    customer_df, driver_df, orders_df, global_df, entity_df = (
        datasets["customer"],
        datasets["driver"],
        datasets["orders"],
        datasets["global"],
        datasets["entity"],
    )
    # ... more test code

    customer_fv, driver_fv, driver_odfv, order_fv, global_fv = (
        feature_views["customer"],
        feature_views["driver"],
        feature_views["driver_odfv"],
        feature_views["order"],
        feature_views["global"],
    )

    feature_service = FeatureService(
        "convrate_plus100",
        features=[
            feature_views["driver"][["conv_rate"]],
            feature_views["driver_odfv"]
        ],
    )

    feast_objects = []
    feast_objects.extend(
        [
            customer_fv,
            driver_fv,
            driver_odfv,
            order_fv,
            global_fv,
            driver(),
            customer(),
            feature_service,
        ]
    )
    store.apply(feast_objects)
    # ... more test code

    job_from_df = store.get_historical_features(
        entity_df=entity_df_with_request_data,
        features=[
            "driver_stats:conv_rate",
            "driver_stats:avg_daily_trips",
            "customer_profile:current_balance",
            "customer_profile:avg_passenger_count",
            "customer_profile:lifetime_trip_count",
            "conv_rate_plus_100:conv_rate_plus_100",
            "conv_rate_plus_100:conv_rate_plus_val_to_add",
            "order:order_is_success",
            "global_stats:num_rides",
            "global_stats:avg_ride_length",
        ],
        full_feature_names=full_feature_names,
    )
    actual_df_from_df_entities = job_from_df.to_df()
    # ... more test code

    assert_frame_equal(
        expected_df, actual_df_from_df_entities, check_dtype=False,
    )
    # ... more test code
@pytest.mark.integration
def your_test(environment: Environment):
    df = #...#
    data_source = environment.data_source_creator.create_data_source(
        df,
        destination_name=environment.feature_store.project
    )
    your_fv = driver_feature_view(data_source)
    entity = driver(value_type=ValueType.UNKNOWN)
    fs = environment.feature_store
    fs.apply([your_fv, entity])

    # ... run test
Starting 6001
Starting 6002
Starting 6003
Starting 6004
Starting 6005
Starting 6006
feast_custom_offline_store/file.py
    def get_historical_features(self,
                                config: RepoConfig,
                                feature_views: List[FeatureView],
                                feature_refs: List[str],
                                entity_df: Union[pd.DataFrame, str],
                                registry: Registry, project: str,
                                full_feature_names: bool = False) -> RetrievalJob:
        print("Getting historical features from my offline store")
        return super().get_historical_features(config,
                                               feature_views,
                                               feature_refs,
                                               entity_df,
                                               registry,
                                               project,
                                               full_feature_names)

    def pull_latest_from_table_or_query(self,
                                        config: RepoConfig,
                                        data_source: DataSource,
                                        join_key_columns: List[str],
                                        feature_name_columns: List[str],
                                        timestamp_field: str,
                                        created_timestamp_column: Optional[str],
                                        start_date: datetime,
                                        end_date: datetime) -> RetrievalJob:
        print("Pulling latest features from my offline store")
        return super().pull_latest_from_table_or_query(config,
                                                       data_source,
                                                       join_key_columns,
                                                       feature_name_columns,
                                                       timestamp_field=timestamp_field,
                                                       created_timestamp_column=created_timestamp_column,
                                                       start_date=start_date,
                                                       end_date=end_date)
feast_custom_offline_store/file.py
class CustomFileOfflineStoreConfig(FeastConfigBaseModel):
    """ Custom offline store config for local (file-based) store """

    type: Literal["feast_custom_offline_store.file.CustomFileOfflineStore"] \
        = "feast_custom_offline_store.file.CustomFileOfflineStore"
feature_repo/feature_store.yaml
type: feast_custom_offline_store.file.CustomFileOfflineStore
feast_custom_offline_store/file.py
    def get_historical_features(self,
                                config: RepoConfig,
                                feature_views: List[FeatureView],
                                feature_refs: List[str],
                                entity_df: Union[pd.DataFrame, str],
                                registry: Registry, project: str,
                                full_feature_names: bool = False) -> RetrievalJob:

        offline_store_config = config.offline_store
        assert isinstance(offline_store_config, CustomFileOfflineStoreConfig)
        store_type = offline_store_config.type
feast_custom_offline_store/file.py
class CustomFileRetrievalJob(RetrievalJob):
    def __init__(self, evaluation_function: Callable):
        """Initialize a lazy historical retrieval job"""

        # The evaluation function executes a stored procedure to compute a historical retrieval.
        self.evaluation_function = evaluation_function

    def to_df(self):
        # Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.
        print("Getting a pandas DataFrame from a File is easy!")
        df = self.evaluation_function()
        return df

    def to_arrow(self):
        # Only execute the evaluation function to build the final historical retrieval dataframe at the last moment.
        print("Getting a pandas DataFrame from a File is easy!")
        df = self.evaluation_function()
        return pyarrow.Table.from_pandas(df)
feast_custom_offline_store/file.py
class CustomFileDataSource(FileSource):
    """Custom data source class for local files"""
    def __init__(
        self,
        timestamp_field: Optional[str] = "",
        path: Optional[str] = None,
        field_mapping: Optional[Dict[str, str]] = None,
        created_timestamp_column: Optional[str] = "",
        date_partition_column: Optional[str] = "",
    ):
        super(CustomFileDataSource, self).__init__(
            timestamp_field=timestamp_field,
            created_timestamp_column=created_timestamp_column,
            field_mapping=field_mapping,
            date_partition_column=date_partition_column,
        )
        self._path = path


    @staticmethod
    def from_proto(data_source: DataSourceProto):
        custom_source_options = str(
            data_source.custom_options.configuration, encoding="utf8"
        )
        path = json.loads(custom_source_options)["path"]
        return CustomFileDataSource(
            field_mapping=dict(data_source.field_mapping),
            path=path,
            timestamp_field=data_source.timestamp_field,
            created_timestamp_column=data_source.created_timestamp_column,
            date_partition_column=data_source.date_partition_column,
        )

    def to_proto(self) -> DataSourceProto:
        config_json = json.dumps({"path": self._path})
        data_source_proto = DataSourceProto(
            type=DataSourceProto.CUSTOM_SOURCE,
            custom_options=DataSourceProto.CustomSourceOptions(
                configuration=bytes(config_json, encoding="utf8")
            ),
        )

        data_source_proto.timestamp_field = self.timestamp_field
        data_source_proto.created_timestamp_column = self.created_timestamp_column
        data_source_proto.date_partition_column = self.date_partition_column

        return data_source_proto
feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
offline_store:
    type: feast_custom_offline_store.file.CustomFileOfflineStore
feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
offline_store: feast_custom_offline_store.file.CustomFileOfflineStore
feature_repo/repo.py
driver_hourly_stats = CustomFileDataSource(
    path="feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)


driver_hourly_stats_view = FeatureView(
    source=driver_hourly_stats,
    ...
)
make test
make test-python-universal
export FULL_REPO_CONFIGS_MODULE='feast_custom_offline_store.feast_tests'
make test-python-universal
from feast import SnowflakeSource

my_snowflake_source = SnowflakeSource(
    database="FEAST",
    schema="PUBLIC",
    table="FEATURE_TABLE",
)
from feast import SnowflakeSource

my_snowflake_source = SnowflakeSource(
    query="""
    SELECT
        timestamp_column AS "ts",
        "created",
        "f1",
        "f2"
    FROM
        `FEAST.PUBLIC.FEATURE_TABLE`
      """,
)
here
custom offline store demo
custom online store demo
OfflineStore
feast-dev/feast-custom-offline-store-demo
here
pydantic
base class
here
here
File
Snowflake
BigQuery
Redshift
Push
Spark (contrib)
PostgreSQL (contrib)
Data Source

Adding a new online store

Overview

In this guide, we will show you how to integrate with MySQL as an online store. While we will be implementing a specific store, this guide should be representative of adding support for any new online store.

The process of using a custom online store consists of 4 steps:

  1. Defining the OnlineStore class.

  2. Defining the OnlineStoreConfig class.

  3. Referencing the OnlineStore in a feature repo's feature_store.yaml file.

  4. Testing the OnlineStore class.

1. Defining an OnlineStore class

OnlineStore class names must end with the OnlineStore suffix!

The OnlineStore class broadly contains two sets of methods

  • One set deals with managing the infrastructure that the online store needs for operations

  • One set deals with writing data into the store, and reading data from the store.

1.1 Infrastructure Methods

There are two methods that deal with managing infrastructure for online stores: update and teardown.

  • update is invoked when users run feast apply as a CLI command, or the FeatureStore.apply() sdk method.

The update method should be used to perform any operations necessary before data can be written to or read from the store. The update method can be used to create MySQL tables in preparation for reads and writes to new feature views.

  • teardown is invoked when users run feast teardown or FeatureStore.teardown().

The teardown method should be used to perform any clean-up operations. teardown can be used to drop MySQL indices and tables corresponding to the feature views being deleted.

feast_custom_online_store/mysql.py
def update(
    self,
    config: RepoConfig,
    tables_to_delete: Sequence[Union[FeatureTable, FeatureView]],
    tables_to_keep: Sequence[Union[FeatureTable, FeatureView]],
    entities_to_delete: Sequence[Entity],
    entities_to_keep: Sequence[Entity],
    partial: bool,
):
    """
    An example of creating and managing the tables needed for a mysql-backed online store.
    """
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)

    project = config.project

    for table in tables_to_keep:
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {_table_id(project, table)} (entity_key VARCHAR(512), feature_name VARCHAR(256), value BLOB, event_ts timestamp, created_ts timestamp,  PRIMARY KEY(entity_key, feature_name))"
        )
        cur.execute(
            f"CREATE INDEX {_table_id(project, table)}_ek ON {_table_id(project, table)} (entity_key);"
        )

    for table in tables_to_delete:
        cur.execute(
            f"DROP INDEX {_table_id(project, table)}_ek ON {_table_id(project, table)};"
        )
        cur.execute(f"DROP TABLE IF EXISTS {_table_id(project, table)}")


def teardown(
    self,
    config: RepoConfig,
    tables: Sequence[Union[FeatureTable, FeatureView]],
    entities: Sequence[Entity],
):
    """
    An example of tearing down the tables of a mysql-backed online store.
    """
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)
    project = config.project

    for table in tables:
        cur.execute(
            f"DROP INDEX {_table_id(project, table)}_ek ON {_table_id(project, table)};"
        )
        cur.execute(f"DROP TABLE IF EXISTS {_table_id(project, table)}")

1.2 Read/Write Methods

There are two methods that deal with writing data to and reading data from the online store: online_write_batch and online_read.

  • online_write_batch is invoked when running materialization (using the feast materialize or feast materialize-incremental commands, or the corresponding FeatureStore.materialize() method).

  • online_read is invoked when reading values from the online store using the FeatureStore.get_online_features() method.

feast_custom_online_store/mysql.py
def online_write_batch(
    self,
    config: RepoConfig,
    table: Union[FeatureTable, FeatureView],
    data: List[
        Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
    ],
    progress: Optional[Callable[[int], Any]],
) -> None:
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)

    project = config.project

    for entity_key, values, timestamp, created_ts in data:
        entity_key_bin = serialize_entity_key(entity_key).hex()
        timestamp = _to_naive_utc(timestamp)
        if created_ts is not None:
            created_ts = _to_naive_utc(created_ts)

        for feature_name, val in values.items():
            self.write_to_table(created_ts, cur, entity_key_bin, feature_name, project, table, timestamp, val)
        conn.commit()
        if progress:
            progress(1)

def online_read(
    self,
    config: RepoConfig,
    table: Union[FeatureTable, FeatureView],
    entity_keys: List[EntityKeyProto],
    requested_features: Optional[List[str]] = None,
) -> List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]]:
    conn = self._get_conn(config)
    cur = conn.cursor(buffered=True)

    result: List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]] = []

    project = config.project
    for entity_key in entity_keys:
        entity_key_bin = serialize_entity_key(entity_key).hex()
        print(f"entity_key_bin: {entity_key_bin}")

        cur.execute(
            f"SELECT feature_name, value, event_ts FROM {_table_id(project, table)} WHERE entity_key = %s",
            (entity_key_bin,),
        )

        res = {}
        res_ts = None
        for feature_name, val_bin, ts in cur.fetchall():
            val = ValueProto()
            val.ParseFromString(val_bin)
            res[feature_name] = val
            res_ts = ts

        if not res:
            result.append((None, None))
        else:
            result.append((res_ts, res))
    return result

2. Defining an OnlineStoreConfig class

Additional configuration may be needed to allow the OnlineStore to talk to the backing store. For example, MySQL may need configuration information like the host at which the MySQL instance is running, credentials for connecting to the database, etc.

This config class must contain a type field, which contains the fully qualified class name of its corresponding OnlineStore class.

Additionally, the name of the config class must be the same as the OnlineStore class, with the Config suffix.

An example of the config class for MySQL:

feast_custom_online_store/mysql.py
class MySQLOnlineStoreConfig(FeastConfigBaseModel):
    type: Literal["feast_custom_online_store.mysql.MySQLOnlineStore"] = "feast_custom_online_store.mysql.MySQLOnlineStore"

    host: Optional[StrictStr] = None
    user: Optional[StrictStr] = None
    password: Optional[StrictStr] = None
    database: Optional[StrictStr] = None

This configuration can be specified in the feature_store.yaml as follows:

feature_repo/feature_store.yaml
online_store:
    type: feast_custom_online_store.mysql.MySQLOnlineStore
    user: foo
    password: bar

This configuration information is available to the methods of the OnlineStore via the config: RepoConfig parameter, which is passed into all the methods of the OnlineStore interface; the online store configuration is specifically available at the config.online_store field of the config parameter.

feast_custom_online_store/mysql.py
def online_write_batch(
        self,
        config: RepoConfig,
        table: Union[FeatureTable, FeatureView],
        data: List[
            Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]
        ],
        progress: Optional[Callable[[int], Any]],
) -> None:

    online_store_config = config.online_store
    assert isinstance(online_store_config, MySQLOnlineStoreConfig)

    connection = mysql.connector.connect(
        host=online_store_config.host or "127.0.0.1",
        user=online_store_config.user or "root",
        password=online_store_config.password,
        database=online_store_config.database or "feast",
        autocommit=True
    )

3. Using the custom online store

After implementing both these classes, the custom online store can be used by referencing it in a feature repo's feature_store.yaml file, specifically in the online_store field. The value specified should be the fully qualified class name of the OnlineStore.

As long as your OnlineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

To use our MySQL online store, we can use the following feature_store.yaml:

feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
online_store: 
    type: feast_custom_online_store.mysql.MySQLOnlineStore
    user: foo
    password: bar

If additional configuration for the online store is not required, then we can omit the other fields and only specify the type of the online store class as the value for the online_store.

feature_repo/feature_store.yaml
project: test_custom
registry: data/registry.db
provider: local
online_store: feast_custom_online_store.mysql.MySQLOnlineStore

4. Testing the OnlineStore class

Even if you have created the OnlineStore class in a separate repo, you can still test your implementation against the Feast test suite, as long as you have Feast as a submodule in your repo. In the Feast submodule, we can run all the unit tests with:

make test

The universal tests, which are integration tests specifically intended to test offline and online stores, can be run with:

make test-python-universal

The unit tests should succeed, but the universal tests will likely fail. The tests are parametrized based on the FULL_REPO_CONFIGS variable defined in sdk/python/tests/integration/feature_repos/repo_configuration.py. To overwrite these configurations, you can simply create your own file that contains a FULL_REPO_CONFIGS, and point Feast to that file by setting the environment variable FULL_REPO_CONFIGS_MODULE to point to that file. In this repo, the file that overwrites FULL_REPO_CONFIGS is feast_custom_online_store/feast_tests.py, so you would run

export FULL_REPO_CONFIGS_MODULE='feast_custom_online_store.feast_tests'
make test-python-universal

to test the MySQL online store against the Feast universal tests. You should notice that some of the tests actually fail; this indicates that there is a mistake in the implementation of this online store!

Push

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Description

Push sources can be used by multiple feature views. When data is pushed to a push source, Feast propagates the feature values to all the consuming feature views.

Push sources must have a batch source specified, since that's the source used when retrieving historical features. When using a PushSource as a stream source in the definition of a feature view, a batch source doesn't need to be specified in the definition explicitly.

Stream sources

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

  1. Raw events come in (stream 1)

  2. Streaming transformations applied (e.g. generating features like last_N_purchased_categories) (stream 2)

  3. Write stream 2 values to an offline store as a historical log for training

  4. Write stream 2 values to an online store for low latency feature serving

  5. Periodically materialize feature values from the offline store into the online store for improved correctness

Feast now allows users to push features previously registered in a feature view to the online store for fresher features.

Example

Defining a push source

Note that the push schema needs to also include the entity.

from feast import PushSource, ValueType, BigQuerySource, FeatureView, Feature, Field
from feast.types import Int64

push_source = PushSource(
    name="push_source",
    batch_source=BigQuerySource(table="test.test"),
)

fv = FeatureView(
    name="feature view",
    entities=["user_id"],
    schema=[Field(name="life_time_value", dtype=Int64)],
    source=push_source,
)

Pushing data

from feast import FeatureStore
import pandas as pd

fs = FeatureStore(...)
feature_data_frame = pd.DataFrame()
fs.push("push_source_name", feature_data_frame)

File

Description

File data sources allow for the retrieval of historical feature values from files on disk for building training datasets, as well as for materializing features into an online store.

FileSource is meant for development purposes only and is not optimized for production use.

Example

from feast import FileSource
from feast.data_format import ParquetFormat

parquet_file_source = FileSource(
    file_format=ParquetFormat(),
    path="file:///feast/customer.parquet",
)

File

Description

  • Only Parquet files are currently supported.

  • All data is downloaded and joined using Python and may not scale to production workloads.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: file

Feast makes adding support for a new online store (database) easy. Developers can simply implement the OnlineStore interface to add support for a new store (other than the existing stores like Redis, DynamoDB, SQLite, and Datastore).

The full working code for this guide can be found at feast-dev/feast-custom-online-store-demo.

To facilitate configuration, all OnlineStore implementations are required to also define a corresponding OnlineStoreConfig class in the same file. This OnlineStoreConfig class should inherit from the FeastConfigBaseModel class, which is defined .

The FeastConfigBaseModel is a pydantic class, which parses yaml configuration into python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.

Push sources allow feature values to be pushed to the online store in real time. This allows fresh feature values to be made available to applications. Push sources supersede FeatureStore.write_to_online_store.

See also the Python feature server for instructions on how to push data to a deployed feature server.

Configuration options are available .

The File offline store provides support for reading FileSources.

Configuration options are available .

OnlineStore
feast-dev/feast-custom-online-store-demo
here
pydantic
FeatureStore.write_to_online_store
Python feature server
here
FileSources
here

Spark (contrib)

Description

NOTE: The Spark data source API is currently in alpha development and is not completely stable. The API may change or update in the future.

The Spark data source API allows for the retrieval of historical feature values from file/database sources for building training datasets as well as materializing features into an online store.

  • Either a table name, a SQL query, or a file path can be provided.

Examples

Using a table reference from SparkSession (for example, either in memory or a Hive Metastore)

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    table="FEATURE_TABLE",
)

Using a query

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM spark_table",
)

Using a file reference

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

PostgreSQL (contrib)

Description

NOTE: The Postgres plugin is a contrib plugin. This means it may not be fully stable.

The PostgreSQL data source allows for the retrieval of historical feature values from a PostgreSQL database for building training datasets as well as materializing features into an online store.

Examples

Defining a Postgres source

from feast.infra.offline_stores.contrib.postgres_offline_store.postgres_source import (
    PostgreSQLSource,
)

driver_stats_source = PostgreSQLSource(
    name="feast_driver_hourly_stats",
    query="SELECT * FROM feast_driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

Offline stores

BigQuery

Description

  • BigQuery tables and views are allowed as sources.

  • All joins happen within BigQuery.

  • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be uploaded to BigQuery in order to complete join operations.
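
For example, if the entity dataframe already lives in BigQuery, it can be passed to get_historical_features as a SQL query instead of a Pandas dataframe, avoiding the upload step; the table and feature names below are illustrative:

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# The entity dataframe is expressed as a BigQuery SQL query, so the
# point-in-time join happens entirely inside BigQuery.
training_df = store.get_historical_features(
    entity_df="""
        SELECT driver_id, event_timestamp
        FROM `my_project.my_dataset.entity_rows`
    """,
    features=["driver_hourly_stats:conv_rate"],
).to_df()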

Example

BigQuery

Description

BigQuery data sources allow for the retrieval of historical feature values from BigQuery for building training datasets as well as materializing features into an online store.

  • Either a table reference or a SQL query can be provided.

  • No performance guarantees can be provided over SQL query-based sources. Please use table references where possible.

Examples

Using a table reference

Using a query

Please see Offline Store for an explanation of offline stores.

The BigQuery offline store provides support for reading BigQuerySources.

A BigQueryRetrievalJob is returned when calling get_historical_features().

Configuration options are available .

Configuration options are available .

Offline Store
File
Snowflake
BigQuery
Redshift
Spark (contrib)
PostgreSQL (contrib)
feature_store.yaml
project: my_feature_repo
registry: gs://my-bucket/data/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_bq_dataset
from feast import BigQuerySource

my_bigquery_source = BigQuerySource(
    table_ref="gcp_project:bq_dataset.bq_table",
)
from feast import BigQuerySource

BigQuerySource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM `my_project.my_dataset.my_features`",
)
BigQuerySources
BigQueryRetrievalJob
here
here

Spark (contrib)

Description

Disclaimer

This Spark offline store still does not achieve full test coverage and continues to fail some integration tests when integrating with the feast universal test suite. Please do NOT assume complete stability of the API.

  • Spark tables and views are allowed as sources that are loaded in from some Spark store (e.g. in Hive or in memory).

  • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be converted to a Spark dataframe and processed as a temporary view.

  • A SparkRetrievalJob is returned when calling get_historical_features().

    • This allows you to call

      • to_df to retrieve the pandas dataframe.

      • to_arrow to retrieve the dataframe as a pyarrow Table.

      • to_spark_df to retrieve the dataframe as a Spark dataframe.

Example

feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
        spark.ui.enabled: "false"
        spark.eventLog.enabled: "false"
        spark.sql.catalogImplementation: "hive"
        spark.sql.parser.quotedRegexColumnNames: "true"
        spark.sql.session.timeZone: "UTC"
online_store:
    path: data/online_store.db

The Spark offline store is an offline store currently in alpha development that provides support for reading SparkSources.

SparkSources

Snowflake

Description

  • Snowflake tables and views are allowed as sources.

  • All joins happen within Snowflake.

  • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be uploaded to Snowflake in order to complete join operations.

  • A SnowflakeRetrievalJob is returned when calling get_historical_features().

    • This allows you to call

      • to_snowflake to save the dataset into Snowflake

      • to_sql to get the SQL query that would execute on to_df

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: snowflake.offline
  account: snowflake_deployment.us-east-1
  user: user_login
  password: user_password
  role: sysadmin
  warehouse: demo_wh
  database: FEAST

The Snowflake offline store provides support for reading SnowflakeSources.

to_arrow_chunks to get the result in batches (see the Snowflake python connector docs)

Configuration options are available in SnowflakeOfflineStoreConfig.

SnowflakeSources
Snowflake python connector docs
SnowflakeOfflineStoreConfig

Datastore

Description

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: gcp
online_store:
  type: datastore
  project_id: my_gcp_project
  namespace: my_datastore_namespace

The Datastore online store provides support for materializing feature values into Cloud Datastore. The data model used to store feature values in Datastore is described in more detail here.

Configuration options are available .

Datastore
here
here

SQLite

Description

  • All feature values are stored in an on-disk SQLite database

  • Only the latest feature values are persisted

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: sqlite
  path: data/online_store.db

PostgreSQL (contrib)

Description

DISCLAIMER: This PostgreSQL offline store still does not achieve full test coverage.

  • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be converted to a Spark dataframe and processed as a temporary view.

  • A PostgreSQLRetrievalJob is returned when calling get_historical_features().

    • This allows you to call

      • to_df to retrieve the pandas dataframe.

      • to_arrow to retrieve the dataframe as a PyArrow table.

      • to_sql to get the SQL query used to pull the features.

  • sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional

Example

The SQLite online store provides support for materializing feature values into a SQLite database for serving online features.

Configuration options are available .

The PostgreSQL offline store provides support for reading PostgreSQL data sources.

SQLite
here
feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: postgres
  host: DB_HOST
  port: DB_PORT
  database: DB_NAME
  db_schema: DB_SCHEMA
  user: DB_USERNAME
  password: DB_PASSWORD
  sslmode: verify-ca
  sslkey_path: /path/to/client-key.pem
  sslcert_path: /path/to/client-cert.pem
  sslrootcert_path: /path/to/server-ca.pem
online_store:
    path: data/online_store.db
PostgreSQL

DynamoDB

Description

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2

Permissions

Feast requires the following permissions in order to execute commands for DynamoDB online store:

Command

Permissions

Resources

Apply

dynamodb:CreateTable

dynamodb:DescribeTable

dynamodb:DeleteTable

arn:aws:dynamodb:<region>:<account_id>:table/*

Materialize

dynamodb.BatchWriteItem

arn:aws:dynamodb:<region>:<account_id>:table/*

Get Online Features

dynamodb.BatchGetItem

arn:aws:dynamodb:<region>:<account_id>:table/*

The following inline policy can be used to grant Feast the necessary permissions:

{
    "Statement": [
        {
            "Action": [
                "dynamodb:CreateTable",
                "dynamodb:DescribeTable",
                "dynamodb:DeleteTable",
                "dynamodb:BatchWriteItem",
                "dynamodb:BatchGetItem"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:dynamodb:<region>:<account_id>:table/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Online stores

The DynamoDB online store provides support for materializing feature values into AWS DynamoDB.

Configuration options are available .

Lastly, this IAM role needs to be associated with the desired Redshift cluster. Please follow the official AWS guide for the necessary steps.

Please see Online Store for an explanation of online stores.

DynamoDB
here
here
Online Store
SQLite
Redis
Datastore
DynamoDB
PostgreSQL (contrib)

PostgreSQL (contrib)

Description

The PostgreSQL online store provides support for materializing feature values into a PostgreSQL database for serving online features.

  • Only the latest feature values are persisted

  • sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: postgres
    host: DB_HOST
    port: DB_PORT
    database: DB_NAME
    db_schema: DB_SCHEMA
    user: DB_USERNAME
    password: DB_PASSWORD
    sslmode: verify-ca
    sslkey_path: /path/to/client-key.pem
    sslcert_path: /path/to/client-cert.pem
    sslrootcert_path: /path/to/server-ca.pem

Configuration options are available .

here

Redshift

Description

  • Redshift tables and views are allowed as sources.

  • All joins happen within Redshift.

  • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe. Pandas dataframes will be uploaded to Redshift in order to complete join operations.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: feast-cluster
  database: feast-database
  user: redshift-user
  s3_staging_location: s3://feast-bucket/redshift
  iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role

Permissions

Feast requires the following permissions in order to execute commands for Redshift offline store:

Command

Permissions

Resources

Apply

redshift-data:DescribeTable

redshift:GetClusterCredentials

arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>

arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>

arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>

Materialize

redshift-data:ExecuteStatement

arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>

Materialize

redshift-data:DescribeStatement

*

Materialize

s3:ListBucket

s3:GetObject

s3:DeleteObject

arn:aws:s3:::<bucket_name>

arn:aws:s3:::<bucket_name>/*

Get Historical Features

redshift-data:ExecuteStatement

redshift:GetClusterCredentials

arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>

arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>

arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>

Get Historical Features

redshift-data:DescribeStatement

*

Get Historical Features

s3:ListBucket

s3:GetObject

s3:PutObject

s3:DeleteObject

arn:aws:s3:::<bucket_name>

arn:aws:s3:::<bucket_name>/*

The following inline policy can be used to grant Feast the necessary permissions:

{
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<bucket_name>/*",
                "arn:aws:s3:::<bucket_name>"
            ]
        },
        {
            "Action": [
                "redshift-data:DescribeTable",
                "redshift:GetClusterCredentials",
                "redshift-data:ExecuteStatement"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>",
                "arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>",
                "arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>"
            ]
        },
        {
            "Action": [
                "redshift-data:DescribeStatement"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}

The following inline policy can be used to grant Redshift necessary permissions to access S3:

{
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::feast-integration-tests",
                "arn:aws:s3:::feast-integration-tests/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

The following trust relationship is necessary to make sure that Redshift, and only Redshift, can assume this role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The Redshift offline store provides support for reading RedshiftSources.

A RedshiftRetrievalJob is returned when calling get_historical_features().

Configuration options are available here.

In addition to this, the Redshift offline store requires an IAM role that will be used by Redshift itself to interact with S3. More concretely, Redshift has to use this IAM role to run UNLOAD and COPY commands. Once created, this IAM role needs to be configured in the feature_store.yaml file as offline_store: iam_role.

Local

Description

  • Offline Store: Uses the File offline store by default. Also supports BigQuery as the offline store.

  • Online Store: Uses the Sqlite online store by default. Also supports Redis and Datastore as online stores.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local

Providers

Redis

Description

  • Both Redis and Redis Cluster are supported

Examples

Connecting to a single Redis instance

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"

Connecting to a Redis Cluster with SSL enabled and password authentication

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  redis_type: redis_cluster
  connection_string: "redis1:6379,redis2:6379,ssl=true,password=my_password"

Please see Provider for an explanation of providers.

The Redis online store provides support for materializing feature values into Redis.

The data model used to store feature values in Redis is described in more detail here.

Configuration options are available here.

Google Cloud Platform

Description

  • Offline Store: Uses the BigQuery offline store by default. Also supports File as the offline store.

  • Online Store: Uses the Datastore online store by default. Also supports Sqlite as an online store.

Example

feature_store.yaml
project: my_feature_repo
registry: gs://my-bucket/data/registry.db
provider: gcp

Permissions

Command: Apply
Component: BigQuery (source)
Permissions: bigquery.jobs.create, bigquery.readsessions.create, bigquery.readsessions.getData
Recommended Role: roles/bigquery.user

Command: Apply
Component: Datastore (destination)
Permissions: datastore.entities.allocateIds, datastore.entities.create, datastore.entities.delete, datastore.entities.get, datastore.entities.list, datastore.entities.update
Recommended Role: roles/datastore.owner

Command: Materialize
Component: BigQuery (source)
Permissions: bigquery.jobs.create
Recommended Role: roles/bigquery.user

Command: Materialize
Component: Datastore (destination)
Permissions: datastore.entities.allocateIds, datastore.entities.create, datastore.entities.delete, datastore.entities.get, datastore.entities.list, datastore.entities.update, datastore.databases.get
Recommended Role: roles/datastore.owner

Command: Get Online Features
Component: Datastore
Permissions: datastore.entities.get
Recommended Role: roles/datastore.user

Command: Get Historical Features
Component: BigQuery (source)
Permissions: bigquery.datasets.get, bigquery.tables.get, bigquery.tables.create, bigquery.tables.updateData, bigquery.tables.update, bigquery.tables.delete, bigquery.tables.getData
Recommended Role: roles/bigquery.dataEditor

Amazon Web Services

Description

  • Offline Store: Uses the Redshift offline store by default. Also supports File as the offline store.

  • Online Store: Uses the DynamoDB online store by default. Also supports Sqlite as an online store.

Example

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: feast-cluster
  database: feast-database
  user: redshift-user
  s3_staging_location: s3://feast-bucket/redshift
  iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role

Feature servers

Feast users can choose to retrieve features from a feature server, as opposed to through the Python SDK.

Python feature server

Feature repository

Feast users use Feast to manage two important sets of configuration:

  • Configuration about how to run Feast on your infrastructure

  • Feature definitions

With Feast, the above configuration can be written declaratively and stored as code in a central location. This central location is called a feature repository. The feature repository is the declarative source of truth for what the desired state of a feature store should be.

The Feast CLI uses the feature repository to configure, deploy, and manage your feature store.

What is a feature repository?

A feature repository consists of:

  • A collection of Python files containing feature declarations.

  • A feature_store.yaml file containing infrastructural configuration.

  • A .feastignore file containing paths in the feature repository to ignore.

Typically, users store their feature repositories in a Git repository, especially when working in teams. However, using Git is not a requirement.

Structure of a feature repository

The structure of a feature repository is as follows:

  • The root of the repository should contain a feature_store.yaml file and may contain a .feastignore file.

  • The repository should contain Python files that contain feature definitions.

  • The repository can contain other files as well, including documentation and potentially data files.

An example structure of a feature repository is shown below:

$ tree -a
.
├── data
│   └── driver_stats.parquet
├── driver_features.py
├── feature_store.yaml
└── .feastignore

1 directory, 4 files

A couple of things to note about the feature repository:

  • Feast reads all Python files recursively when feast apply is run, including subdirectories, even if they don't contain feature definitions.

  • It's recommended to add a .feastignore file and list the paths of any imperative scripts if you need to store them inside the feature repository.

The feature_store.yaml configuration file

The configuration for a feature store is stored in a file named feature_store.yaml , which must be located at the root of a feature repository. An example feature_store.yaml file is shown below:

feature_store.yaml
project: my_feature_repo_1
registry: data/metadata.db
provider: local
online_store:
    path: data/online_store.db

The .feastignore file

This file contains paths that should be ignored when running feast apply. An example .feastignore is shown below:

.feastignore
# Ignore virtual environment
venv

# Ignore a specific Python file
scripts/foo.py

# Ignore all Python files directly under scripts directory
scripts/*.py

# Ignore all "foo.py" anywhere under scripts directory
scripts/**/foo.py

Feature definitions

A feature repository can also contain one or more Python files that contain feature definitions. An example feature definition file is shown below:

driver_features.py
from datetime import timedelta

from feast import BigQuerySource, Entity, Feature, FeatureView, Field, ValueType
from feast.types import Float32, String

driver_locations_source = BigQuerySource(
    table_ref="rh_prod.ride_hailing_co.drivers",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

driver = Entity(
    name="driver",
    value_type=ValueType.INT64,
    description="driver id",
)

driver_locations = FeatureView(
    name="driver_locations",
    entities=["driver"],
    ttl=timedelta(days=1),
    schema=[
        Field(name="lat", dtype=Float32),
        Field(name="lon", dtype=String),
    ],
    source=driver_locations_source,
)

Next steps

The feature_store.yaml file configures how the feature store should run. See feature_store.yaml for more details.

See .feastignore for more details.

To declare new feature definitions, just add code to the feature repository, either in existing files or in a new file. For more information on how to define features, see Feature Views.

See Create a feature repository to get started with an example feature repository.

See feature_store.yaml, .feastignore, or Feature Views for more information on the configuration files that live in a feature repository.

.feastignore

Overview

.feastignore
# Ignore virtual environment
venv

# Ignore a specific Python file
scripts/foo.py

# Ignore all Python files directly under scripts directory
scripts/*.py

# Ignore all "foo.py" anywhere under scripts directory
scripts/**/foo.py

The .feastignore file is optional. If the file cannot be found, every Python file in the feature repository directory will be parsed by feast apply.

Feast Ignore Patterns

Pattern: venv
Example matches: venv/foo.py, venv/a/foo.py
Explanation: You can specify a path to a specific directory. Everything in that directory will be ignored.

Pattern: scripts/foo.py
Example matches: scripts/foo.py
Explanation: You can specify a path to a specific file. Only that file will be ignored.

Pattern: scripts/*.py
Example matches: scripts/foo.py, scripts/bar.py
Explanation: You can specify an asterisk (*) anywhere in the expression. An asterisk matches zero or more characters, except "/".

Pattern: scripts/**/foo.py
Example matches: scripts/foo.py, scripts/a/foo.py, scripts/a/b/foo.py
Explanation: You can specify a double asterisk (**) anywhere in the expression. A double asterisk matches zero or more directories.

feature_store.yaml

Overview

feature_store.yaml
project: loyal_spider
registry: data/registry.db
provider: local
online_store:
    type: sqlite
    path: data/online_store.db

Options

The following top-level configuration options exist in the feature_store.yaml file.

  • provider — Configures the environment in which Feast will deploy and operate.

  • registry — Configures the location of the feature registry.

  • online_store — Configures the online store.

  • offline_store — Configures the offline store.

  • project — Defines a namespace for the entire feature store. Can be used to isolate multiple deployments in a single installation of Feast. Should only contain letters, numbers, and underscores.

.feastignore is a file that is placed at the root of the feature repository. This file contains paths that should be ignored when running feast apply; an example .feastignore is shown above.

feature_store.yaml is used to configure a feature store. The file must be located at the root of a feature repository; an example feature_store.yaml is shown above.

Please see the RepoConfig API reference for the full list of configuration options.

Python feature server

Overview

The feature server is an HTTP endpoint that serves features with JSON I/O. This enables users to write + read features from Feast online stores using any programming language that can make HTTP requests.

CLI

There is a CLI command that starts the server: feast serve. By default, Feast uses port 6566; the port can be overridden with the --port flag.

Deploying as a service

Example

Initializing a feature server

Here's the local feature server usage example with the local template:

$ feast init feature_repo
Creating a new Feast repository in /home/tsotne/feast/feature_repo.

$ cd feature_repo

$ feast apply
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats

$ feast materialize-incremental $(date +%Y-%m-%d)
Materializing 1 feature views to 2021-09-09 17:00:00-07:00 into the sqlite online store.

driver_hourly_stats from 2021-09-09 16:51:08-07:00 to 2021-09-09 17:00:00-07:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 295.24it/s]

$ feast serve
This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.
INFO:     Started server process [8889]
09/10/2021 10:42:11 AM INFO:Started server process [8889]
INFO:     Waiting for application startup.
09/10/2021 10:42:11 AM INFO:Waiting for application startup.
INFO:     Application startup complete.
09/10/2021 10:42:11 AM INFO:Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:6566 (Press CTRL+C to quit)
09/10/2021 10:42:11 AM INFO:Uvicorn running on http://127.0.0.1:6566 (Press CTRL+C to quit)

Retrieving features from the online store

After the server starts, we can execute cURL commands from another terminal tab:

$  curl -X POST \
  "http://localhost:6566/get-online-features" \
  -d '{
    "features": [
      "driver_hourly_stats:conv_rate",
      "driver_hourly_stats:acc_rate",
      "driver_hourly_stats:avg_daily_trips"
    ],
    "entities": {
      "driver_id": [1001, 1002, 1003]
    }
  }' | jq
{
  "metadata": {
    "feature_names": [
      "driver_id",
      "conv_rate",
      "avg_daily_trips",
      "acc_rate"
    ]
  },
  "results": [
    {
      "values": [
        1001,
        0.7037263512611389,
        308,
        0.8724706768989563
      ],
      "statuses": [
        "PRESENT",
        "PRESENT",
        "PRESENT",
        "PRESENT"
      ],
      "event_timestamps": [
        "1970-01-01T00:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z"
      ]
    },
    {
      "values": [
        1002,
        0.038169607520103455,
        332,
        0.48534533381462097
      ],
      "statuses": [
        "PRESENT",
        "PRESENT",
        "PRESENT",
        "PRESENT"
      ],
      "event_timestamps": [
        "1970-01-01T00:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z"
      ]
    },
    {
      "values": [
        1003,
        0.9665873050689697,
        779,
        0.7793770432472229
      ],
      "statuses": [
        "PRESENT",
        "PRESENT",
        "PRESENT",
        "PRESENT"
      ],
      "event_timestamps": [
        "1970-01-01T00:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z",
        "2021-12-31T23:00:00Z"
      ]
    }
  ]
}
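The same request can also be issued from Python rather than cURL; a minimal sketch using the requests library, mirroring the request body shown above:

import json
import requests

online_request = {
    "features": [
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    "entities": {"driver_id": [1001, 1002, 1003]},
}

# POST the JSON payload to the local feature server and print the response.
response = requests.post(
    "http://localhost:6566/get-online-features",
    data=json.dumps(online_request),
)
print(json.dumps(response.json(), indent=2))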

It's also possible to specify a feature service name instead of the list of features:

curl -X POST \
  "http://localhost:6566/get-online-features" \
  -d '{
    "feature_service": <feature-service-name>,
    "entities": {
      "driver_id": [1001, 1002, 1003]
    }
  }' | jq

Pushing features to the online store

You can push data corresponding to a push source to the online store (note that timestamps need to be strings):

curl -X POST "http://localhost:6566/push" -d '{
    "push_source_name": "driver_hourly_stats_push_source",
    "df": {
            "driver_id": [1001],
            "event_timestamp": ["2022-05-13 10:59:42"],
            "created": ["2022-05-13 10:59:42"],
            "conv_rate": [1.0],
            "acc_rate": [1.0],
            "avg_daily_trips": [1000]
    }
  }' | jq

or equivalently from Python:

import json
import requests
import pandas as pd
from datetime import datetime

event_dict = {
    "driver_id": [1001],
    "event_timestamp": [str(datetime(2021, 5, 13, 10, 59, 42))],
    "created": [str(datetime(2021, 5, 13, 10, 59, 42))],
    "conv_rate": [1.0],
    "acc_rate": [1.0],
    "avg_daily_trips": [1000],
    "string_feature": "test2",
}
push_data = {
    "push_source_name":"driver_stats_push_source",
    "df":event_dict
}
requests.post(
    "http://localhost:6566/push", 
    data=json.dumps(push_data))

One can also deploy a feature server by building a docker image that bundles in the project's feature_store.yaml. See the helm chart for an example.

A remote feature server on AWS Lambda is available. A remote feature server on GCP Cloud Run is currently being developed.

Go-based feature retrieval

Overview

Currently, this component only supports online serving. It does not have an offline component, including APIs to create Feast feature repositories or apply configuration to the registry to facilitate online materialization, and it does not expose its own dedicated CLI to perform Feast actions. It is only meant to expose an online serving API that can be called through the Python SDK to facilitate faster online feature retrieval.

Installation

As long as you are running macOS or Linux on x86, with Python 3.7-3.10, the Go component comes pre-compiled when you install Feast.

However, some additional dependencies are required for Go <-> Python interoperability. To install these dependencies run the following command in your console:

pip install feast[go]

For developers, if you want to build from source, run make compile-go-lib to build and compile the go server.

Usage

To enable the Go online feature retrieval component, set go_feature_retrieval: True in your feature_store.yaml. This will direct all online feature retrieval to Go instead of Python. This flag will be enabled by default in the future.

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
go_feature_retrieval: True
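With the flag enabled, online retrieval through the Python SDK looks the same as before; the sketch below is illustrative only (the feature references and entity key are placeholders based on the quickstart template):

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# With go_feature_retrieval enabled, this lookup is served by the embedded Go component.
online_features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(online_features)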

Future/Current Work

We also plan on adding support for the Java feature server (e.g. the capability to call into the Go component and execute Java UDFs).

Feast CLI reference

Overview

Global Options

The Feast CLI provides one global top-level option that can be used with other commands

chdir (-c, --chdir)

This command allows users to run Feast CLI commands in a different folder from the current working directory.

Apply

Creates or updates a feature store deployment

What does Feast apply do?

  1. Feast will scan Python files in your feature repository and find all Feast object definitions, such as feature views, entities, and data sources.

  2. Feast will validate your feature definitions (e.g. for uniqueness of features)

  3. Feast will sync the metadata about Feast objects to the registry. If a registry does not exist, then it will be instantiated. The standard registry is a simple protobuf binary file that is stored on disk (locally or in an object store).

  4. Feast CLI will create all necessary feature store infrastructure. The exact infrastructure that is deployed or configured depends on the provider configuration that you have set in feature_store.yaml. For example, setting local as your provider will result in a sqlite online store being created.

feast apply (when configured to use cloud provider like gcp or aws) will create cloud infrastructure. This may incur costs.

Entities

List all registered entities

Feature views

List all registered feature views

Init

Creates a new feature repository

It's also possible to use other templates

or to set the name of the new project

Materialize

Load data from feature views into the online store between two dates

Load data for specific feature views into the online store between two dates

Materialize incremental

Load data from feature views into the online store, beginning from either the previous materialize or materialize-incremental end date, or the beginning of time.

Teardown

Tear down deployed feature store infrastructure

Version

Print the current Feast version

The Go Feature Retrieval component is a Go implementation of the core feature serving logic, embedded in the Python SDK. It supports retrieval of feature references, feature services, and on demand feature views, and can be used either through the Python SDK or the Python feature server.

The Go Feature Retrieval component currently only supports Redis and Sqlite as online stores; support for other online stores will be added soon. Initial benchmarks indicate that it is significantly faster than the Python feature server for online feature retrieval. We plan to release a more comprehensive set of benchmarks. For more details, see the RFC.

Online feature logging for Data Quality Monitoring with the Go feature retrieval component is currently in development. More information can be found here.

The Feast CLI comes bundled with the Feast Python package. It is immediately available after installing Feast.
Usage: feast [OPTIONS] COMMAND [ARGS]...

  Feast CLI

  For more information, see our public docs at https://docs.feast.dev/

  For any questions, you can reach us at https://slack.feast.dev/

Options:
  -c, --chdir TEXT  Switch to a different feature repository directory before
                    executing the given subcommand.

  --help            Show this message and exit.

Commands:
  apply                    Create or update a feature store deployment
  entities                 Access entities
  feature-views            Access feature views
  init                     Create a new Feast repository
  materialize              Run a (non-incremental) materialization job to...
  materialize-incremental  Run an incremental materialization job to ingest...
  registry-dump            Print contents of the metadata registry
  teardown                 Tear down deployed feature store infrastructure
  version                  Display Feast SDK version
feast -c path/to/my/feature/repo apply
feast apply
feast entities list
NAME       DESCRIPTION    TYPE
driver_id  driver id      ValueType.INT64
feast feature-views list
NAME                 ENTITIES
driver_hourly_stats  ['driver_id']
feast init my_repo_name
Creating a new Feast repository in /projects/my_repo_name.
.
├── data
│   └── driver_stats.parquet
├── example.py
└── feature_store.yaml
feast init -t gcp my_feature_repo
feast init -t gcp my_feature_repo
feast materialize 2020-01-01T00:00:00 2022-01-01T00:00:00
feast materialize -v driver_hourly_stats 2020-01-01T00:00:00 2022-01-01T00:00:00
Materializing 1 feature views from 2020-01-01 to 2022-01-01

driver_hourly_stats:
100%|██████████████████████████| 5/5 [00:00<00:00, 5949.37it/s]
feast materialize-incremental 2022-01-01T00:00:00
feast teardown
feast version

[Alpha] Web UI

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

Overview

The Feast Web UI allows users to explore their feature repository through a Web UI. It includes functionality such as:

  • Browsing Feast objects (feature views, entities, data sources, feature services, and saved datasets) and their relationships

  • Searching and filtering for Feast objects by tags

Usage

There are several ways to use the Feast Web UI.

Feast CLI

The easiest way to get started is to run the feast ui command within a feature repository:

Output of feast ui --help:

Usage: feast ui [OPTIONS]

Shows the Feast UI over the current directory

Options:
-h, --host TEXT                 Specify a host for the server [default: 0.0.0.0]
-p, --port INTEGER              Specify a port for the server [default: 8888]
-r, --registry_ttl_sec INTEGER  Number of seconds after which the registry is refreshed. Default is 5 seconds.
--help                          Show this message and exit.

This will spin up a Web UI on localhost which automatically refreshes its view of the registry every registry_ttl_sec seconds.

Importing as a module to integrate with an existing React App

This is the recommended way to use Feast UI for teams maintaining their own internal UI for their deployment of Feast.

Start with bootstrapping a React app with create-react-app

npx create-react-app your-feast-ui

Then, in your app folder, install Feast UI and its peer dependencies. Assuming you use yarn

yarn add @feast-dev/feast-ui
yarn add @elastic/eui @elastic/datemath @emotion/react moment prop-types inter-ui react-query react-router-dom use-query-params zod typescript query-string d3 @types/d3

Edit index.js in the React app to use Feast UI.

import React from "react";
import ReactDOM from "react-dom";
import "./index.css";

import FeastUI from "@feast-dev/feast-ui";
import "@feast-dev/feast-ui/dist/feast-ui.css";

ReactDOM.render(
  <React.StrictMode>
    <FeastUI />
  </React.StrictMode>,
  document.getElementById("root")
);

When you start the React app, it will look for project-list.json to find a list of your projects. The JSON should look something like this.

{
  "projects": [
    {
      "name": "Credit Score Project",
      "description": "Project for credit scoring team and associated models.",
      "id": "credit_score_project",
      "registryPath": "/registry.json"
    }
  ]
}

Then start the React App

yarn start

Customization

The advantage of importing Feast UI as a module is the ease of customization. The <FeastUI> component exposes a feastUIConfigs prop through which you can customize the UI. Currently it supports a few parameters.

Fetching the Project List

You can use projectListPromise to provide a promise that overrides where the Feast UI fetches the project list from.

<FeastUI
  feastUIConfigs={{
    projectListPromise: fetch(SOME_PATH, {
      headers: {
        "Content-Type": "application/json",
      },
    }).then((res) => {
      return res.json();
    })
  }}
/>

Custom Tabs

You can add custom tabs for any of the core Feast objects through the tabsRegistry.

const tabsRegistry = {
  RegularFeatureViewCustomTabs: [
    {
      label: "Custom Tab Demo", // Navigation Label for the tab
      path: "demo-tab", // Subpath for the tab
      Component: RFVDemoCustomTab, // a React Component
    },
  ]
}

<FeastUI
  feastUIConfigs={{
    tabsRegistry: tabsRegistry,
  }}
/>

Examples of custom tabs can be found in the ui/custom-tabs folder.

[Alpha] Data quality monitoring

Data Quality Monitoring (DQM) is a Feast module aimed at helping users validate their data with a user-curated set of rules. Validation could be applied during:

  • Historical retrieval (training dataset generation)

  • [planned] Writing features into an online store

  • [planned] Reading features from an online store

Its goal is to address several complex data problems, namely:

  • Data consistency - new training datasets can be significantly different from previous datasets. This might require a change in model architecture.

  • Issues/bugs in the upstream pipeline - bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.

  • Training/serving skew - distribution shift could significantly decrease the performance of the model.

To monitor data quality, we check that the characteristics of the tested dataset (aka the tested dataset's profile) are "equivalent" to the characteristics of the reference dataset. How exactly profile equivalency should be measured is up to the user.

Overview

The validation process consists of the following steps:

  1. Validation of the tested dataset is performed with the reference dataset and profiler provided as parameters.

Preparations

Feast with Great Expectations support can be installed via

pip install 'feast[ge]'

Dataset profile

Great Expectations supports automatic profiling as well as manually specifying expectations:

from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite

from feast.dqm.profilers.ge_profiler import ge_profiler

@ge_profiler
def automatic_profiler(dataset: Dataset) -> ExpectationSuite:
    from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

    return UserConfigurableProfiler(
        profile_dataset=dataset,
        ignored_columns=['conv_rate'],
        value_set_threshold='few'
    ).build_suite()

However, in our experience the capabilities of the automatic profiler are quite limited, so we would recommend crafting your own expectations:

@ge_profiler
def manual_profiler(dataset: Dataset) -> ExpectationSuite:
    dataset.expect_column_max_to_be_between("column", 1, 2)
    return dataset.get_expectation_suite()

Validating Training Dataset

During retrieval of historical features, validation_reference can be passed as a parameter to the .to_df(validation_reference=...) or .to_arrow(validation_reference=...) methods of RetrievalJob. If the parameter is provided, Feast will run validation once the dataset is materialized. If validation is successful, the materialized dataset is returned. Otherwise, a feast.dqm.errors.ValidationFailed exception is raised, containing the details of all expectations that did not pass.

from feast import FeatureStore

fs = FeatureStore(".")

job = fs.get_historical_features(...)
job.to_df(
    validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=manual_profiler)
)
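As a minimal sketch of handling a failed validation (assuming the fs, job, and manual_profiler objects from the snippets above):

from feast.dqm.errors import ValidationFailed

try:
    training_df = job.to_df(
        validation_reference=fs
            .get_saved_dataset("my_reference_dataset")
            .as_reference(profiler=manual_profiler)
    )
except ValidationFailed as exc:
    # The exception carries the details of every expectation that did not pass.
    print(exc)
    raise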

[Alpha] AWS Lambda feature server

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

To enable this feature, run feast alpha enable aws_lambda_feature_server

Overview

Deployment

The AWS Lambda feature server is only available to projects using the AwsProvider with registries on S3. It is disabled by default. To enable it, feature_store.yaml must be modified; specifically, the enable flag must be on and an execution_role_name must be specified. For example, after running feast init -t aws, changing the registry to be on S3, and enabling the feature server, the contents of feature_store.yaml should look similar to the following:

If enabled, the feature server will be deployed during feast apply. After it is deployed, the feast endpoint CLI command will indicate the server's endpoint.

Permissions

Feast requires the following permissions in order to deploy and tear down the AWS Lambda feature server:

The following inline policy can be used to grant Feast the necessary permissions:

Example

After feature_store.yaml has been modified as described in the previous section, it can be deployed as follows:

After the feature server starts, we can execute cURL commands against it:

Contribution process

Usage

How Feast SDK usage is measured

The Feast project logs anonymous usage statistics and errors in order to inform our planning. Several client methods are tracked, beginning in Feast 0.9. Users are assigned a UUID which is sent along with the name of the method, the Feast version, the OS (using sys.platform), and the current time.

How to disable usage logging

Set the environment variable FEAST_USAGE to False.
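For example, one way to do this from Python is to set the variable before feast is imported (a minimal sketch; setting it in the shell environment works equally well):

import os

# Disable Feast usage logging for this process before importing feast.
os.environ["FEAST_USAGE"] = "False"

import feast  # imported after setting the environment variable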

[Alpha] On demand feature view

Warning: This is an experimental feature. It's intended for early testing and feedback, and could change without warnings in future releases.

To enable this feature, run feast alpha enable on_demand_transforms

Overview

On demand feature views allow users to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths.

Currently, these transformations are executed locally. Future milestones include building a Feature Transformation Server for executing transformations at higher scale.

CLI

There are new CLI commands:

  • feast on-demand-feature-views list lists all registered on demand feature views after feast apply is run

  • feast on-demand-feature-views describe [NAME] describes the definition of an on demand feature view

Example

Registering transformations

We register RequestSource inputs and the transform in on_demand_feature_view:

Feature retrieval

The on demand feature view's name is the function name (i.e. transformed_conv_rate).

And then to retrieve historical or online features, we can call this in a feature service or reference individual features:

Versioning policy

Versioning policies and status of Feast components

Versioning policy and branch workflow

Contributors are encouraged to understand our branch workflow, described below, when choosing where to branch for a change (and thus the merge base for a pull request).

  • Major and minor releases are cut from the master branch.

  • Each major and minor release has a long-lived maintenance branch, e.g., v0.3-branch. This is called a "release branch".

  • From the release branch, pre-release release candidates are tagged, e.g., v0.3.0-rc.1.

  • From the release candidates, the stable patch version releases are tagged, e.g., v0.3.0.

A release branch should be substantially feature complete with respect to the intended release. Code that is committed to master may be merged or cherry-picked on to a release branch, but code that is directly committed to a release branch should be solely applicable to that release (and should not be committed back to master).

In general, unless you're committing code that only applies to a particular release stream (for example, temporary hot-fixes, back-ported security fixes, or image hashes), you should base changes from master and then merge or cherry-pick to the release branch.

Feast Component Matrix

The following table shows the status (stable, beta, or alpha) of Feast components.

Application status indicators for Feast:

  • Stable means that the component has reached a sufficient level of stability and adoption that the Feast community has deemed the component stable. Please see the stability criteria below.

  • Beta means that the component is working towards a version 1.0 release. Beta does not mean a component is unstable, it simply means the component has not met the full criteria of stability.

  • Alpha means that the component is in the early phases of development and/or integration into Feast.

Criteria for reaching stable status:

  • Contributors from at least two organizations

  • Complete end-to-end test suite

  • Scalability and load testing if applicable

  • Automated release process (docker images, PyPI packages, etc)

  • API reference documentation

  • No deprecative changes

  • Must include logging and monitoring

Criteria for reaching beta status

  • Contributors from at least two organizations

  • End-to-end test suite

  • API reference documentation

  • Deprecative changes must span multiple minor versions and allow for an upgrade path.

Levels of support

Feast components have various levels of support based on the component status.

Support from the Feast community

Feast has an active and helpful community of users and contributors.

The Feast community offers support on a best-effort basis for stable and beta applications. Best-effort support means that there’s no formal agreement or commitment to solve a problem but the community appreciates the importance of addressing the problem as soon as possible. The community commits to helping you diagnose and address the problem if all the following are true:

  • The cause falls within the technical framework that Feast controls. For example, the Feast community may not be able to help if the problem is caused by a specific network configuration within your organization.

  • Community members can reproduce the problem.

  • The reporter of the problem can help with further diagnosis and troubleshooting.

User prepares reference dataset (currently only saved datasets from historical retrieval are supported).

User defines profiler function, which should produce a profile for a given dataset (currently only profilers based on Great Expectations are allowed).

Currently, Feast supports only Great Expectation's ExpectationSuite as the dataset's profile. Hence, the user needs to define a function (profiler) that would receive a dataset and return an ExpectationSuite.

The AWS Lambda feature server is an HTTP endpoint that serves features with JSON I/O, deployed as a Docker image through AWS Lambda and AWS API Gateway. This enables users to get features from Feast using any programming language that can make HTTP requests. A local feature server is also available. A remote feature server on GCP Cloud Run is currently being developed.

We use RFCs and GitHub issues to communicate development ideas. The simplest way to contribute to Feast is to leave comments in our RFCs in the Feast Google Drive or our GitHub issues. You will need to join our Google Group in order to get access.

We follow a process of lazy consensus. If you believe you know what the project needs then just start development. If you are unsure about which direction to take with development then please communicate your ideas through a GitHub issue or through our Slack Channel before starting development.

Please submit a PR to the master branch of the Feast repository once you are ready to submit your contribution. Code submissions to Feast (including submissions from project maintainers) require review and approval from maintainers or code owners.

PRs that are submitted by the general public need to be identified as ok-to-test. Once enabled, Prow will run a range of tests to verify the submission, after which community members will help to review the pull request.

See also Community for other ways to get involved with the community (e.g. joining community calls).

The source code is available here.

See https://github.com/feast-dev/on-demand-feature-views-demo for an example on how to use on demand feature views.

Feast uses semantic versioning.

Please see the Community page for channels through which support can be requested.
project: dev
registry: s3://feast/registries/dev
provider: aws
online_store:
  region: us-west-2
offline_store:
  cluster_id: feast
  region: us-west-2
  user: admin
  database: feast
  s3_staging_location: s3://feast/redshift/tests/staging_location
  iam_role: arn:aws:iam::{aws_account}:role/redshift_s3_access_role
flags:
  alpha_features: true
  aws_lambda_feature_server: true
feature_server:
  enabled: True
  execution_role_name: arn:aws:iam::{aws_account}:role/lambda_execution_role

Permissions: lambda:CreateFunction, lambda:GetFunction, lambda:DeleteFunction, lambda:AddPermission, lambda:UpdateFunctionConfiguration
Resources: arn:aws:lambda:<region>:<account_id>:function:feast-*

Permissions: ecr:CreateRepository, ecr:DescribeRepositories, ecr:DeleteRepository, ecr:PutImage, ecr:DescribeImages, ecr:BatchDeleteImage, ecr:CompleteLayerUpload, ecr:UploadLayerPart, ecr:InitiateLayerUpload, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:GetRepositoryPolicy, ecr:SetRepositoryPolicy, ecr:GetAuthorizationToken
Resources: *

Permissions: iam:PassRole
Resources: arn:aws:iam::<account_id>:role/<lambda-execution-role-name>

Permissions: apigateway:*
Resources: arn:aws:apigateway:*::/apis/*/routes/*/routeresponses, arn:aws:apigateway:*::/apis/*/routes/*/routeresponses/*, arn:aws:apigateway:*::/apis/*/routes/*, arn:aws:apigateway:*::/apis/*/routes, arn:aws:apigateway:*::/apis/*/integrations, arn:aws:apigateway:*::/apis/*/stages/*/routesettings/*, arn:aws:apigateway:*::/apis/*, arn:aws:apigateway:*::/apis

{
    "Statement": [
        {
            "Action": [
                "lambda:CreateFunction",
                "lambda:GetFunction",
                "lambda:DeleteFunction",
                "lambda:AddPermission",
                "lambda:UpdateFunctionConfiguration"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:lambda:<region>:<account_id>:function:feast-*"
        },
        {
            "Action": [
                "ecr:CreateRepository",
                "ecr:DescribeRepositories",
                "ecr:DeleteRepository",
                "ecr:PutImage",
                "ecr:DescribeImages",
                "ecr:BatchDeleteImage",
                "ecr:CompleteLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:SetRepositoryPolicy",
                "ecr:GetAuthorizationToken"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "iam:PassRole",
            "Effect": "Allow",
            "Resource": "arn:aws:iam::<account_id>:role/<lambda-execution-role-name>"
        },
        {
            "Effect": "Allow",
            "Action": "apigateway:*",
            "Resource": [
                "arn:aws:apigateway:*::/apis/*/routes/*/routeresponses",
                "arn:aws:apigateway:*::/apis/*/routes/*/routeresponses/*",
                "arn:aws:apigateway:*::/apis/*/routes/*",
                "arn:aws:apigateway:*::/apis/*/routes",
                "arn:aws:apigateway:*::/apis/*/integrations",
                "arn:aws:apigateway:*::/apis/*/stages/*/routesettings/*",
                "arn:aws:apigateway:*::/apis/*",
                "arn:aws:apigateway:*::/apis"
            ]
        }
    ],
    "Version": "2012-10-17"
}
$ feast apply
10/07/2021 03:57:26 PM INFO:Pulling remote image feastdev/feature-server-python-aws:aws:
10/07/2021 03:57:28 PM INFO:Creating remote ECR repository feast-python-server-key_shark-0_13_1_dev23_gb3c08320:
10/07/2021 03:57:29 PM INFO:Pushing local image to remote 402087665549.dkr.ecr.us-west-2.amazonaws.com/feast-python-server-key_shark-0_13_1_dev23_gb3c08320:0_13_1_dev23_gb3c08320:
10/07/2021 03:58:44 PM INFO:Deploying feature server...
10/07/2021 03:58:45 PM INFO:  Creating AWS Lambda...
10/07/2021 03:58:46 PM INFO:  Creating AWS API Gateway...
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats

$ feast endpoint
10/07/2021 03:59:01 PM INFO:Feature server endpoint: https://hkosgmz4m2.execute-api.us-west-2.amazonaws.com

$ feast materialize-incremental $(date +%Y-%m-%d)
Materializing 1 feature views to 2021-10-06 17:00:00-07:00 into the dynamodb online store.

driver_hourly_stats from 2020-10-08 23:01:34-07:00 to 2021-10-06 17:00:00-07:00:
100%|█████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 16.89it/s]
$ curl -X POST \                                 
    "https://hkosgmz4m2.execute-api.us-west-2.amazonaws.com/get-online-features" \
    -H "Content-type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "features": [
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
            "driver_hourly_stats:avg_daily_trips"
        ],
        "entities": {
            "driver_id": [1001, 1002, 1003]
        },
        "full_feature_names": true
    }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1346  100  1055  100   291   3436    947 --:--:-- --:--:-- --:--:--  4370
{
  "field_values": [
    {
      "fields": {
        "driver_id": 1001,
        "driver_hourly_stats__conv_rate": 0.025330161675810814,
        "driver_hourly_stats__avg_daily_trips": 785,
        "driver_hourly_stats__acc_rate": 0.835975170135498
      },
      "statuses": {
        "driver_hourly_stats__avg_daily_trips": "PRESENT",
        "driver_id": "PRESENT",
        "driver_hourly_stats__conv_rate": "PRESENT",
        "driver_hourly_stats__acc_rate": "PRESENT"
      }
    },
    {
      "fields": {
        "driver_hourly_stats__conv_rate": 0.7595187425613403,
        "driver_hourly_stats__acc_rate": 0.1740121990442276,
        "driver_id": 1002,
        "driver_hourly_stats__avg_daily_trips": 875
      },
      "statuses": {
        "driver_hourly_stats__acc_rate": "PRESENT",
        "driver_id": "PRESENT",
        "driver_hourly_stats__avg_daily_trips": "PRESENT",
        "driver_hourly_stats__conv_rate": "PRESENT"
      }
    },
    {
      "fields": {
        "driver_hourly_stats__acc_rate": 0.7785481214523315,
        "driver_hourly_stats__conv_rate": 0.33832859992980957,
        "driver_hourly_stats__avg_daily_trips": 846,
        "driver_id": 1003
      },
      "statuses": {
        "driver_id": "PRESENT",
        "driver_hourly_stats__conv_rate": "PRESENT",
        "driver_hourly_stats__acc_rate": "PRESENT",
        "driver_hourly_stats__avg_daily_trips": "PRESENT"
      }
    }
  ]
}
from feast import Field, RequestSource
from feast.types import Float64, Int64
import pandas as pd

# Define a request data source which encodes features / information only
# available at request time (e.g. part of the user initiated HTTP request)
input_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name='val_to_add', dtype=Int64),
        Field(name='val_to_add_2', dtype=Int64)
    ]
)

# Use the input data and feature view features to create new features
@on_demand_feature_view(
   sources=[
       driver_hourly_stats_view,
       input_request
   ],
   schema=[
     Field(name='conv_rate_plus_val1', dtype=Float64),
     Field(name='conv_rate_plus_val2', dtype=Float64)
   ]
)
def transformed_conv_rate(features_df: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df['conv_rate_plus_val1'] = (features_df['conv_rate'] + features_df['val_to_add'])
    df['conv_rate_plus_val2'] = (features_df['conv_rate'] + features_df['val_to_add_2'])
    return df
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()
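For online retrieval, the request-time values can be supplied alongside the entity keys. The sketch below is illustrative only, assuming the store object and the transformed_conv_rate view defined above, and assuming that request data is passed through entity_rows as in recent Feast releases:

online_features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
    # Request-time values required by the on demand feature view.
    entity_rows=[{"driver_id": 1001, "val_to_add": 1, "val_to_add_2": 10}],
).to_dict()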

Application status: Stable
Level of support: The Feast community offers best-effort support for stable applications. Stable components will be offered long-term support.

Application status: Beta
Level of support: The Feast community offers best-effort support for beta applications. Beta applications will be supported for at least 2 more minor releases.

Application status: Alpha
Level of support: The response differs per application in alpha status, depending on the size of the community for that application and the current level of active development of the application.


Beta

APIs are considered stable and will not have breaking changes within 3 minor versions.

Beta

At risk of deprecation

Beta

Beta

Beta

Alpha

Alpha

Alpha

Scheduled for deprecation

Beta

Release process

Release process

For Feast maintainers, these are the concrete steps for making a new release.

  1. For new major or minor release, create and check out the release branch for the new stream, e.g. v0.6-branch. For a patch version, check out the stream's release branch.

  2. Update versions for the release/release candidate with a commit:

    1. In the root pom.xml, remove -SNAPSHOT from the <revision> property, update versions, and commit.

    2. Tag the commit with the release version, using a v and sdk/go/v prefixes

      • for a release candidate, create tags vX.Y.Z-rc.N and sdk/go/vX.Y.Z-rc.N

      • for a stable release X.Y.Z create tags vX.Y.Z and sdk/go/vX.Y.Z

    3. Check that versions are updated with make lint-versions.

    4. If changes required are flagged by the version lint, make the changes, amend the commit and move the tag to the new commit.

  3. Push the commits and tags. Make sure the CI passes.

    • If the CI does not pass, or if there are new patches for the release fix, repeat steps 2 and 3 with release candidates until a stable release is achieved.

  4. Bump to the next patch version in the release branch, append -SNAPSHOT in pom.xml and push.

  5. Create a PR against master to:

    1. Bump to the next major/minor version and append -SNAPSHOT.

    2. Add the change log by applying the change log commit created in step 2.

    3. Check that versions are updated with env TARGET_MERGE_BRANCH=master make lint-versions

When a tag that matches a Semantic Version string is pushed, CI will automatically build and push the relevant artifacts to their repositories or package managers (docker images, Python wheels, etc). JVM artifacts are promoted from Sonatype OSSRH to Maven Central, but it sometimes takes some time for them to be available. The sdk/go/v tag is required to version the Go SDK go module so that users can go get a specific tagged release of the Go SDK.

Creating a change log

  1. The change log generator configuration below will look for unreleased changes on a specific branch. The branch will be master for a major/minor release, or a release branch (v0.4-branch) for a patch release. You will need to set the branch using the --release-branch argument.

  2. You should also set the --future-release argument. This is the version you are releasing. The version can still be changed at a later date.

  3. Update the arguments below and run the command to generate the change log to the console.

docker run -it --rm ferrarimarco/github-changelog-generator \
--user feast-dev \
--project feast  \
--release-branch <release-branch-to-find-changes>  \
--future-release <proposed-release-version>  \
--unreleased-only  \
--no-issues  \
--bug-labels kind/bug  \
--enhancement-labels kind/feature  \
--breaking-labels compat/breaking  \
-t <your-github-token>  \
--max-issues 1 \
-o
  1. Review each change log item.

    • Make sure that sentences are grammatically correct and well formatted (although we will try to enforce this at the PR review stage).

    • Make sure that each item is categorised correctly. You will see the following categories: Breaking changes, Implemented enhancements, Fixed bugs, and Merged pull requests. Any unlabelled PRs will be found in Merged pull requests. It's important to make sure that any breaking changes, enhancements, or bug fixes are pulled up out of merged pull requests into the correct category. Housekeeping, tech debt clearing, infra changes, or refactoring do not count as enhancements. Only enhancements a user benefits from should be listed in that category.

    • Make sure that the "Full Change log" link is actually comparing the correct tags (normally your released version against the previous version).

    • Make sure that release notes and breaking changes are present.

Flag Breaking Changes & Deprecations

It's important to flag breaking changes and deprecation to the API for each release so that we can maintain API compatibility.

Developers should have flagged PRs with breaking changes with the compat/breaking label. However, it's important to double check each PR's release notes and contents for changes that will break API compatibility and manually label compat/breaking to PRs with undeclared breaking changes. The change log will have to be regenerated if any new labels have to be added.

Development guide

Overview

This guide is targeted at developers looking to contribute to Feast:

  • Project Structure

  • Making a Pull Request

  • Feast Data Storage Format

  • Feast Protobuf API

Project Structure

Repository: Main Feast Repository
Description: Hosts all required code to run Feast. This includes the Feast Python SDK and Protobuf definitions. For legacy reasons this repository still contains Terraform config and a Go Client for Feast.
Component(s):

  • Python SDK / CLI

  • Protobuf APIs

  • Documentation

  • Go Client

  • Terraform

Repository: Feast Java
Description: Java-specific Feast components. Includes the Feast Core Registry, Feast Serving for serving online feature values, and the Feast Java Client for retrieving feature values.
Component(s):

  • Core

  • Serving

  • Java Client

Repository: Feast Spark
Description: Feast Spark SDK & Feast Job Service for launching ingestion jobs and for building training datasets with Spark.
Component(s):

  • Spark SDK

  • Job Service

Repository: Feast Helm Chart
Description: Helm Chart for deploying Feast on Kubernetes & Spark.
Component(s):

  • Helm Chart

Making a Pull Request

Incorporating upstream changes from master

Our preference is to use git rebase instead of git merge: git pull -r

Signing commits

Commits have to be signed before they are allowed to be merged into the Feast codebase:

# Include -s flag to signoff
git commit -s -m "My first commit"

Good practices to keep in mind

  • Fill in the description based on the default template configured when you first open the PR

    • What this PR does/why we need it

    • Which issue(s) this PR fixes

    • Does this PR introduce a user-facing change

  • Include kind label when opening the PR

  • Add WIP: to PR name if more work needs to be done prior to review

  • Avoid force-pushing as it makes reviewing difficult

Managing CI-test failures

  • GitHub runner tests

    • Click checks tab to analyse failed tests

  • Prow tests

Feast Data Storage Format

Feast data storage contracts are documented in the following locations:

Feast Protobuf API

Feast Protobuf API defines the common API used by Feast's Components:

Generating Language Bindings

The language specific bindings have to be regenerated when changes are made to the Feast Protobuf API:

Repository: Main Feast Repository
Language: Python
Regenerating Language Bindings: Run make compile-protos-python to generate bindings

Repository: Main Feast Repository
Language: Golang
Regenerating Language Bindings: Run make compile-protos-go to generate bindings

Repository: Feast Java
Language: Java
Regenerating Language Bindings: No action required: bindings are generated automatically during compilation.

Update the CHANGELOG.md. See the Creating a change log guide and commit.

Make sure to review each PR in the changelog to flag any breaking changes and deprecation.

Create a GitHub release which includes a summary of important changes as well as any artifacts associated with the release. Make sure to include the same change log as added in CHANGELOG.md. Use Feast vX.Y.Z as the title.

We use an open source change log generator to generate change logs. The process still requires a little bit of manual effort.

Create a GitHub token as per these instructions. The token is used as an input argument (-t) to the change log generator.

Learn how the Feast Contributing Process works.

Feast is composed of multiple components distributed into multiple repositories:

See also the CONTRIBUTING.md in the corresponding GitHub repository (e.g. main repo doc).

Visit the Prow status page to analyse failed tests.

Feast Offline Storage Format: Used by BigQuery, Snowflake (Future), Redshift (Future).

Feast Online Storage Format: Used by Redis, Google Datastore.

Feast Protobuf API specifications are written in proto3 in the Main Feast Repository.

Changes to the API should be proposed via a GitHub Issue for discussion first.

Feast Serving
Feast Core
Feast Java Client
Feast Python SDK
Feast Go Client
Feast Spark Python SDK
Feast Spark Launchers
Feast Job Service
Feast Helm Chart

Feast 0.9 vs Feast 0.10+

Feast 0.10 brought about major changes to the way Feast is architected and how the software is intended to be deployed, extended, and operated.

Changes introduced in Feast 0.10

Challenges in Feast 0.9 (Before)
Changed in Feast 0.10+ (After)

Hard to install because it was a heavy-weight system with many components requiring a lot of configuration

  • Easy to install via pip install

  • Opinionated default configurations

  • No Helm charts necessary

Engineering support needed to deploy/operate reliably

  • Feast moves from a stack of services to a CLI/SDK

  • No need for Kubernetes or Spark

  • No long running processes or orchestrators

  • Leverages globally available managed services where possible

Hard to develop/debug with tightly coupled components, async operations, and hard to debug components like Spark

  • Easy to develop and debug

  • Modular components

  • Clear extension points

  • Fewer background operations

  • Faster feedback

  • Local mode

Inability to benefit from cloud-native technologies because of focus on reusable technologies like Kubernetes and Spark

  • Leverages best-in-class cloud technologies so users can enjoy scalable + powerful tech stacks without managing open source stacks themselves

Changes in more detail

Where Feast 0.9 was a large stack of components that needed to be deployed to Kubernetes, Feast 0.10 is simply a lightweight SDK and CLI. It doesn’t need any long-running processes to operate. This SDK/CLI can deploy and configure your feature store to your infrastructure, and execute workflows like building training datasets or reading features from an online feature store.

  • Feast 0.10 introduces local mode: Local mode allows users to try out Feast in a completely local environment (without using any cloud technologies). This provides users with a responsive means of trying out the software before deploying it into a production environment.

  • Feast comes with opinionated defaults: As much as possible we are attempting to make Feast a batteries-included feature store that removes the need for users to configure infinite configuration options (as with Feast 0.9). Feast 0.10 comes with sane default configuration options to deploy Feast on your infrastructure.

  • Feast Core was replaced by a file-based (S3, GCS) registry: Feast Core is a metadata server that maintains and exposes an API of feature definitions. With Feast 0.10, we’ve moved this entire service into a single flat file that can be stored on either the local disk or in a central object store like S3 or GCS. The benefit of this change is that users don’t need to maintain a database and a registry service, yet they can still access all the metadata they had before.

  • Materialization is a CLI operation: Instead of having ingestion jobs be managed by a job service, users can now schedule a batch ingestion job themselves by calling “materialize”. This change was introduced because most teams already have schedulers like Airflow in their organization. By starting ingestion jobs from Airflow, teams are now able to easily track state outside of Feast and to debug failures synchronously. Similarly, streaming ingestion jobs can be launched through the “apply” command (see the sketch after this list).

  • Doubling down on data warehouses: Most modern data teams are doubling down on data warehouses like BigQuery, Snowflake, and Redshift. Feast doubles down on these big data technologies as the primary interfaces through which it launches batch operations (like training dataset generation). This reduces the development burden on Feast contributors (since they only need to reason about SQL), provides users with a more responsive experience, avoids moving data from the warehouse (to compute joins using Spark), and provides a more serverless and scalable experience to users.

  • Temporary loss of streaming support: Unfortunately, Feast 0.10, 0.11, and 0.12 do not support streaming feature ingestion out of the box. It is entirely possible to launch streaming ingestion jobs using these Feast versions, but it requires the use of a Feast extension point to launch these ingestion jobs. It is still a core design goal for Feast to support streaming ingestion, so this change is in the development backlog for the Feast project.

  • **Addition of extension points:** Feast 0.10+ introduces various extension points. Teams can override all feature store behavior by writing (or extending) a provider. It is also possible for teams to add their own data storage connectors for both an offline and online store using a plugin interface that Feast provides.
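
As a sketch of the registry and materialization points above (assuming the same illustrative feature repository as earlier), the SDK can read the file-based registry directly and run materialization the same way a scheduler such as Airflow would invoke it:

```python
# Minimal sketch: explore the file-based registry and load a time window of
# feature values into the online store. Repository contents are illustrative.
from datetime import datetime, timedelta

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# The registry is a single file (local disk, S3, or GCS); no registry service
# is needed to list what has been registered.
for fv in store.list_feature_views():
    print(fv.name, [field.name for field in fv.features])

# Batch-load the last day of feature values into the online store. The CLI
# equivalents are `feast materialize <start> <end>` and
# `feast materialize-incremental <end>`.
end = datetime.utcnow()
store.materialize(start_date=end - timedelta(days=1), end_date=end)
```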

Comparison of architectures

Feast 0.9 architecture (diagram)

Feast 0.10, 0.11, and 0.12 architecture (diagram)

Feast 1.0 architecture, eventual goal (diagram)

Comparison of components

| Component | Feast 0.9 | Feast 0.10, 0.11, 0.12+ |
| --- | --- | --- |
| Architecture | • Service-oriented architecture<br>• Containers and services deployed to Kubernetes | • SDK/CLI centric software<br>• Feast is able to deploy or configure infrastructure for use as a feature store |
| Installation | Terraform and Helm | • Pip to install SDK/CLI<br>• Provider used to deploy Feast components to GCP, AWS, or other environments during `apply` |
| Required infrastructure | Kubernetes, Postgres, Spark, Docker, Object Store | None |
| Batch compute | Yes (Spark based) | • Python native (client-side) for batch data loading<br>• Data warehouse for batch compute |
| Streaming support | Yes (Spark based) | Planned. Streaming jobs will be launched using `apply` |
| Offline store | None (can source data from any source Spark supports) | BigQuery, Snowflake (planned), Redshift, or custom implementations |
| Online store | Redis | DynamoDB, Firestore, Redis, and more planned |
| Job manager | Yes | No |
| Registry | gRPC service with Postgres backend | File-based registry with accompanying SDK for exploration |
| Local mode | No | Yes |

Upgrading from Feast 0.9 to the latest Feast

Please see the Feast 0.9 Upgrade Guide for instructions on how to upgrade to the latest Feast version.

Feast contributors identified various design challenges in Feast 0.9 that made it hard to deploy, operate, extend, and maintain, for users and contributors alike. Our goal is to make ML practitioners immediately productive in operationalizing data for machine learning. To that end, Feast 0.10+ made the improvements over Feast 0.9 described above.

