Feature retrieval (or serving) is the process of retrieving either historical features or online features from Feast, for the purposes of training or serving a model.
Feast attempts to unify the process of retrieving features in both the historical and online case. It does this through the creation of feature references. One of the major advantages of using Feast is that you have a single semantic reference to a feature. These feature references can then be stored alongside your model and loaded into a serving layer where they can be used for online feature retrieval.
In Feast, each feature can be uniquely addressed through a feature reference. A feature reference is composed of the following components: the feature-set name and the feature name.
These components can be used to create a string-based feature reference of the form feature-set:feature, as in the example below.
Feast will attempt to infer the feature-set name if it is not provided, but a feature reference must provide a feature name.
    # Feature references
    features = [
        'partner',
        'daily_transactions',
        'customer_feature_set:dependents',
        'customer_feature_set:has_phone_service',
    ]
    target = 'churn'
Feature references only apply to a single project. Features cannot be retrieved across projects in a single request.
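To make the string format concrete, here is a minimal sketch of how such a reference could be split into its components. The helper `parse_feature_ref` and the `FeatureRef` tuple are hypothetical names used for illustration and are not part of the Feast SDK. The sketch assumes the `feature_set:feature` form seen in the example above, with the feature-set part optional.

```python
from typing import NamedTuple, Optional


class FeatureRef(NamedTuple):
    """Hypothetical container for the parts of a feature reference."""
    feature_set: Optional[str]  # None when the feature set is left for Feast to infer
    name: str                   # the feature name, always required


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split 'feature_set:feature' or 'feature' into its parts (illustrative only)."""
    feature_set, _, name = ref.rpartition(":")
    if not name:
        raise ValueError(f"feature reference {ref!r} has no feature name")
    return FeatureRef(feature_set or None, name)


refs = ["partner", "customer_feature_set:dependents"]
parsed = [parse_feature_ref(r) for r in refs]
```

Here `parsed[0]` carries no feature-set name, which corresponds to the case where Feast is left to infer it, while `parsed[1]` names both parts explicitly.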
Historical feature retrieval can be done through either the Feast SDK or directly through the Feast Serving gRPC API. Below is an example of historical retrieval from the Churn Prediction Notebook.
    # Add the target variable to our feature list
    features = self._features + [self._target]

    # Retrieve training dataset from Feast. The "entity_df" is a dataframe that contains
    # timestamps and entity keys. In this case, it is a dataframe with two columns:
    # one timestamp column and one customer id column.
    dataset = client.get_batch_features(
        feature_refs=features,
        entity_rows=entity_df,
    )

    # Materialize the dataset object to a Pandas DataFrame.
    # Alternatively it is possible to use a file reference if the data is too large
    df = dataset.to_dataframe()
In the above example, Feast does a point-in-time-correct query from a single feature set. For each timestamp and entity key combination provided by entity_df, Feast determines the values of all the features in the features list at that point in time and joins those feature values to that entity key and timestamp. It repeats this process for every row of entity_df.
This is called a point-in-time-correct join.
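The lookup described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Feast implements it: the function `point_in_time_join`, the driver ids, and the `conv_rate` values are all made up, and details such as a maximum feature age are ignored. For each entity row, the sketch picks the most recent feature value recorded at or before that row's timestamp.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows):
    """For each (entity_id, timestamp) row, attach the latest feature value whose
    timestamp is at or before the row's timestamp (None if there is none)."""
    # Group feature rows per entity, ordered by event timestamp.
    history = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        timestamps, values = history.setdefault(entity_id, ([], []))
        timestamps.append(ts)
        values.append(value)

    joined = []
    for entity_id, ts in entity_rows:
        timestamps, values = history.get(entity_id, ([], []))
        idx = bisect_right(timestamps, ts)  # count of feature rows with timestamp <= ts
        joined.append((entity_id, ts, values[idx - 1] if idx else None))
    return joined


# Made-up feature history: (driver id, event timestamp, conv_rate)
feature_rows = [
    (1001, datetime(2020, 1, 1), 0.50),
    (1001, datetime(2020, 1, 3), 0.70),
    (1002, datetime(2020, 1, 2), 0.30),
]
# Made-up entity rows: (driver id, timestamp to look up)
entity_rows = [
    (1001, datetime(2020, 1, 2)),
    (1001, datetime(2020, 1, 4)),
    (1002, datetime(2020, 1, 1)),
]
result = point_in_time_join(entity_rows, feature_rows)
```

The last entity row gets `None` because driver 1002 has no feature value recorded on or before its timestamp; a value from the future is never used.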
Feast allows users to retrieve features from any number of feature sets and join them together in a single response dataset. The only requirement is that the user provides the correct entities in order to look up the features.
Below is another example of how a point-in-time-correct join works. We have two dataframes. The first is the entity dataframe, which contains timestamps, entities, and labels. The user would like to have driver features joined onto this entity dataframe from the driver dataframe to produce an output dataframe that contains both labels and features. They would then like to train their model on this output dataframe.
The input 1 DataFrame (the entity dataframe) would be provided by the user, and the input 2 DataFrame (the driver dataframe) would already be ingested into Feast. To join these two, the user would call Feast as follows:
    # Feature references
    features = [
        'conv_rate',
        'acc_rate',
        'avg_daily_trips',
        'trip_completed',
    ]

    dataset = client.get_batch_features(
        feature_refs=features,   # This is a list of feature references
        entity_rows=entity_df,   # This is the entity dataframe above
    )

    # This prints out the joined output dataframe
    print(dataset.to_dataframe())
Feast is able to intelligently join feature data with different timestamps to a single basis table in a point-in-time-correct way. This allows users to join daily batch data with high-frequency event data transparently. They simply need to know the feature names.
Point-in-time-correct joins also prevent feature leakage by trying to accurately reproduce the state of the world at a single point in time, instead of just joining features based on the nearest timestamps.
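A small sketch of the difference, using made-up values for a single driver: a nearest-timestamp lookup can pick a value that was only recorded after the training row's timestamp, while a point-in-time-correct lookup cannot. The function names `nearest_value` and `point_in_time_value` are hypothetical and exist only for this illustration.

```python
from datetime import datetime

# Made-up feature history for one driver: (event timestamp, conv_rate)
history = [
    (datetime(2020, 1, 1), 0.50),
    (datetime(2020, 1, 3), 0.90),
]
label_time = datetime(2020, 1, 2, 18)  # timestamp of the training row


def nearest_value(history, ts):
    """Naive join: take the value whose timestamp is closest to ts, past or future."""
    return min(history, key=lambda row: abs(row[0] - ts))[1]


def point_in_time_value(history, ts):
    """Point-in-time-correct join: take the latest value observed at or before ts."""
    past = [row for row in history if row[0] <= ts]
    return max(past, key=lambda row: row[0])[1] if past else None


leaky = nearest_value(history, label_time)          # 0.90, recorded after label_time
correct = point_in_time_value(history, label_time)  # 0.50, already known at label_time
```

Training on `leaky` would expose the model to information that was not available at `label_time`, which is the leakage a point-in-time-correct join avoids.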
Online feature retrieval works in much the same way as batch retrieval, with one important distinction: Online stores only maintain the current state of features. No historical data is served.
    features = [
        'conv_rate',
        'acc_rate',
        'avg_daily_trips',
    ]

    data = client.get_online_features(
        feature_refs=features,     # Contains only feature references
        entity_rows=entity_rows,   # Contains only entities (driver ids)
    )
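The distinction from batch retrieval can be illustrated with a toy in-memory store that keeps only the latest value per entity and feature. This is a sketch, not Feast's online store: the class `LatestValueStore`, its methods, and the rule that an older, late-arriving row does not overwrite a newer one are all assumptions made for the example.

```python
from datetime import datetime


class LatestValueStore:
    """Toy stand-in for an online store: keeps one value per (entity, feature)."""

    def __init__(self):
        self._rows = {}  # (entity_id, feature) -> (event timestamp, value)

    def write(self, entity_id, feature, ts, value):
        # Overwrite only if the incoming row is at least as recent as the stored one.
        current = self._rows.get((entity_id, feature))
        if current is None or ts >= current[0]:
            self._rows[(entity_id, feature)] = (ts, value)

    def read(self, entity_id, features):
        # No timestamp argument: only the current state can be served.
        return {f: self._rows.get((entity_id, f), (None, None))[1] for f in features}


store = LatestValueStore()
store.write(1001, "conv_rate", datetime(2020, 1, 1), 0.50)
store.write(1001, "conv_rate", datetime(2020, 1, 3), 0.70)
store.write(1001, "conv_rate", datetime(2020, 1, 2), 0.60)  # older row, ignored
online = store.read(1001, ["conv_rate", "acc_rate"])
```

Because `read` takes no timestamp, there is no way to ask what `conv_rate` was on an earlier date; that kind of query belongs to historical retrieval.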