Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features.
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.
Below is an example data source with a single entity column (driver
) and two feature columns (trips_today
, and rating
).
Feast supports primarily time-stamped tabular data as data sources. There are many kinds of possible data sources:
Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
[Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
(Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on-demand feature views, which allow light-weight feature engineering and combining features across sources.
Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).
A key command to use in Feast is the materialize_incremental
command, which fetches the latest values for all entities in the batch source and ingests these values into the online store.
Materialization can be called programmatically or through the CLI:
If the schema
parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply
. The way it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box this inference is performed by inspecting the schema of the table in the cloud data warehouse, or if a query is provided to the source, by running the query with a LIMIT
clause and inspecting the result.
Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.
To push data into the offline or online stores: see push sources for details.
(experimental) To use a contrib Spark processor to ingest from a topic, see Tutorial: Building streaming features
The top-level namespace within Feast is a project. Users define one or more feature views within a project. Each feature view contains one or more features. These features typically relate to one or more entities. A feature view must always have a data source, which in turn is used during the generation of training datasets and when materializing feature values into the online store.
Projects provide complete isolation of feature stores at the infrastructure level. This is accomplished through resource namespacing, e.g., prefixing table names with the associated project. Each project should be considered a completely separate universe of entities and features. It is not possible to retrieve features from multiple projects in a single request. We recommend having a single feature store and a single project per environment (dev
, staging
, prod
).
For offline use cases that only rely on batch data, Feast does not need to ingest data and can query your existing data (leveraging a compute engine, whether it be a data warehouse or (experimental) Spark / Trino). Feast can help manage pushing streaming features to a batch source to make features available for training.
For online use cases, Feast supports ingesting features from batch sources to make them available online (through a process called materialization), and pushing streaming features to make them available both offline / online. We explore this more in the next concept page (Data ingestion)
Features are registered as code in a version controlled repository, and tie to data sources + model versions via the concepts of entities, feature views, and feature services. We explore these concepts more in the upcoming concept pages. These features are then stored in a registry, which can be accessed across users and services. The features can then be retrieved via SDK API methods or via a deployed feature server which exposes endpoints to query for online features (to power real time models).
Feast supports several patterns of feature retrieval.
Training data generation
Fetching user and item features for (user, item) pairs when training a production recommendation model
get_historical_features
Offline feature retrieval for batch predictions
Predicting user churn for all users on a daily basis
get_historical_features
Online feature retrieval for real-time model predictions
Fetching pre-computed features to predict whether a real-time credit card transaction is fraudulent
get_online_features
Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. Data Quality Monitoring was the primary motivation for creating dataset concept.
Dataset's metadata is stored in the Feast registry and raw data (features, entities, additional input keys and timestamp) is stored in the offline store.
Dataset can be created from:
Results of historical retrieval
[planned] Logging request (including input for on demand transformation) and response during feature serving
[planned] Logging features during writing to online store (from batch source or stream)
To create a saved dataset from historical features for later retrieval or analysis, a user needs to call get_historical_features
method first and then pass the returned retrieval job to create_saved_dataset
method. create_saved_dataset
will trigger the provided retrieval job (by calling .persist()
on it) to store the data using the specified storage
behind the scenes. Storage type must be the same as the globally configured offline store (e.g it's impossible to persist data to a different offline source). create_saved_dataset
will also create a SavedDataset
object with all of the related metadata and will write this object to the registry.
Saved dataset can be retrieved later using the get_saved_dataset
method in the feature store:
Check out our tutorial on validating historical features to see how this concept can be applied in a real-world use case.
An entity is a collection of semantically related features. Users define entities to map to the domain of their use case. For example, a ride-hailing service could have customers and drivers as their entities, which group related features that correspond to these customers and drivers.
The entity name is used to uniquely identify the entity (for example to show in the experimental Web UI). The join key is used to identify the physical primary key on which feature values should be joined together to be retrieved during feature retrieval.
Entities are used by Feast in many contexts, as we explore below:
Feast's primary object for defining features is a feature view, which is a collection of features. Feature views map to 0 or more entities, since a feature can be associated with:
zero entities (e.g. a global feature like num_daily_global_transactions)
one entity (e.g. a user feature like user_age or last_5_bought_items)
multiple entities, aka a composite key (e.g. a user + merchant category feature like num_user_purchases_in_merchant_category)
Feast refers to this collection of entities for a feature view as an entity key.
Entities should be reused across feature views. This helps with discovery of features, since it enables data scientists understand how other teams build features for the entity they are most interested in.
Feast will use the feature view concept to then define the schema of groups of features in a low-latency online store.
At serving time, users specify entity key(s) to fetch the latest feature values which can power real-time model prediction (e.g. a fraud detection model that needs to fetch the latest transaction user's features to make a prediction).
Q: Can I retrieve features for all entities?
Kind of.
For real-time feature retrieval, there is no out of the box support for this because it would promote expensive and slow scan operations which can affect the performance of other operations on your data sources. Users can still pass in a large list of entities for retrieval, but this does not scale well.
At training time, users control what entities they want to look up, for example corresponding to train / test / validation splits. A user specifies a list of entity keys + timestamps they want to fetch correct features for to generate a training dataset.
In practice, this is most relevant for batch scoring models (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an that welcomes contribution to make this a more intuitive API.