
Running Feast with Snowflake/GCP/AWS


Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See Feature Repository for a detailed explanation of feature repositories.
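
For a sense of what such definitions look like, below is a rough sketch of a minimal feature definition file. The exact classes and arguments vary between Feast versions, so treat the names as illustrative rather than canonical:

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

# An entity describes the join key that features are keyed and joined on
driver = Entity(name="driver", join_keys=["driver_id"])

# A feature view groups features that share an entity and a batch source
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
    ],
    source=FileSource(
        path="data/driver_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)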

The easiest way to create a new feature repository is to use the feast init command:

feast init

Creating a new Feast repository in /<...>/tiny_pika.

feast init -t snowflake
Snowflake Deployment URL: ...
Snowflake User Name: ...
Snowflake Password: ...
Snowflake Role Name: ...
Snowflake Warehouse Name: ...
Snowflake Database Name: ...

Creating a new Feast repository in /<...>/tiny_pika.

feast init -t gcp

Creating a new Feast repository in /<...>/tiny_pika.

feast init -t aws
AWS Region (e.g. us-west-2): ...
Redshift Cluster ID: ...
Redshift Database Name: ...
Redshift User Name: ...
Redshift S3 Staging Location (s3://*): ...
Redshift IAM Role for S3 (arn:aws:iam::*:role/*): ...
Should I upload example data to Redshift (overwriting 'feast_driver_hourly_stats' table)? (Y/n):

Creating a new Feast repository in /<...>/tiny_pika.

The init command creates a Python file with feature definitions, sample data, and a Feast configuration file for local development:

$ tree
.
└── tiny_pika
    ├── data
    │   └── driver_stats.parquet
    ├── example.py
    └── feature_store.yaml

1 directory, 3 files

Enter the directory:

# Replace "tiny_pika" with your auto-generated dir name
cd tiny_pika

You can now use this feature repository for development. You can try the following:

  • Run feast apply to apply these definitions to Feast.

  • Edit the example feature definitions in example.py and run feast apply again to change feature definitions.

  • Initialize a git repository in the same directory and check the feature repository into version control.

Build a training dataset

Feast allows users to build a training dataset from time-series feature data that already exists in an offline store. Users are expected to provide a list of features to retrieve (which may span multiple feature views), and a dataframe to join the resulting features onto. Feast will then execute a point-in-time join of multiple feature views onto the provided dataframe, and return the full resulting dataframe.
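
To make the point-in-time join concrete, the sketch below shows the same semantics in plain pandas. This is only an illustration of what a point-in-time join does, not how Feast executes it, and the column values are made up:

import pandas as pd

# Entity dataframe: the rows (and timestamps) we want features for
entity_df = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2021-04-12 10:00", "2021-04-12 16:00"]),
    "driver_id": [1001, 1001],
})

# Time-series feature data as it exists in the offline store
feature_df = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2021-04-12 08:00", "2021-04-12 14:00"]),
    "driver_id": [1001, 1001],
    "conv_rate": [0.5, 0.7],
})

# A point-in-time join takes, for each entity row, the latest feature value
# at or before that row's event_timestamp
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
)
print(training_df)  # the 10:00 row gets conv_rate 0.5, the 16:00 row gets 0.7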

Retrieving historical features

1. Register your feature views

Please ensure that you have created a feature repository and that you have registered (applied) your feature views with Feast.

2. Define feature references

Start by defining the feature references (e.g., driver_trips:average_daily_rides) for the features that you would like to retrieve from the offline store. These features can come from multiple feature tables. The only requirement is that the feature tables that make up the feature references have the same entity (or composite entity), and that they aren't located in more than one offline store.

feature_refs = [
    "driver_trips:average_daily_rides",
    "driver_trips:maximum_daily_rides",
    "driver_trips:rating",
    "driver_trips:rating:trip_completed",
]

3. Create an entity dataframe

An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature tables onto. All entities found in feature views that are being joined onto the entity dataframe must be found as a column on the entity dataframe.

It is possible to provide entity dataframes as either a Pandas dataframe or a SQL query.

Pandas:

In the example below we create a Pandas based entity dataframe that has a single row with an event_timestamp column and a driver_id entity column. Pandas based entity dataframes may need to be uploaded into an offline store, which may result in longer wait times compared to a SQL based entity dataframe.

import pandas as pd
from datetime import datetime

entity_df = pd.DataFrame(
    {
        "event_timestamp": [pd.Timestamp(datetime.now(), tz="UTC")],
        "driver_id": [1001]
    }
)

SQL (Alternative):

Below is an example of an entity dataframe built from a BigQuery SQL query. It is only possible to use this query when all feature views being queried are available in the same offline store (BigQuery).

entity_df = "SELECT event_timestamp, driver_id FROM my_gcp_project.table"

4. Launch historical retrieval

Once the feature references and an entity dataframe are defined, it is possible to call get_historical_features(). This method launches a job that executes a point-in-time join of features from the offline store onto the entity dataframe. Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/your/feature/repo")

training_df = fs.get_historical_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate"
    ],
    entity_df=entity_df
).to_df()

Install Feast

Install Feast using pip:

pip install feast

Install Feast with Snowflake dependencies (required when using Snowflake):

pip install 'feast[snowflake]'

Install Feast with GCP dependencies (required when using BigQuery or Firestore):

pip install 'feast[gcp]'

Install Feast with AWS dependencies (required when using Redshift or DynamoDB):

pip install 'feast[aws]'

Install Feast with Redis dependencies (required when using Redis, either through AWS ElastiCache or independently):

pip install 'feast[redis]'
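
To confirm the installation (a quick sanity check, not part of the original guide), import the package and print its version:

import feast

print(feast.__version__)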

Scaling Feast

Overview

Feast is designed to be easy to use and understand out of the box, with as few infrastructure dependencies as possible. However, there are components used by default that may not scale well. Since Feast is designed to be modular, it's possible to swap these components for more performant ones, at the cost of Feast depending on additional infrastructure.

Scaling Feast Registry

The default Feast registry is a file-based registry. Any change to the feature repo, or any materialization of data into the online store, results in a mutation to the registry.

However, there are inherent limitations with a file-based registry, since changing a single field in the registry requires re-writing the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or bottlenecks writes to the registry since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).

The recommended solution in this case is to use the SQL based registry, which allows concurrent, transactional, and fine-grained updates to the registry. This registry implementation requires access to an existing database (such as MySQL, Postgres, etc.).

Scaling Materialization

The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store. However, this process does not scale for large data sets, since it's executed in a single process.

Feast supports pluggable Materialization Engines that allow the materialization process to be scaled up. Aside from the local process, Feast supports a Lambda-based materialization engine and a Bytewax-based materialization engine.

Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.

Structuring Feature Repos

A common scenario when using Feast in production is the need to test changes to Feast object definitions. For this, we recommend setting up a staging environment for your offline and online stores, which mirrors production (with potentially a smaller data set). Having this separate environment allows users to test changes by first applying them to staging, and then promoting them to production after verifying the changes on staging.

Setting up multiple environments

There are three common ways teams approach having separate environments:

  • Have separate git branches for each environment

  • Have separate feature_store.yaml files and separate Feast object definitions that correspond to each environment

  • Have separate feature_store.yaml files per environment, but share the Feast object definitions

Different version control branches

To keep a clear separation of the feature repos, teams may choose to have multiple long-lived branches in their version control system, one for each environment. In this approach, with CI/CD set up, changes would first be made to the staging branch, and then copied over manually to the production branch once verified in the staging environment.

Separate feature_store.yaml files and separate Feast object definitions

For this approach, we have created an example repository (Feast Repository Example) which contains two Feast projects, one per environment.

The contents of this repository are shown below:

├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
│
├── staging
│   ├── driver_repo.py
│   └── feature_store.yaml
│
└── production
    ├── driver_repo.py
    └── feature_store.yaml

The repository contains three sub-folders:

  • staging/: This folder contains the staging feature_store.yaml and Feast objects. Users that want to make changes to the Feast deployment in the staging environment will commit changes to this directory.

  • production/: This folder contains the production feature_store.yaml and Feast objects. Typically users would first test changes in staging before copying the feature definitions into the production folder and committing the changes.

  • .github: This folder is an example of a CI system that applies the changes in either the staging or production repositories using feast apply. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.

The feature_store.yaml contains the following:

project: staging
registry: gs://feast-ci-demo-registry/staging/registry.db
provider: gcp

Notice how the registry has been configured to use a Google Cloud Storage bucket. All changes made to infrastructure using feast apply are tracked in the registry.db. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.

It is important to note that the CI system above must have access to create, modify, or remove infrastructure in your production environment. This is unlike clients of the feature store, who will only have read access.

If your organization consists of many independent data science teams, or a single group is working on several projects that could benefit from sharing features, entities, sources, and transformations, then we encourage you to utilize Python packages inside each environment:

└── production
    ├── common
    │    ├── __init__.py
    │    ├── sources.py
    │    └── entities.py
    ├── ranking
    │    ├── __init__.py
    │    ├── views.py
    │    └── transformations.py
    ├── segmentation
    │    ├── __init__.py
    │    ├── views.py
    │    └── transformations.py
    └── feature_store.yaml

Shared Feast Object definitions with separate feature_store.yaml files

This approach is very similar to the previous approach, but instead of duplicating Feast objects and having to copy over changes, it may be possible to share the same Feast object definitions and have a different feature_store.yaml configuration per environment.

An example of how such a repository would be structured is as follows:

├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
├── staging
│   └── feature_store.yaml
├── production
│   └── feature_store.yaml
└── driver_repo.py

Users can then apply the definitions to each environment in this way:

feast -f staging/feature_store.yaml apply

This setup has the advantage that you can share the feature definitions entirely, which may prevent issues with copy-pasting code.

Summary

In summary, once you have set up a Git based repository with CI that runs feast apply on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or retrieval of data.

Read features from the online store

The Feast Python SDK allows users to retrieve feature values from an online store. This API is used to look up feature values at low latency during model serving in order to make online predictions.

Online stores only maintain the current state of features, i.e. the latest feature values. No historical data is stored or served.

Retrieving online features

1. Ensure that feature values have been loaded into the online store

Please ensure that you have materialized (loaded) your feature values into the online store before starting.

2. Define feature references

Create a list of features that you would like to retrieve. This list typically comes from the model training step and should accompany the model binary.

features = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate"
]

3. Read online features

Next, we will create a feature store object and call get_online_features() which reads the relevant feature values directly from the online store.

from feast import FeatureStore

fs = FeatureStore(repo_path="path/to/feature/repo")

online_features = fs.get_online_features(
    features=features,
    entity_rows=[
        # {join_key: entity_value, ...}
        {"driver_id": 1001},
        {"driver_id": 1002}]
).to_dict()

The returned dictionary maps each feature (and the entity join key) to a list of values, one per entity row:

{
   "driver_hourly_stats__acc_rate":[
      0.2897740304470062,
      0.6447265148162842
   ],
   "driver_hourly_stats__conv_rate":[
      0.6508077383041382,
      0.14802511036396027
   ],
   "driver_id":[
      1001,
      1002
   ]
}

Load data into the online store

Feast allows users to load their feature data into an online store in order to serve the latest features to models for online prediction.

Materializing features

1. Register feature views

Before proceeding, please ensure that you have applied (registered) the feature views that should be materialized.

2.a Materialize

The materialize command allows users to materialize features over a specific historical time range into the online store.

feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00

The above command will query the batch sources for all feature views over the provided time range, and load the latest feature values into the configured online store.

It is also possible to materialize for specific feature views by using the -v / --views argument.

feast materialize 2021-04-07T00:00:00 2021-04-08T00:00:00 \
--views driver_hourly_stats

The materialize command is completely stateless. It requires the user to provide the time ranges that will be loaded into the online store. This command is best used from a scheduler that tracks state, like Airflow.
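
As an illustration of driving this from a scheduler, below is a minimal Airflow sketch (not from the Feast docs; the DAG id, schedule, and repository path are placeholders) that runs feast materialize once per day over the interval Airflow tracks:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="feast_materialize",          # hypothetical DAG name
    start_date=datetime(2021, 4, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    # Airflow supplies the day boundaries, so each run loads exactly one day of data
    materialize = BashOperator(
        task_id="materialize",
        bash_command=(
            "cd /path/to/feature/repo && "
            "feast materialize {{ ds }}T00:00:00 {{ next_ds }}T00:00:00"
        ),
    )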

2.b Materialize Incremental (Alternative)

For simplicity, Feast also provides a materialize-incremental command that will only ingest new data that has arrived in the offline store. Unlike materialize, materialize-incremental will track the state of previous ingestion runs inside of the feature registry.

The example command below will load only new data that has arrived for each feature view up to the end date and time (2021-04-08T00:00:00).

feast materialize-incremental 2021-04-08T00:00:00

The materialize-incremental command functions similarly to materialize in that it loads data over a specific time range for all feature views (or the selected feature views) into the online store.

Unlike materialize, materialize-incremental automatically determines the start time from which to load features from the batch sources of each feature view. The first time materialize-incremental is executed, it will set the start time to the oldest timestamp of each data source, and the end time as the one provided by the user. For each run of materialize-incremental, the end timestamp will be tracked.

Subsequent runs of materialize-incremental will then set the start time to the end time of the previous run, thus only loading new data that has arrived into the online store. Note that the end time that is tracked for each run is at the feature view level, not globally for all feature views, i.e. different feature views may have different periods that have been materialized into the online store.
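
The same operation is also available from the Python SDK. Below is a minimal sketch (the repository path is a placeholder) using FeatureStore.materialize_incremental:

from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path="path/to/feature/repo")  # placeholder path

# Loads new data for every feature view up to the given end date,
# tracking the per-feature-view end timestamp in the registry
store.materialize_incremental(end_date=datetime(2021, 4, 8))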

Deploy a feature store

The Feast CLI can be used to deploy a feature store to your infrastructure, spinning up any necessary persistent resources like buckets or tables in data stores. The deployment target and effects depend on the provider that has been configured in your feature_store.yaml file, as well as the feature definitions found in your feature repository.

Here we'll be using the example repository we created in the previous guide, Create a feature store. You can re-create it by running feast init in a new directory.

Deploying

To have Feast deploy your infrastructure, run feast apply from your command line while inside a feature repository:

feast apply

# Processing example.py as example
# Done!

Depending on whether the feature repository is configured to use a local provider or one of the cloud providers like GCP or AWS, it may take from a couple of seconds to a minute to run to completion.

At this point, no data has been materialized to your online store. Feast apply simply registers the feature definitions with Feast and spins up any necessary infrastructure such as tables. To load data into the online store, run feast materialize. See Load data into the online store for more details.

Cleaning up

If you need to clean up the infrastructure created by feast apply, use the teardown command.

feast teardown

Warning: teardown is an irreversible command and will remove all feature store infrastructure. Proceed with caution!