`staging/`: This folder contains the staging `feature_store.yaml` and Feast objects. Users who want to make changes to the Feast deployment in the staging environment commit changes to this directory.
`production/`: This folder contains the production `feature_store.yaml` and Feast objects. Typically, users first test changes in staging and then copy the feature definitions into the production folder before committing the changes.
`.github/`: This folder is an example of a CI system that applies the changes in either the `staging/` or `production/` folder using `feast apply`. This operation saves your feature definitions to a shared registry (for example, on GCS) and configures your infrastructure for serving features.
The `feature_store.yaml` file contains the following:
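For example, a minimal `feature_store.yaml` for a GCP deployment might look like this (the project name, registry bucket, and store choices below are placeholders for illustration, not the exact file from this repository):

```yaml
project: my_feature_repo
provider: gcp
registry: gs://my-bucket/feast/registry.db
online_store:
  type: datastore
offline_store:
  type: bigquery
```

Pointing `registry` at a shared cloud storage path is what lets CI, training pipelines, and serving clients all see the same feature definitions.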
All objects applied via `feast apply` are tracked in the registry file, `registry.db`. This registry will be accessed later by the Feast SDK in your training pipelines or model serving services in order to read features.
Once you have set up a CI system that runs `feast apply` on changes, your infrastructure (offline store, online store, and cloud environment) will automatically be updated to support the loading of data into the feature store or the retrieval of data.
Each time `materialize-incremental` is run, Feast loads data starting from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not yet been made available. This is commonly the case when your source is an ETL pipeline that runs on a daily schedule.
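To make the interval logic concrete, here is a minimal sketch in plain Python (not Feast code; `next_incremental_window` is a hypothetical helper) of how an incremental run derives its load window from the previous end date:

```python
from datetime import datetime

def next_incremental_window(previous_end, now):
    # Hypothetical illustration: an incremental run always starts where the
    # previous run ended and extends to "now". Rows that land later with
    # timestamps before `previous_end` will never be picked up, which is why
    # the run should be scheduled after the upstream ETL has finished.
    return previous_end, now

start, end = next_incremental_window(
    previous_end=datetime(2023, 5, 1, 3, 0),  # end of yesterday's run
    now=datetime(2023, 5, 2, 3, 0),           # an hour after the daily ETL
)
```

Scheduling the run safely after the daily ETL completes guarantees the window only covers data that is already available.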
This example materializes the `driver_hourly_stats` feature view over a day. The command can be scheduled as the final operation in your Airflow ETL, which runs after you have computed your features and stored them in the source location. Feast will then load your feature data into your online store.
It is up to you which scheduler or orchestrator to use to periodically run `$ feast materialize`. Feast keeps the history of materialization in its registry, so the choice can be as simple as a Unix cron utility. Cron should be sufficient when you have just a few materialization jobs (usually one materialization job per feature view) triggered infrequently. However, the amount of work can quickly outgrow the resources of a single machine, because a materialization job needs to repackage all rows before writing them to the online store, which leads to high CPU and memory utilization. In that case, you might want to use a job orchestrator to run multiple jobs in parallel across several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.
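As a sketch of the simple end of that spectrum, a single crontab entry could trigger an hourly incremental materialization (the repository path and schedule are placeholders; note that `%` must be escaped in crontab):

```
# m  h  dom mon dow  command
0  *  *   *   *  cd /path/to/feature_repo && feast materialize-incremental $(date -u +\%Y-\%m-\%dT\%H:\%M:\%S)
```

Because Feast records the previous end date in its registry, each invocation only loads the data that arrived since the last run.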
The first step in a training or serving pipeline is to create a `FeatureStore` object with a path to the registry.
One way to ensure your production clients have access to the feature store is to provide a copy of the `feature_store.yaml` to those pipelines. This `feature_store.yaml` file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.
For example, you can instantiate a `FeatureStore` object, fetch online features, and then make a prediction:
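A minimal sketch of that flow follows. The feature view name `driver_hourly_stats` comes from earlier on this page; the feature names `conv_rate` and `acc_rate`, the entity key, the repo path, and the `model` object are assumptions for illustration:

```python
from feast import FeatureStore

# Point the SDK at the same registry the CI pipeline maintains.
store = FeatureStore(repo_path="production/")

# Fetch the latest feature values for a single driver entity.
feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",  # assumed feature names
        "driver_hourly_stats:acc_rate",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

# `model` is a hypothetical, already-loaded model object.
prediction = model.predict(feature_vector)
```

The online store serves only the latest materialized values, so this lookup is a low-latency key-value read rather than a query against the offline store.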
the `feature-server` and the `transformation-service`. Both must have read access to the registry file on cloud storage. Each keeps a copy of the registry in memory and refreshes it periodically, so expect some delay in update propagation in exchange for better performance.
You can use a `foreachBatch` stream writer in PySpark like this:
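The snippet this sentence introduces is not reproduced here; as a sketch (assuming a push source named `driver_stats_push_source` is defined in the feature repo, and that `streaming_df` is an existing Structured Streaming DataFrame), it could look like:

```python
from feast import FeatureStore

store = FeatureStore(repo_path="production/")

def feast_writer(batch_df, batch_id):
    # Convert each micro-batch to pandas and push it into the online store
    # via a push source (the source name here is an assumption).
    store.push("driver_stats_push_source", batch_df.toPandas())

streaming_df.writeStream.foreachBatch(feast_writer).start()
```

`foreachBatch` hands Feast one bounded micro-batch at a time, which keeps the write path simple at the cost of per-batch conversion to pandas.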