Please see Offline Store for a conceptual explanation of offline stores.
The Dask offline store provides support for reading FileSources.
All data is downloaded and joined using Python and therefore may not scale to production workloads.
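A minimal, hypothetical sketch of a FileSource-backed feature view that the Dask offline store can read (the file path, entity, and field names below are illustrative, not from the docs):

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical Parquet file containing driver statistics.
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

driver = Entity(name="driver", join_keys=["driver_id"])

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)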
The full set of configuration options is available in DaskOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Dask offline store.
Below is a matrix indicating which functionality is supported by DaskRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The BigQuery offline store provides support for reading BigQuerySources.
All joins happen within BigQuery.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to BigQuery as a table (marked for expiration) in order to complete join operations.
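As a hedged illustration of the two entity-dataframe options (the table, feature, and column names here are assumptions, not from the docs), both forms are passed to get_historical_features:

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # repo configured with the BigQuery offline store

# Option 1: entity dataframe as a BigQuery SQL query (joined entirely inside BigQuery).
entity_sql = """
    SELECT driver_id, event_timestamp
    FROM `my_project.my_dataset.orders`
    WHERE event_timestamp BETWEEN '2024-01-01' AND '2024-01-31'
"""
job_from_sql = store.get_historical_features(
    entity_df=entity_sql,
    features=["driver_hourly_stats:conv_rate"],
)

# Option 2: entity dataframe as a Pandas dataframe
# (uploaded to BigQuery as a temporary, expiring table before the join).
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-01-15", "2024-01-16"]),
    }
)
job_from_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)

training_df = job_from_df.to_df()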
The Couchbase Columnar offline store provides support for reading CouchbaseColumnarSources. Note that Couchbase Columnar is available through Couchbase Capella Columnar.
Entity dataframes can be provided as a SQL++ query or can be provided as a Pandas dataframe. A Pandas dataframe will be uploaded to Couchbase Capella Columnar as a collection.
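For instance (a sketch only; the Analytics collection path, feature names, and columns are assumed), an entity dataframe can be supplied as a SQL++ query string:

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # repo configured with the couchbase.offline store

# SQL++ entity query against a hypothetical Capella Columnar collection.
entity_sqlpp = """
    SELECT o.driver_id, o.event_timestamp
    FROM `Default`.`Default`.`orders` AS o
    WHERE o.event_timestamp >= '2024-01-01T00:00:00Z'
"""
job = store.get_historical_features(
    entity_df=entity_sqlpp,
    features=["driver_hourly_stats:conv_rate"],
)
print(job.to_df().head())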
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: no
export to data lake (S3, GCS, etc.): no
export to data warehouse: no
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
type: dask
In order to use this offline store, you'll need to run pip install 'feast[gcp]'. You can get started by then running feast init -t gcp.
The full set of configuration options is available in BigQueryOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the BigQuery offline store.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
Below is a matrix indicating which functionality is supported by BigQueryRetrievalJob.
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): no
export to data warehouse: yes
*See this GitHub issue for details on proposed solutions for enabling the BigQuery offline store to understand tables that use _PARTITIONTIME as the partition column.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The Couchbase Columnar offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[couchbase]'. You can get started by then running feast init -t couchbase.
To get started with Couchbase Capella Columnar:
Sign up for a Couchbase Capella account
Create an Access Control Account
This account should be able to read and write.
For testing purposes, it is recommended to assign all roles to avoid any permission issues.
You must allow the IP address of the machine running Feast.
Note that timeout is an optional parameter. The full set of configuration options is available in CouchbaseColumnarOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Couchbase Columnar offline store.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no
Below is a matrix indicating which functionality is supported by CouchbaseColumnarRetrievalJob.
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The Clickhouse offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[clickhouse]'.
Note that use_temporary_tables_for_entity_df is an optional parameter. The full set of configuration options is available in ClickhouseOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Clickhouse offline store.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): no
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no
Below is a matrix indicating which functionality is supported by ClickhouseRetrievalJob.
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The Remote Offline Store is an Arrow Flight client for the offline store that implements the RemoteOfflineStore class using the existing OfflineStore interface. The client implements various methods, including get_historical_features, pull_latest_from_table_or_query, write_logged_features, and offline_write_batch.
The user needs to create a client-side feature_store.yaml file, set the offline_store type to remote, and provide the server connection configuration, including the host and the port (default 8815) that the Arrow Flight client uses to connect to the Arrow Flight server.
A complete example can be found in the Feast examples. Please see the offline feature server documentation for details on how to configure the server, and refer to the relevant documentation for more details on how to configure authentication and authorization.
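Once the client-side feature_store.yaml points at the server, usage looks like ordinary local usage; the remote offline store forwards the calls over Arrow Flight. A minimal sketch, with illustrative entity values and feature names:

import pandas as pd
from feast import FeatureStore

# The repo's feature_store.yaml sets offline_store.type: remote with host/port.
store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2024-06-01", "2024-06-02"]),
    }
)

# Executed by the offline feature server; results stream back via Arrow Flight.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
).to_df()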
Here are the methods exposed by the OfflineStore interface, along with the core functionality supported by the method:
get_historical_features: point-in-time correct join to retrieve historical features
pull_latest_from_table_or_query: retrieve latest feature values for materialization into the online store
pull_all_from_table_or_query: retrieve a saved dataset
offline_write_batch: persist dataframes to the offline store, primarily for push sources
write_logged_features: persist logged features to the offline store, for feature logging
The first three of these methods all return a RetrievalJob specific to an offline store, such as a SnowflakeRetrievalJob. Here is a list of functionality supported by RetrievalJobs (a short usage sketch follows this list):
export to dataframe
export to arrow table
export to arrow batches (to handle large datasets in memory)
export to SQL
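As a minimal sketch of that RetrievalJob flow (the feature and entity names are placeholders):

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": pd.to_datetime(["2024-06-01"]),
    }
)

# get_historical_features returns a RetrievalJob; it is executed lazily,
# when one of the export methods is called.
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)

df = job.to_df()        # export to a Pandas dataframe
table = job.to_arrow()  # export to an Arrow table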
There are currently four core offline store implementations: DaskOfflineStore, BigQueryOfflineStore, SnowflakeOfflineStore, and RedshiftOfflineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOfflineStore, SparkOfflineStore, and TrinoOfflineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific offline store, such as how to configure it in a feature_store.yaml, can be found here.
Below is a matrix indicating which offline stores support which methods.
Below is a matrix indicating which RetrievalJobs support what functionality.
The Snowflake offline store provides support for reading SnowflakeSources.
All joins happen within Snowflake.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Snowflake as a temporary table in order to complete join operations.
In order to use this offline store, you'll need to run pip install 'feast[snowflake]'.
If you're using a file based registry, then you'll also need to install the relevant cloud extra (pip install 'feast[snowflake, CLOUD]' where CLOUD is one of aws, gcp, azure)
You can get started by then running feast init -t snowflake.
The full set of configuration options is available in SnowflakeOfflineStoreConfig.
Please be aware that there is a restriction/limitation when using SQL query strings in Feast with Snowflake: avoid single quotes inside the SQL query string. For example, the following query string will fail:
That 'value' will fail in Snowflake. Instead, please use pairs of dollar signs, such as $$value$$ (Snowflake's dollar-quoted string constants).
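A hedged sketch of the workaround, using a dollar-quoted string constant inside an entity SQL query (the table, column, and feature names are made up):

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # repo configured with the snowflake.offline store

# A single-quoted literal such as WHERE other_column = 'value' can trip up
# Feast's query handling; dollar-quoted string constants avoid the issue.
entity_sql = """
    SELECT driver_id, event_timestamp
    FROM some_table
    WHERE other_column = $$value$$
"""

job = store.get_historical_features(
    entity_df=entity_sql,
    features=["driver_hourly_stats:conv_rate"],
)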
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Snowflake offline store.
Below is a matrix indicating which functionality is supported by SnowflakeRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
project: my_feature_repo
registry: gs://my-bucket/data/registry.db
provider: gcp
offline_store:
type: bigquery
dataset: feast_bq_dataset
project: my_project
registry: data/registry.db
provider: local
offline_store:
type: couchbase.offline
connection_string: COUCHBASE_COLUMNAR_CONNECTION_STRING # Copied from Settings > Connection String page in Capella Columnar console, starts with couchbases://
user: COUCHBASE_COLUMNAR_USER # Couchbase cluster access name from Settings > Access Control page in Capella Columnar console
password: COUCHBASE_COLUMNAR_PASSWORD # Couchbase password from Settings > Access Control page in Capella Columnar console
timeout: 120 # Timeout in seconds for Columnar operations, optional
online_store:
path: data/online_store.db
project: my_project
registry: data/registry.db
provider: local
offline_store:
type: feast.infra.offline_stores.contrib.clickhouse_offline_store.clickhouse.ClickhouseOfflineStore
host: DB_HOST
port: DB_PORT
database: DB_NAME
user: DB_USERNAME
password: DB_PASSWORD
use_temporary_tables_for_entity_df: true
online_store:
path: data/online_store.db
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data*: partial
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
export to data warehouse
export as Spark dataframe
local execution of Python-based on-demand transforms
remote execution of Python-based on-demand transforms
persist results in the offline store
preview the query plan before execution (RetrievalJobs are lazily executed)
read partitioned data
yes
yes
pull_latest_from_table_or_query
yes
yes
yes
yes
yes
yes
yes
yes
pull_all_from_table_or_query
yes
yes
yes
yes
yes
yes
yes
yes
offline_write_batch
yes
yes
yes
yes
no
no
no
no
write_logged_features
yes
yes
yes
yes
no
no
no
no
yes
yes
yes
yes
export to arrow table
yes
yes
yes
yes
yes
yes
yes
yes
yes
export to arrow batches
no
no
no
yes
no
no
no
no
no
export to SQL
no
yes
yes
yes
yes
no
yes
no
yes
export to data lake (S3, GCS, etc.)
no
no
yes
no
yes
no
no
no
yes
export to data warehouse
no
yes
yes
yes
yes
no
no
no
yes
export as Spark dataframe
no
no
yes
no
no
yes
no
no
no
local execution of Python-based on-demand transforms
yes
yes
yes
yes
yes
no
yes
yes
yes
remote execution of Python-based on-demand transforms
no
no
no
no
no
no
no
no
no
persist results in the offline store
yes
yes
yes
yes
yes
yes
no
yes
yes
preview the query plan before execution
yes
yes
yes
yes
yes
yes
yes
no
yes
read partitioned data
yes
yes
yes
yes
yes
yes
yes
yes
yes
get_historical_features
yes
yes
yes
yes
yes
export to dataframe
yes
yes
yes
yes
yes
yes
yes
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
export to dataframe: yes
export to arrow table: yes
export to arrow batches: yes
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
export as Spark dataframe
offline_store:
type: remote
host: localhost
port: 8815
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
type: snowflake.offline
account: snowflake_deployment.us-east-1
user: user_login
password: user_password
role: SYSADMIN
warehouse: COMPUTE_WH
database: FEAST
schema: PUBLIC
SELECT
some_column
FROM
some_table
WHERE
other_column = 'value'
The DuckDB offline store provides support for reading FileSources. It can read both Parquet and Delta formats. The DuckDB offline store uses Ibis under the hood to convert offline store operations to DuckDB queries.
Entity dataframes can be provided as a Pandas dataframe.
In order to use this offline store, you'll need to run pip install 'feast[duckdb]'.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the DuckDB offline store.
Below is a matrix indicating which functionality is supported by IbisRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: no
export to data lake (S3, GCS, etc.): no
export to data warehouse: no
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: no
read partitioned data: yes
project: my_project
registry: data/registry.db
provider: local
offline_store:
type: duckdb
online_store:
path: data/online_store.db
The Spark offline store provides support for reading SparkSources.
Entity dataframes can be provided as a SQL query, a Pandas dataframe, or a PySpark dataframe. A Pandas dataframe will be converted to a Spark dataframe and processed as a temporary view.
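For example (a sketch only, assuming an active SparkSession and an existing driver_hourly_stats feature view), the entity dataframe can be a PySpark dataframe:

from pyspark.sql import SparkSession, functions as F
from feast import FeatureStore

spark = SparkSession.builder.getOrCreate()
store = FeatureStore(repo_path=".")  # repo configured with the spark offline store

# Entity rows built directly as a Spark dataframe; a Pandas dataframe would be
# converted to a Spark dataframe and registered as a temporary view instead.
entity_df = spark.createDataFrame(
    [(1001, "2024-06-01 00:00:00"), (1002, "2024-06-02 00:00:00")],
    ["driver_id", "event_timestamp"],
).withColumn("event_timestamp", F.to_timestamp("event_timestamp"))

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
).to_df()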
The Spark offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[spark]'. You can get started by then running feast init -t spark.
The full set of configuration options is available in SparkOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Spark offline store.
Below is a matrix indicating which functionality is supported by SparkRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
The PostgreSQL offline store provides support for reading PostgreSQLSources.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Postgres as a table in order to complete join operations.
The PostgreSQL offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[postgres]'. You can get started by then running feast init -t postgres.
Note that sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional parameters. The full set of configuration options is available in PostgreSQLOfflineStoreConfig.
Additionally, a new optional parameter entity_select_mode controls how Postgres loads the entity data. By default (temp_table), a temporary table is created and the entity dataframe or SQL is loaded into that table. The value embed_query instead inlines the SQL query directly into a CTE, improving performance and skipping the need to CREATE and DROP the temporary table.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the PostgreSQL offline store.
Below is a matrix indicating which functionality is supported by PostgreSQLRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
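A short sketch (the schema, column, and feature names are assumptions) of supplying the entity dataframe as SQL; with entity_select_mode: embed_query this query is inlined as a CTE rather than staged in a temporary table:

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # repo configured with the postgres offline store

# Entity dataframe expressed as SQL against a hypothetical events table.
entity_sql = """
    SELECT driver_id, event_timestamp
    FROM analytics.ride_events
    WHERE event_timestamp >= NOW() - INTERVAL '7 days'
"""

training_df = store.get_historical_features(
    entity_df=entity_sql,
    features=["driver_hourly_stats:conv_rate"],
).to_df()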
yes
local execution of Python-based on-demand transforms: no
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: no
export to data lake (S3, GCS, etc.): no
export to data warehouse: no
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no
export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
export as Spark dataframe
project: my_project
registry: data/registry.db
provider: local
offline_store:
type: spark
spark_conf:
spark.master: "local[*]"
spark.ui.enabled: "false"
spark.eventLog.enabled: "false"
spark.sql.catalogImplementation: "hive"
spark.sql.parser.quotedRegexColumnNames: "true"
spark.sql.session.timeZone: "UTC"
spark.sql.execution.arrow.fallback.enabled: "true"
spark.sql.execution.arrow.pyspark.enabled: "true"
online_store:
path: data/online_store.db
project: my_project
registry: data/registry.db
provider: local
offline_store:
type: postgres
host: DB_HOST
port: DB_PORT
database: DB_NAME
db_schema: DB_SCHEMA
user: DB_USERNAME
password: DB_PASSWORD
sslmode: verify-ca
sslkey_path: /path/to/client-key.pem
sslcert_path: /path/to/client-cert.pem
sslrootcert_path: /path/to/server-ca.pem
entity_select_mode: temp_table
online_store:
path: data/online_store.db
The Trino offline store does not achieve full test coverage. Please do not assume complete stability.
In order to use this offline store, you'll need to run pip install 'feast[trino]'. You can then run feast init, then swap out feature_store.yaml with the below example to connect to Trino.
The full set of configuration options is available in TrinoOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Trino offline store.
get_historical_features (point-in-time correct join)
yes
pull_latest_from_table_or_query (retrieve latest feature values)
yes
pull_all_from_table_or_query (retrieve a saved dataset)
yes
offline_write_batch (persist dataframes to offline store)
no
write_logged_features (persist logged features to offline store)
no
Below is a matrix indicating which functionality is supported by TrinoRetrievalJob.
export to dataframe
yes
export to arrow table
yes
export to arrow batches
no
export to SQL
yes
export to data lake (S3, GCS, etc.)
no
export to data warehouse
no
To compare this set of functionality against other offline stores, please see the full functionality matrix.
In order to use this offline store, you'll need to run pip install 'feast[azure]'. You can get started by then following this tutorial.
The MsSQL offline store does not achieve full test coverage. Please do not assume complete stability.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the MsSQL offline store.
get_historical_features (point-in-time correct join)
yes
pull_latest_from_table_or_query (retrieve latest feature values)
yes
pull_all_from_table_or_query (retrieve a saved dataset)
yes
offline_write_batch (persist dataframes to offline store)
no
write_logged_features (persist logged features to offline store)
no
Below is a matrix indicating which functionality is supported by MsSqlServerRetrievalJob.
export to dataframe
yes
export to arrow table
yes
export to arrow batches
no
export to SQL
no
export to data lake (S3, GCS, etc.)
no
export to data warehouse
no
To compare this set of functionality against other offline stores, please see the full functionality matrix.
project: feature_repo
project_description: This Feast project is a Trino Offline Store demo.
provider: local
registry: data/registry.db
offline_store:
type: trino
host: ${TRINO_HOST}
port: ${TRINO_PORT}
http-scheme: http
ssl-verify: false
catalog: hive
dataset: ${DATASET_NAME}
# Hive connection as example
connector:
type: hive
file_format: parquet
user: trino
# Enables authentication in Trino connections, pick the one you need
auth:
# Basic Auth
type: basic
config:
username: ${TRINO_USER}
password: ${TRINO_PWD}
# Certificate
type: certificate
config:
cert-file: /path/to/cert/file
key-file: /path/to/key/file
# JWT
type: jwt
config:
token: ${JWT_TOKEN}
# OAuth2 (no config required)
type: oauth2
# Kerberos
type: kerberos
config:
config-file: /path/to/kerberos/config/file
service-name: foo
mutual-authentication: true
force-preemptive: true
hostname-override: custom-hostname
sanitize-mutual-error-response: true
principal: principal-name
delegate: true
ca_bundle: /path/to/ca/bundle/file
online_store:
path: data/online_store.db
# Prevents "Unsupported Hive type: timestamp(3) with time zone" TrinoUserError
coerce_tz_aware: false
entity_key_serialization_version: 3
auth:
type: no_auth
registry:
registry_store_type: AzureRegistryStore
path: ${REGISTRY_PATH} # Environment Variable
project: production
provider: azure
online_store:
type: redis
connection_string: ${REDIS_CONN} # Environment Variable
offline_store:
type: mssql
connection_string: ${SQL_CONN} # Environment Variable
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: no
preview the query plan before execution: yes
read partitioned data: yes
local execution of Python-based on-demand transforms: no
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
The Redshift offline store provides support for reading RedshiftSources.
All joins happen within Redshift.
Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Redshift temporarily in order to complete join operations.
In order to use this offline store, you'll need to run pip install 'feast[aws]'. You can get started by then running feast init -t aws.
The full set of configuration options is available in RedshiftOfflineStoreConfig.
The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redshift offline store.
Below is a matrix indicating which functionality is supported by RedshiftRetrievalJob.
To compare this set of functionality against other offline stores, please see the full functionality matrix.
Feast requires the following permissions in order to execute commands for the Redshift offline store:
The following inline policy can be used to grant Feast the necessary permissions:
In addition to this, the Redshift offline store requires an IAM role that will be used by Redshift itself to interact with S3. More concretely, Redshift has to use this IAM role to run UNLOAD and COPY commands. Once created, this IAM role needs to be configured in the feature_store.yaml file under offline_store: iam_role.
The following inline policy can be used to grant Redshift necessary permissions to access S3:
While the following trust relationship is necessary to make sure that Redshift, and only Redshift, can assume this role:
In order to use Redshift Serverless, specify a workgroup instead of a cluster_id and user.
Please note that the IAM policies above will need the redshift-serverless versions of the resources, rather than the standard redshift ones.
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes
Command: Get Historical Features; Permissions: redshift-data:ExecuteStatement, redshift:GetClusterCredentials; Resources: arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>
Command: Get Historical Features; Permissions: redshift-data:DescribeStatement; Resources: *
Command: Get Historical Features; Permissions: s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject; Resources: arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/*
get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes
export to dataframe: yes
export to arrow table: yes
export to arrow batches: yes
export to SQL: yes
export to data lake (S3, GCS, etc.): no
export to data warehouse: yes
Command: Apply; Permissions: redshift-data:DescribeTable, redshift:GetClusterCredentials; Resources: arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>
Command: Materialize; Permissions: redshift-data:ExecuteStatement; Resources: arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>
Command: Materialize; Permissions: redshift-data:DescribeStatement; Resources: *
Command: Materialize; Permissions: s3:ListBucket, s3:GetObject, s3:DeleteObject; Resources: arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/*
project: my_feature_repo
registry: data/registry.db
provider: aws
offline_store:
type: redshift
region: us-west-2
cluster_id: feast-cluster
database: feast-database
user: redshift-user
s3_staging_location: s3://feast-bucket/redshift
iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role
{
"Statement": [
{
"Action": [
"s3:ListBucket",
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::<bucket_name>/*",
"arn:aws:s3:::<bucket_name>"
]
},
{
"Action": [
"redshift-data:DescribeTable",
"redshift:GetClusterCredentials",
"redshift-data:ExecuteStatement"
],
"Effect": "Allow",
"Resource": [
"arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>",
"arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>",
"arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>"
]
},
{
"Action": [
"redshift-data:DescribeStatement"
],
"Effect": "Allow",
"Resource": "*"
}
],
"Version": "2012-10-17"
}
{
"Statement": [
{
"Action": "s3:*",
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::feast-int-bucket",
"arn:aws:s3:::feast-int-bucket/*"
]
}
],
"Version": "2012-10-17"
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "redshift.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
project: my_feature_repo
registry: data/registry.db
provider: aws
offline_store:
type: redshift
region: us-west-2
workgroup: feast-workgroup
database: feast-database
s3_staging_location: s3://feast-bucket/redshift
iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role
⚠️ Contrib Plugin: The Ray offline store is a contributed plugin. It may not be as stable or fully supported as core offline stores. Use with caution in production and report issues to the Feast community.
The Ray offline store is a data I/O implementation that leverages Ray for reading and writing data from various sources. It focuses on efficient data access operations, while complex feature computation is handled by the Ray Compute Engine.
The Ray offline store provides:
Ray-based data reading from file sources (Parquet, CSV, etc.)
Support for both local and distributed Ray clusters
Integration with various storage backends (local files, S3, GCS, HDFS)
Efficient data filtering and column selection
Timestamp-based data processing with timezone awareness
By default, Ray will use all available system resources (CPU and memory). This can cause issues in test environments or when experimenting locally, potentially leading to system crashes or unresponsiveness.
For testing and local experimentation, we strongly recommend:
Configure resource limits in your feature_store.yaml (see section below)
This will limit Ray to safe resource levels for testing and development.
The Ray offline store follows Feast's architectural separation:
Ray Offline Store: Handles data I/O operations (reading/writing data)
Ray Compute Engine: Handles complex feature computation and joins
Clear Separation: Each component has a single responsibility
For complex feature processing, historical feature retrieval, and distributed joins, use the Ray Compute Engine.
The Ray offline store can be configured in your feature_store.yaml file. Below are two main configuration patterns:
For simple data I/O operations without distributed processing:
For distributed feature processing with advanced capabilities:
For local development and testing:
For production deployments with distributed Ray cluster:
For Ray compute engine configuration options, see the Ray Compute Engine documentation.
By default, Ray will use all available system resources (CPU and memory). This can cause issues in test environments or when experimenting locally, potentially leading to system crashes or unresponsiveness.
For custom resource control, configure limits in your feature_store.yaml:
The Ray offline store provides direct access to underlying data:
The Ray offline store supports batch writing for materialization:
The Ray offline store supports persisting datasets for later analysis:
The Ray offline store supports various remote storage backends:
To use Ray in cluster mode for distributed data access:
Start a Ray cluster:
Configure your feature_store.yaml:
For multiple worker nodes:
The Ray offline store validates data sources to ensure compatibility:
The Ray offline store has the following limitations:
File Sources Only: Currently supports only FileSource data sources
No Direct SQL: Does not support SQL query interfaces
No Online Writes: Cannot write directly to online stores
No Complex Transformations: complex feature computation is delegated to the Ray Compute Engine
For complex feature processing operations, use the Ray offline store in combination with the Ray Compute Engine. See the Ray Offline Store + Compute Engine configuration example in the section above for a complete setup.
For more advanced troubleshooting, refer to the Ray documentation.
Basic Ray Offline Store (local development):
Ray Offline Store + Compute Engine (distributed processing):
For complete examples, see the section above.
ray_conf
dict
None
Ray initialization parameters for resource management (e.g., memory, CPU limits)
enable_ray_logging
false
Enable Ray progress bars and logging
false
get_historical_features: Yes
pull_latest_from_table_or_query: Yes
pull_all_from_table_or_query: Yes
offline_write_batch: Yes
write_logged_features: Yes
export to dataframe: Yes
export to arrow table: Yes
persist results in offline store: Yes
local execution of ODFVs: Yes
preview query plan: Yes
read partitioned data: Yes
type
string
Required
Must be feast.offline_stores.contrib.ray_offline_store.ray.RayOfflineStore or ray
storage_path
string
None
Path for storing temporary files and datasets
ray_address
string
None
broadcast_join_threshold_mb
100
Size threshold for broadcast joins (MB)
25
max_parallelism_multiplier
2
Parallelism as multiple of CPU cores
1
target_partition_size_mb
64
Target partition size (MB)
Address of the Ray cluster (e.g., "localhost:10001")
16
project: my_project
registry: data/registry.db
provider: local
offline_store:
type: ray
storage_path: data/ray_storage # Optional: Path for storing datasets
ray_address: localhost:10001 # Optional: Ray cluster address
project: my_project
registry: data/registry.db
provider: local
# Ray offline store for data I/O operations
offline_store:
type: ray
storage_path: s3://my-bucket/feast-data # Optional: Path for storing datasets
ray_address: localhost:10001 # Optional: Ray cluster address
# Ray compute engine for distributed feature processing
batch_engine:
type: ray.engine
# Resource configuration
max_workers: 8 # Maximum number of Ray workers
max_parallelism_multiplier: 2 # Parallelism as multiple of CPU cores
# Performance optimization
enable_optimization: true # Enable performance optimizations
broadcast_join_threshold_mb: 100 # Broadcast join threshold (MB)
target_partition_size_mb: 64 # Target partition size (MB)
# Distributed join configuration
window_size_for_joins: "1H" # Time window for distributed joins
enable_distributed_joins: true # Enable distributed joins
# Ray cluster configuration (optional)
ray_address: localhost:10001 # Ray cluster address
staging_location: s3://my-bucket/staging # Remote staging location
project: my_local_project
registry: data/registry.db
provider: local
offline_store:
type: ray
storage_path: ./data/ray_storage
# Conservative settings for local development
broadcast_join_threshold_mb: 25
max_parallelism_multiplier: 1
target_partition_size_mb: 16
enable_ray_logging: false
# Memory constraints to prevent OOM in test/development environments
ray_conf:
num_cpus: 1
object_store_memory: 104857600 # 100MB
_memory: 524288000 # 500MB
batch_engine:
type: ray.engine
max_workers: 2
enable_optimization: false
project: my_production_project
registry: s3://my-bucket/registry.db
provider: local
offline_store:
type: ray
storage_path: s3://my-production-bucket/feast-data
ray_address: "ray://production-head-node:10001"
batch_engine:
type: ray.engine
max_workers: 32
max_parallelism_multiplier: 4
enable_optimization: true
broadcast_join_threshold_mb: 50
target_partition_size_mb: 128
window_size_for_joins: "30min"
ray_address: "ray://production-head-node:10001"
staging_location: s3://my-production-bucket/staging
offline_store:
type: ray
storage_path: ./data/ray_storage
# Resource optimization settings
broadcast_join_threshold_mb: 25 # Smaller datasets for broadcast joins
max_parallelism_multiplier: 1 # Reduced parallelism
target_partition_size_mb: 16 # Smaller partition sizes
enable_ray_logging: false # Disable verbose logging
# Memory constraints to prevent OOM in test environments
ray_conf:
num_cpus: 1
object_store_memory: 104857600 # 100MB
_memory: 524288000 # 500MB
offline_store:
type: ray
storage_path: s3://my-bucket/feast-data
ray_address: "ray://production-cluster:10001"
# Optimized for production workloads
broadcast_join_threshold_mb: 100
max_parallelism_multiplier: 2
target_partition_size_mb: 64
enable_ray_logging: true
# feature_store.yaml
offline_store:
type: ray
broadcast_join_threshold_mb: 25
max_parallelism_multiplier: 1
target_partition_size_mb: 16
# feature_store.yaml
offline_store:
type: ray
ray_address: "ray://cluster-head:10001"
broadcast_join_threshold_mb: 200
max_parallelism_multiplier: 4
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define an entity and a feature view backed by a Parquet file source
driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    source=FileSource(
        path="data/driver_stats.parquet",
        timestamp_field="event_timestamp",
    ),
    schema=[
        Field(name="driver_id", dtype=Int64),
        Field(name="avg_daily_trips", dtype=Float32),
    ],
)

# Initialize the feature store (repo directory containing feature_store.yaml)
store = FeatureStore(repo_path=".")

# The Ray offline store handles data I/O operations
# For complex feature computation, use the Ray Compute Engine
from feast.infra.offline_stores.contrib.ray_offline_store.ray import RayOfflineStore
from datetime import datetime, timedelta
# Pull latest data from a table
job = RayOfflineStore.pull_latest_from_table_or_query(
config=store.config,
data_source=driver_stats.source,
join_key_columns=["driver_id"],
feature_name_columns=["avg_daily_trips"],
timestamp_field="event_timestamp",
created_timestamp_column=None,
start_date=datetime.now() - timedelta(days=7),
end_date=datetime.now(),
)
# Convert to pandas DataFrame
df = job.to_df()
print(f"Retrieved {len(df)} rows")
# Convert to Arrow Table
arrow_table = job.to_arrow()
# Get Ray dataset directly
ray_dataset = job.to_ray_dataset()
import pyarrow as pa
from feast import FeatureView
# Create sample data
data = pa.table({
"driver_id": [1, 2, 3, 4, 5],
"avg_daily_trips": [10.5, 15.2, 8.7, 12.3, 9.8],
"event_timestamp": [datetime.now()] * 5
})
# Write batch data
RayOfflineStore.offline_write_batch(
config=store.config,
feature_view=driver_stats,
table=data,
progress=lambda x: print(f"Wrote {x} rows")
)
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage
# Create storage destination
storage = SavedDatasetFileStorage(path="data/training_dataset.parquet")
# Persist the dataset
job.persist(storage, allow_overwrite=False)
# Create a saved dataset in the registry
saved_dataset = store.create_saved_dataset(
from_=job,
name="driver_training_dataset",
storage=storage,
tags={"purpose": "data_access", "version": "v1"}
)
print(f"Saved dataset created: {saved_dataset.name}")# S3 storage
s3_storage = SavedDatasetFileStorage(path="s3://my-bucket/datasets/driver_features.parquet")
job.persist(s3_storage, allow_overwrite=True)
# Google Cloud Storage
gcs_storage = SavedDatasetFileStorage(path="gs://my-project-bucket/datasets/driver_features.parquet")
job.persist(gcs_storage, allow_overwrite=True)
# HDFS
hdfs_storage = SavedDatasetFileStorage(path="hdfs://namenode:8020/datasets/driver_features.parquet")
job.persist(hdfs_storage, allow_overwrite=True)
ray start --head --port=10001
offline_store:
type: ray
ray_address: localhost:10001
storage_path: s3://my-bucket/features
# On worker nodes
ray start --address='head-node-ip:10001'
from feast.infra.offline_stores.contrib.ray_offline_store.ray import RayOfflineStore
# Validate a data source
try:
RayOfflineStore.validate_data_source(store.config, driver_stats.source)
print("Data source is valid")
except Exception as e:
print(f"Data source validation failed: {e}")offline_store:
type: ray
storage_path: ./data/ray_storage
# Conservative settings for local development
broadcast_join_threshold_mb: 25
max_parallelism_multiplier: 1
target_partition_size_mb: 16
enable_ray_logging: false
offline_store:
type: ray
storage_path: s3://my-bucket/feast-data
batch_engine:
type: ray.engine
max_workers: 8
enable_optimization: true
broadcast_join_threshold_mb: 100
from datetime import datetime

import pyarrow as pa
from feast import FeatureStore
from feast.infra.offline_stores.contrib.ray_offline_store.ray import RayOfflineStore

# Initialize feature store (repo directory containing feature_store.yaml)
store = FeatureStore(repo_path=".")
# Get historical features (uses compute engine if configured)
features = store.get_historical_features(entity_df=df, features=["fv:feature"])
# Direct data access (uses offline store)
job = RayOfflineStore.pull_latest_from_table_or_query(...)
df = job.to_df()
# Offline write batch (materialization)
# Create sample data for materialization
data = pa.table({
"driver_id": [1, 2, 3, 4, 5],
"avg_daily_trips": [10.5, 15.2, 8.7, 12.3, 9.8],
"event_timestamp": [datetime.now()] * 5
})
# Write batch to offline store
RayOfflineStore.offline_write_batch(
config=store.config,
feature_view=driver_stats_fv,
table=data,
progress=lambda rows: print(f"Processed {rows} rows")
)