
Offline stores

Please see Offline Store for a conceptual explanation of offline stores.

This section covers: Overview, Dask, Snowflake, BigQuery, Redshift, DuckDB, Couchbase Columnar (contrib), Spark (contrib), PostgreSQL (contrib), Trino (contrib), and Azure Synapse + Azure SQL (contrib).

Dask

Description

The Dask offline store provides support for reading FileSources.

All data is downloaded and joined using Python and therefore may not scale to production workloads.

Example

The full set of configuration options is available in DaskOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Dask offline store.

get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes

Below is a matrix indicating which functionality is supported by DaskRetrievalJob.

export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: no
export to data lake (S3, GCS, etc.): no
export to data warehouse: no
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes

To compare this set of functionality against other offline stores, please see the full functionality matrix.

BigQuery

Description

The BigQuery offline store provides support for reading BigQuerySources.

  • All joins happen within BigQuery.

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to BigQuery as a table (marked for expiration) in order to complete join operations (see the sketch below).
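
As a sketch of providing the entity dataframe as a SQL query (assuming a feature view named driver_hourly_stats and an entity table my_project.my_dataset.entity_rows, both illustrative), the point-in-time join then runs entirely inside BigQuery:

from feast import FeatureStore

# Feature repo whose feature_store.yaml configures the BigQuery offline store.
store = FeatureStore(repo_path=".")

# The entity dataframe is expressed as SQL, so no local data needs to be uploaded.
job = store.get_historical_features(
    entity_df="SELECT driver_id, event_timestamp FROM `my_project.my_dataset.entity_rows`",
    features=["driver_hourly_stats:avg_daily_trips"],
)
training_df = job.to_df()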

Couchbase Columnar (contrib)

Description

The Couchbase Columnar offline store provides support for reading CouchbaseColumnarSources. Note that Couchbase Columnar is available through Couchbase Capella.

  • Entity dataframes can be provided as a SQL++ query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Couchbase Capella Columnar as a collection.

Clickhouse (contrib)

Description

The Clickhouse offline store provides support for reading ClickhouseSource.

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Clickhouse as a table (a temporary table by default) in order to complete join operations.

The following example feature_store.yaml configures the Dask offline store.

feature_store.yaml
project: my_feature_repo
registry: data/registry.db
provider: local
offline_store:
  type: dask

Getting started

In order to use this offline store, you'll need to run pip install 'feast[gcp]'. You can get started by then running feast init -t gcp.

Example

The full set of configuration options is available in BigQueryOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the BigQuery offline store.

get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes

Below is a matrix indicating which functionality is supported by BigQueryRetrievalJob.

export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): no
export to data warehouse: yes
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data*: partial

*See GitHub issue for details on proposed solutions for enabling the BigQuery offline store to understand tables that use _PARTITIONTIME as the partition column.

To compare this set of functionality against other offline stores, please see the full functionality matrix.

Disclaimer

The Couchbase Columnar offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[couchbase]'. You can get started by then running feast init -t couchbase.

To get started with Couchbase Capella Columnar:

  1. Sign up for a Couchbase Capella account

  2. Deploy a Columnar cluster

  3. Create an Access Control Account

    • This account should be able to read and write.

    • For testing purposes, it is recommended to assign all roles to avoid any permission issues.

  4. Configure allowed IP addresses

    • You must allow the IP address of the machine running Feast.

Example

Note that timeout is an optional parameter. The full set of configuration options is available in CouchbaseColumnarOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Couchbase Columnar offline store.

get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no

Below is a matrix indicating which functionality is supported by CouchbaseColumnarRetrievalJob.

export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes

To compare this set of functionality against other offline stores, please see the full functionality matrix.

Disclaimer

The Clickhouse offline store does not achieve full test coverage. Please do not assume complete stability.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[clickhouse]'.

Example

Note that use_temporary_tables_for_entity_df is an optional parameter. The full set of configuration options is available in ClickhouseOfflineStoreConfig.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Clickhouse offline store.

get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): no
offline_write_batch (persist dataframes to offline store): no
write_logged_features (persist logged features to offline store): no

Below is a matrix indicating which functionality is supported by ClickhouseRetrievalJob.

export to dataframe: yes
export to arrow table: yes
export to arrow batches: no
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
export as Spark dataframe: no
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes

To compare this set of functionality against other offline stores, please see the full functionality matrix.


Remote Offline

Description

The Remote Offline Store is an Arrow Flight client for the offline store that implements the RemoteOfflineStore class using the existing OfflineStore interface. The client implements various methods, including get_historical_features, pull_latest_from_table_or_query, write_logged_features, and offline_write_batch.

How to configure the client

To configure the client, create a client-side feature_store.yaml, set the offline_store type to remote, and provide the server connection configuration, including the host and the port (default 8815) that the Arrow Flight client needs to connect to the Arrow Flight server.

Client Example

The complete example can be found under remote-offline-store-example.

How to configure the server

Please see offline-feature-server.md for details on how to configure the offline feature server.

How to configure Authentication and Authorization

Please refer to the authentication and authorization page for more details on how to configure authentication and authorization.
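
As a minimal usage sketch, assuming a client-side feature_store.yaml with offline_store type remote as described above and a feature view named driver_hourly_stats (illustrative), the client uses the regular FeatureStore API and the calls are forwarded to the Arrow Flight server:

from datetime import datetime

import pandas as pd
from feast import FeatureStore

# Repo containing the client-side feature_store.yaml (offline_store type: remote).
store = FeatureStore(repo_path=".")

entity_df = pd.DataFrame(
    {"driver_id": [1001, 1002], "event_timestamp": [datetime.now(), datetime.now()]}
)

# Executed by the offline feature server; results are returned as Arrow data.
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)
table = job.to_arrow()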

Overview

Functionality

Here are the methods exposed by the OfflineStore interface, along with the core functionality supported by the method:

  • get_historical_features: point-in-time correct join to retrieve historical features

  • pull_latest_from_table_or_query: retrieve latest feature values for materialization into the online store

  • pull_all_from_table_or_query: retrieve a saved dataset

  • offline_write_batch: persist dataframes to the offline store, primarily for push sources

  • write_logged_features: persist logged features to the offline store, for feature logging

The first three of these methods all return a RetrievalJob specific to an offline store, such as a SnowflakeRetrievalJob. Here is a list of functionality supported by RetrievalJobs (a brief usage sketch follows the list):

  • export to dataframe

  • export to arrow table

  • export to arrow batches (to handle large datasets in memory)

  • export to SQL

  • export to data lake (S3, GCS, etc.)

  • export to data warehouse

  • export as Spark dataframe

  • local execution of Python-based on-demand transforms

  • remote execution of Python-based on-demand transforms

  • persist results in the offline store

  • preview the query plan before execution (RetrievalJobs are lazily executed)

  • read partitioned data
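
As a rough sketch of how a RetrievalJob is used (assuming a repo with a feature view named driver_hourly_stats; names are illustrative, and availability of each method depends on the offline store as per the matrices below):

from datetime import datetime

import pandas as pd
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")
entity_df = pd.DataFrame({"driver_id": [1001], "event_timestamp": [datetime.now()]})

# get_historical_features returns a RetrievalJob; it is lazily executed.
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:avg_daily_trips"],
)

df = job.to_df()        # export to dataframe (triggers execution)
table = job.to_arrow()  # export to arrow table

# persist results in the offline store (file-based saved dataset storage shown here)
storage = SavedDatasetFileStorage(path="data/my_training_dataset.parquet")
job.persist(storage, allow_overwrite=True)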

Functionality Matrix

There are currently four core offline store implementations: DaskOfflineStore, BigQueryOfflineStore, SnowflakeOfflineStore, and RedshiftOfflineStore. There are several additional implementations contributed by the Feast community (PostgreSQLOfflineStore, SparkOfflineStore, and TrinoOfflineStore), which are not guaranteed to be stable or to match the functionality of the core implementations. Details for each specific offline store, such as how to configure it in a feature_store.yaml, can be found here.

Below is a matrix indicating which offline stores support which methods.

| Method | Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | Couchbase |
|---|---|---|---|---|---|---|---|---|
| get_historical_features | yes | yes | yes | yes | yes | yes | yes | yes |
| pull_latest_from_table_or_query | yes | yes | yes | yes | yes | yes | yes | yes |
| pull_all_from_table_or_query | yes | yes | yes | yes | yes | yes | yes | yes |
| offline_write_batch | yes | yes | yes | yes | no | no | no | no |
| write_logged_features | yes | yes | yes | yes | no | no | no | no |

Below is a matrix indicating which RetrievalJobs support what functionality.

| Functionality | Dask | BigQuery | Snowflake | Redshift | Postgres | Spark | Trino | DuckDB | Couchbase |
|---|---|---|---|---|---|---|---|---|---|
| export to dataframe | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| export to arrow table | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| export to arrow batches | no | no | no | yes | no | no | no | no | no |
| export to SQL | no | yes | yes | yes | yes | no | yes | no | yes |
| export to data lake (S3, GCS, etc.) | no | no | yes | no | yes | no | no | no | yes |
| export to data warehouse | no | yes | yes | yes | yes | no | no | no | yes |
| export as Spark dataframe | no | no | yes | no | no | yes | no | no | no |
| local execution of Python-based on-demand transforms | yes | yes | yes | yes | yes | no | yes | yes | yes |
| remote execution of Python-based on-demand transforms | no | no | no | no | no | no | no | no | no |
| persist results in the offline store | yes | yes | yes | yes | yes | yes | no | yes | yes |
| preview the query plan before execution | yes | yes | yes | yes | yes | yes | yes | no | yes |
| read partitioned data | yes | yes | yes | yes | yes | yes | yes | yes | yes |

Snowflake

Description

The Snowflake offline store provides support for reading SnowflakeSources.

  • All joins happen within Snowflake.

  • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Snowflake as a temporary table in order to complete join operations.

Getting started

In order to use this offline store, you'll need to run pip install 'feast[snowflake]'.

If you're using a file based registry, then you'll also need to install the relevant cloud extra (pip install 'feast[snowflake, CLOUD]' where CLOUD is one of aws, gcp, azure)

You can get started by then running feast init -t snowflake.

Example

The full set of configuration options is available in SnowflakeOfflineStoreConfig.

Limitation

Please be aware that there is a restriction/limitation when using SQL query strings in Feast with Snowflake. Try to avoid the use of single quotes in SQL query strings. For example, the following query string will fail:

That 'value' will fail in Snowflake. Instead, please use pairs of dollar signs like $$value$$, as mentioned in the Snowflake documentation.

Functionality Matrix

The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Snowflake offline store.

get_historical_features (point-in-time correct join): yes
pull_latest_from_table_or_query (retrieve latest feature values): yes
pull_all_from_table_or_query (retrieve a saved dataset): yes
offline_write_batch (persist dataframes to offline store): yes
write_logged_features (persist logged features to offline store): yes

Below is a matrix indicating which functionality is supported by SnowflakeRetrievalJob.

export to dataframe: yes
export to arrow table: yes
export to arrow batches: yes
export to SQL: yes
export to data lake (S3, GCS, etc.): yes
export to data warehouse: yes
export as Spark dataframe: yes
local execution of Python-based on-demand transforms: yes
remote execution of Python-based on-demand transforms: no
persist results in the offline store: yes
preview the query plan before execution: yes
read partitioned data: yes

To compare this set of functionality against other offline stores, please see the full functionality matrix.

feature_store.yaml (example configuration for the BigQuery offline store)
project: my_feature_repo
registry: gs://my-bucket/data/registry.db
provider: gcp
offline_store:
  type: bigquery
  dataset: feast_bq_dataset
feature_store.yaml (example configuration for the Couchbase Columnar offline store)
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: couchbase.offline
  connection_string: COUCHBASE_COLUMNAR_CONNECTION_STRING # Copied from Settings > Connection String page in Capella Columnar console, starts with couchbases://
  user: COUCHBASE_COLUMNAR_USER # Couchbase cluster access name from Settings > Access Control page in Capella Columnar console
  password: COUCHBASE_COLUMNAR_PASSWORD # Couchbase password from Settings > Access Control page in Capella Columnar console
  timeout: 120 # Timeout in seconds for Columnar operations, optional
online_store:
    path: data/online_store.db
feature_store.yaml (example configuration for the Clickhouse offline store)
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: feast.infra.offline_stores.contrib.clickhouse_offline_store.clickhouse.ClickhouseOfflineStore
  host: DB_HOST
  port: DB_PORT
  database: DB_NAME
  user: DB_USERNAME
  password: DB_PASSWORD
  use_temporary_tables_for_entity_df: true
online_store:
    path: data/online_store.db


    feature_store.yaml (client-side configuration for the remote offline store)
    offline_store:
      type: remote
      host: localhost
      port: 8815
    feature_store.yaml (example configuration for the Snowflake offline store)
    project: my_feature_repo
    registry: data/registry.db
    provider: local
    offline_store:
      type: snowflake.offline
      account: snowflake_deployment.us-east-1
      user: user_login
      password: user_password
      role: SYSADMIN
      warehouse: COMPUTE_WH
      database: FEAST
      schema: PUBLIC
    The following query string will fail in Snowflake because of the single-quoted 'value' (see the Snowflake Limitation section above):
    SELECT
        some_column
    FROM
        some_table
    WHERE
        other_column = 'value'
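
    As a sketch of the workaround described in the Snowflake Limitation section, the same query written with dollar-quoted strings, which Snowflake accepts:
    SELECT
        some_column
    FROM
        some_table
    WHERE
        other_column = $$value$$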

    DuckDB

    Description

    The DuckDB offline store provides support for reading FileSources. It can read both Parquet and Delta formats. The DuckDB offline store uses Ibis under the hood to convert offline store operations into DuckDB queries.

    • Entity dataframes can be provided as a Pandas dataframe.

    Getting started

    In order to use this offline store, you'll need to run pip install 'feast[duckdb]'.

    Example

    Functionality Matrix

    The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the DuckDB offline store.

    get_historical_features (point-in-time correct join): yes
    pull_latest_from_table_or_query (retrieve latest feature values): yes
    pull_all_from_table_or_query (retrieve a saved dataset): yes
    offline_write_batch (persist dataframes to offline store): yes
    write_logged_features (persist logged features to offline store): yes

    Below is a matrix indicating which functionality is supported by IbisRetrievalJob.

    export to dataframe: yes
    export to arrow table: yes
    export to arrow batches: no
    export to SQL: no
    export to data lake (S3, GCS, etc.): no
    export to data warehouse: no
    export as Spark dataframe: no
    local execution of Python-based on-demand transforms: yes
    remote execution of Python-based on-demand transforms: no
    persist results in the offline store: yes
    preview the query plan before execution: no
    read partitioned data: yes

    To compare this set of functionality against other offline stores, please see the full functionality matrix.


    feature_store.yaml (example configuration for the DuckDB offline store)
    project: my_project
    registry: data/registry.db
    provider: local
    offline_store:
        type: duckdb
    online_store:
        path: data/online_store.db

    Spark (contrib)

    Description

    The Spark offline store provides support for reading SparkSources.

    • Entity dataframes can be provided as a SQL query, a Pandas dataframe, or a PySpark dataframe. A Pandas dataframe will be converted to a Spark dataframe and processed as a temporary view (see the sketch below).
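
    A minimal sketch of passing a PySpark dataframe as the entity dataframe (assuming a feature view named driver_hourly_stats and a repo configured with the Spark offline store; names are illustrative):

    from datetime import datetime

    from feast import FeatureStore
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    store = FeatureStore(repo_path=".")

    # The PySpark dataframe is registered as a temporary view for the join.
    entity_df = spark.createDataFrame(
        [(1001, datetime.now())], ["driver_id", "event_timestamp"]
    )

    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=["driver_hourly_stats:avg_daily_trips"],
    ).to_df()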

    Disclaimer

    The Spark offline store does not achieve full test coverage. Please do not assume complete stability.

    Getting started

    In order to use this offline store, you'll need to run pip install 'feast[spark]'. You can get started by then running feast init -t spark.

    Example

    The full set of configuration options is available in SparkOfflineStoreConfig.

    Functionality Matrix

    The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Spark offline store.

    get_historical_features (point-in-time correct join): yes
    pull_latest_from_table_or_query (retrieve latest feature values): yes
    pull_all_from_table_or_query (retrieve a saved dataset): yes
    offline_write_batch (persist dataframes to offline store): no
    write_logged_features (persist logged features to offline store): no

    Below is a matrix indicating which functionality is supported by SparkRetrievalJob.

    export to dataframe: yes
    export to arrow table: yes
    export to arrow batches: no
    export to SQL: no
    export to data lake (S3, GCS, etc.): no
    export to data warehouse: no
    export as Spark dataframe: yes
    local execution of Python-based on-demand transforms: no
    remote execution of Python-based on-demand transforms: no
    persist results in the offline store: yes
    preview the query plan before execution: yes
    read partitioned data: yes

    To compare this set of functionality against other offline stores, please see the full functionality matrix.

    PostgreSQL (contrib)

    Description

    The PostgreSQL offline store provides support for reading PostgreSQLSources.

    • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Postgres as a table in order to complete join operations.

    Disclaimer

    The PostgreSQL offline store does not achieve full test coverage. Please do not assume complete stability.

    Getting started

    In order to use this offline store, you'll need to run pip install 'feast[postgres]'. You can get started by then running feast init -t postgres.

    Example

    Note that sslmode, sslkey_path, sslcert_path, and sslrootcert_path are optional parameters. The full set of configuration options is available in PostgreSQLOfflineStoreConfig.

    Additionally, a new optional parameter entity_select_mode was added to control how Postgres loads the entity data. By default (temp_table), a temporary table is created and the entity dataframe or SQL query is loaded into that table. A new value, embed_query, allows the SQL query to be loaded directly into a CTE, improving performance and skipping the need to CREATE and DROP the temporary table.
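
    A sketch of the offline_store fragment using embed_query instead of the default temp_table (values are placeholders):

    offline_store:
      type: postgres
      host: DB_HOST
      port: DB_PORT
      database: DB_NAME
      db_schema: DB_SCHEMA
      user: DB_USERNAME
      password: DB_PASSWORD
      entity_select_mode: embed_query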

    Functionality Matrix

    The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the PostgreSQL offline store.

    get_historical_features (point-in-time correct join): yes
    pull_latest_from_table_or_query (retrieve latest feature values): yes
    pull_all_from_table_or_query (retrieve a saved dataset): yes
    offline_write_batch (persist dataframes to offline store): no
    write_logged_features (persist logged features to offline store): no

    Below is a matrix indicating which functionality is supported by PostgreSQLRetrievalJob.

    export to dataframe: yes
    export to arrow table: yes
    export to arrow batches: no
    export to SQL: yes
    export to data lake (S3, GCS, etc.): yes
    export to data warehouse: yes
    export as Spark dataframe: no
    local execution of Python-based on-demand transforms: yes
    remote execution of Python-based on-demand transforms: no
    persist results in the offline store: yes
    preview the query plan before execution: yes
    read partitioned data: yes

    To compare this set of functionality against other offline stores, please see the full functionality matrix.

    Trino (contrib)

    Description

    The Trino offline store provides support for reading TrinoSources.

    • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Trino as a table in order to complete join operations.

    Azure Synapse + Azure SQL (contrib)

    Description

    The MsSQL offline store provides support for reading MsSQL Sources. Specifically, it is developed to read from Synapse SQL on Microsoft Azure.

    • Entity dataframes can be provided as a SQL query or can be provided as a Pandas dataframe.


    feature_store.yaml (example configuration for the Spark offline store)
    project: my_project
    registry: data/registry.db
    provider: local
    offline_store:
        type: spark
        spark_conf:
            spark.master: "local[*]"
            spark.ui.enabled: "false"
            spark.eventLog.enabled: "false"
            spark.sql.catalogImplementation: "hive"
            spark.sql.parser.quotedRegexColumnNames: "true"
            spark.sql.session.timeZone: "UTC"
            spark.sql.execution.arrow.fallback.enabled: "true"
            spark.sql.execution.arrow.pyspark.enabled: "true"
    online_store:
        path: data/online_store.db
    feature_store.yaml (example configuration for the PostgreSQL offline store)
    project: my_project
    registry: data/registry.db
    provider: local
    offline_store:
      type: postgres
      host: DB_HOST
      port: DB_PORT
      database: DB_NAME
      db_schema: DB_SCHEMA
      user: DB_USERNAME
      password: DB_PASSWORD
      sslmode: verify-ca
      sslkey_path: /path/to/client-key.pem
      sslcert_path: /path/to/client-cert.pem
      sslrootcert_path: /path/to/server-ca.pem
      entity_select_mode: temp_table
    online_store:
        path: data/online_store.db
    Disclaimer

    The Trino offline store does not achieve full test coverage. Please do not assume complete stability.

    Getting started

    In order to use this offline store, you'll need to run pip install 'feast[trino]'. You can then run feast init, then swap out feature_store.yaml with the below example to connect to Trino.

    Example

    The full set of configuration options is available in TrinoOfflineStoreConfig.

    Functionality Matrix

    The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Trino offline store.

    get_historical_features (point-in-time correct join): yes
    pull_latest_from_table_or_query (retrieve latest feature values): yes
    pull_all_from_table_or_query (retrieve a saved dataset): yes
    offline_write_batch (persist dataframes to offline store): no
    write_logged_features (persist logged features to offline store): no

    Below is a matrix indicating which functionality is supported by TrinoRetrievalJob.

    export to dataframe: yes
    export to arrow table: yes
    export to arrow batches: no
    export to SQL: yes
    export to data lake (S3, GCS, etc.): no
    export to data warehouse: no
    export as Spark dataframe: no
    local execution of Python-based on-demand transforms: yes
    remote execution of Python-based on-demand transforms: no
    persist results in the offline store: no
    preview the query plan before execution: yes
    read partitioned data: yes

    To compare this set of functionality against other offline stores, please see the full functionality matrix.

    Getting started

    In order to use this offline store, you'll need to run pip install 'feast[azure]'. You can get started by then following this tutorial.

    Disclaimer

    The MsSQL offline store does not achieve full test coverage. Please do not assume complete stability.

    Example

    Functionality Matrix

    The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the MsSQL offline store.

    get_historical_features (point-in-time correct join): yes
    pull_latest_from_table_or_query (retrieve latest feature values): yes
    pull_all_from_table_or_query (retrieve a saved dataset): yes
    offline_write_batch (persist dataframes to offline store): no
    write_logged_features (persist logged features to offline store): no

    Below is a matrix indicating which functionality is supported by MsSqlServerRetrievalJob.

    export to dataframe: yes
    export to arrow table: yes
    export to arrow batches: no
    export to SQL: no
    export to data lake (S3, GCS, etc.): no
    export to data warehouse: no
    local execution of Python-based on-demand transforms: no
    remote execution of Python-based on-demand transforms: no
    persist results in the offline store: yes

    To compare this set of functionality against other offline stores, please see the full functionality matrix.

    feature_store.yaml (example configuration for the Trino offline store)
    project: feature_repo
    project_description: This Feast project is a Trino Offline Store demo.
    provider: local
    registry: data/registry.db
    offline_store:
    	type: trino
    	host: ${TRINO_HOST}
    	port: ${TRINO_PORT}
    	http-scheme: http
    	ssl-verify: false
    	catalog: hive
    	dataset: ${DATASET_NAME}
        # Hive connection as example
    	connector:
    		type: hive
    		file_format: parquet
    	user: trino
    		# Enables authentication in Trino connections, pick the one you need
        auth:
            # Basic Auth
            type: basic
            config:
                username: ${TRINO_USER}
                password: ${TRINO_PWD}
    
            # Certificate
            type: certificate
            config:
                cert-file: /path/to/cert/file
                key-file: /path/to/key/file
    
            # JWT
            type: jwt
            config:
                token: ${JWT_TOKEN}
    
            # OAuth2 (no config required)
            type: oauth2
    
            # Kerberos
            type: kerberos
            config:
                config-file: /path/to/kerberos/config/file
                service-name: foo
                mutual-authentication: true
                force-preemptive: true
                hostname-override: custom-hostname
                sanitize-mutual-error-response: true
                principal: principal-name
                delegate: true
                ca_bundle: /path/to/ca/bundle/file
    online_store:
    	path: data/online_store.db
    # Prevents "Unsupported Hive type: timestamp(3) with time zone" TrinoUserError
    coerce_tz_aware: false
    entity_key_serialization_version: 3
    auth:
    	type: no_auth
    feature_store.yaml (example configuration for the MsSQL / Azure Synapse offline store)
    registry:
      registry_store_type: AzureRegistryStore
      path: ${REGISTRY_PATH} # Environment Variable
    project: production
    provider: azure
    online_store:
        type: redis
        connection_string: ${REDIS_CONN} # Environment Variable
    offline_store:
        type: mssql
        connection_string: ${SQL_CONN}  # Environment Variable


    Redshift

    Description

    The Redshift offline store provides support for reading RedshiftSources.

    • All joins happen within Redshift.

    • Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be uploaded to Redshift temporarily in order to complete join operations.

    Getting started

    In order to use this offline store, you'll need to run pip install 'feast[aws]'. You can get started by then running feast init -t aws.

    Example

    The full set of configuration options is available in RedshiftOfflineStoreConfig.

    Functionality Matrix

    The set of functionality supported by offline stores is described in detail here. Below is a matrix indicating which functionality is supported by the Redshift offline store.

    get_historical_features (point-in-time correct join): yes
    pull_latest_from_table_or_query (retrieve latest feature values): yes
    pull_all_from_table_or_query (retrieve a saved dataset): yes
    offline_write_batch (persist dataframes to offline store): yes
    write_logged_features (persist logged features to offline store): yes

    Below is a matrix indicating which functionality is supported by RedshiftRetrievalJob.

    export to dataframe: yes
    export to arrow table: yes
    export to arrow batches: yes
    export to SQL: yes
    export to data lake (S3, GCS, etc.): no
    export to data warehouse: yes
    export as Spark dataframe: no
    local execution of Python-based on-demand transforms: yes
    remote execution of Python-based on-demand transforms: no
    persist results in the offline store: yes
    preview the query plan before execution: yes
    read partitioned data: yes

    To compare this set of functionality against other offline stores, please see the full functionality matrix.

    Permissions

    Feast requires the following permissions in order to execute commands for the Redshift offline store:

    | Command | Permissions | Resources |
    |---|---|---|
    | Apply | redshift-data:DescribeTable, redshift:GetClusterCredentials | arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
    | Materialize | redshift-data:ExecuteStatement | arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
    | Materialize | redshift-data:DescribeStatement | * |
    | Materialize | s3:ListBucket, s3:GetObject, s3:DeleteObject | arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/* |
    | Get Historical Features | redshift-data:ExecuteStatement, redshift:GetClusterCredentials | arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>, arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>, arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id> |
    | Get Historical Features | redshift-data:DescribeStatement | * |
    | Get Historical Features | s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject | arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/* |

    The following inline policy can be used to grant Feast the necessary permissions:

    In addition to this, the Redshift offline store requires an IAM role that will be used by Redshift itself to interact with S3. More concretely, Redshift has to use this IAM role to run UNLOAD and COPY commands. Once created, this IAM role needs to be configured under the offline_store: iam_role field in the feature_store.yaml file.

    The following inline policy can be used to grant Redshift necessary permissions to access S3:

    The following trust relationship is necessary to make sure that Redshift, and only Redshift, can assume this role:

    Redshift Serverless

    In order to use AWS Redshift Serverless, specify a workgroup instead of a cluster_id and user.

    Please note that the IAM policies above will need the redshift-serverless version, rather than the standard redshift version.


    feature_store.yaml (example configuration for the Redshift offline store)
    project: my_feature_repo
    registry: data/registry.db
    provider: aws
    offline_store:
      type: redshift
      region: us-west-2
      cluster_id: feast-cluster
      database: feast-database
      user: redshift-user
      s3_staging_location: s3://feast-bucket/redshift
      iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role
    Inline policy granting Feast the permissions required for the Redshift offline store:
    {
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:s3:::<bucket_name>/*",
                    "arn:aws:s3:::<bucket_name>"
                ]
            },
            {
                "Action": [
                    "redshift-data:DescribeTable",
                    "redshift:GetClusterCredentials",
                    "redshift-data:ExecuteStatement"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:redshift:<region>:<account_id>:dbuser:<redshift_cluster_id>/<redshift_username>",
                    "arn:aws:redshift:<region>:<account_id>:dbname:<redshift_cluster_id>/<redshift_database_name>",
                    "arn:aws:redshift:<region>:<account_id>:cluster:<redshift_cluster_id>"
                ]
            },
            {
                "Action": [
                    "redshift-data:DescribeStatement"
                ],
                "Effect": "Allow",
                "Resource": "*"
            }
        ],
        "Version": "2012-10-17"
    }
    Inline policy granting Redshift the permissions needed to access S3:
    {
        "Statement": [
            {
                "Action": "s3:*",
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:s3:::feast-int-bucket",
                    "arn:aws:s3:::feast-int-bucket/*"
                ]
            }
        ],
        "Version": "2012-10-17"
    }
    Trust relationship ensuring that only Redshift can assume the IAM role:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    feature_store.yaml (example configuration for Redshift Serverless)
    project: my_feature_repo
    registry: data/registry.db
    provider: aws
    offline_store:
      type: redshift
      region: us-west-2
      workgroup: feast-workgroup
      database: feast-database
      s3_staging_location: s3://feast-bucket/redshift
      iam_role: arn:aws:iam::123456789012:role/redshift_s3_access_role

    Ray (contrib)

    ⚠️ Contrib Plugin: The Ray offline store is a contributed plugin. It may not be as stable or fully supported as core offline stores. Use with caution in production and report issues to the Feast community.

    The Ray offline store is a data I/O implementation that leverages Ray for reading and writing data from various sources. It focuses on efficient data access operations, while complex feature computation is handled by the Ray Compute Engine.

    Overview

    The Ray offline store provides:

    • Ray-based data reading from file sources (Parquet, CSV, etc.)

    • Support for both local and distributed Ray clusters

    • Integration with various storage backends (local files, S3, GCS, HDFS)

    • Efficient data filtering and column selection

    • Timestamp-based data processing with timezone awareness

    Functionality Matrix

    Method support:

    get_historical_features: Yes
    pull_latest_from_table_or_query: Yes
    pull_all_from_table_or_query: Yes
    offline_write_batch: Yes
    write_logged_features: Yes

    RetrievalJob feature support:

    export to dataframe: Yes
    export to arrow table: Yes
    persist results in offline store: Yes
    local execution of ODFVs: Yes
    preview query plan: Yes
    read partitioned data: Yes

    ⚠️ Important: Resource Management

    By default, Ray will use all available system resources (CPU and memory). This can cause issues in test environments or when experimenting locally, potentially leading to system crashes or unresponsiveness.

    For testing and local experimentation, we strongly recommend:

    1. Configure resource limits in your feature_store.yaml (see the Resource Management and Testing section below)

    This will limit Ray to safe resource levels for testing and development.

    Architecture

    The Ray offline store follows Feast's architectural separation:

    • Ray Offline Store: Handles data I/O operations (reading/writing data)

    • Ray Compute Engine: Handles complex feature computation and joins

    • Clear Separation: Each component has a single responsibility

    For complex feature processing, historical feature retrieval, and distributed joins, use the Ray Compute Engine.

    Configuration

    The Ray offline store can be configured in your feature_store.yaml file. Below are two main configuration patterns:

    Basic Ray Offline Store

    For simple data I/O operations without distributed processing:

    Ray Offline Store + Compute Engine

    For distributed feature processing with advanced capabilities:

    Local Development Configuration

    For local development and testing:

    Production Configuration

    For production deployments with distributed Ray cluster:

    Configuration Options

    Ray Offline Store Options

    | Option | Type | Default | Description |
    |---|---|---|---|
    | type | string | Required | Must be feast.offline_stores.contrib.ray_offline_store.ray.RayOfflineStore or ray |
    | storage_path | string | None | Path for storing temporary files and datasets |
    | ray_address | string | None | Address of the Ray cluster (e.g., "localhost:10001") |
    | ray_conf | dict | None | Ray initialization parameters for resource management (e.g., memory, CPU limits) |

    Ray Compute Engine Options

    For Ray compute engine configuration options, see the Ray Compute Engine documentation.

    Resource Management and Testing

    Overview

    By default, Ray will use all available system resources (CPU and memory). This can cause issues in test environments or when experimenting locally, potentially leading to system crashes or unresponsiveness.

    Resource Configuration

    For custom resource control, configure limits in your feature_store.yaml:

    Conservative Settings (Local Development/Testing)

    Production Settings

    Resource Configuration Options

    | Setting | Default | Description | Testing Recommendation |
    |---|---|---|---|
    | broadcast_join_threshold_mb | 100 | Size threshold for broadcast joins (MB) | 25 |
    | max_parallelism_multiplier | 2 | Parallelism as multiple of CPU cores | 1 |
    | target_partition_size_mb | 64 | Target partition size (MB) | 16 |
    | enable_ray_logging | false | Enable Ray progress bars and logging | false |

    Environment-Specific Recommendations

    Local Development

    Production Clusters

    Usage Examples

    Basic Data Source Reading

    Direct Data Access

    The Ray offline store provides direct access to underlying data:

    Batch Writing

    The Ray offline store supports batch writing for materialization:

    Saved Dataset Persistence

    The Ray offline store supports persisting datasets for later analysis:

    Remote Storage Support

    The Ray offline store supports various remote storage backends:

    Using Ray Cluster

    To use Ray in cluster mode for distributed data access:

    1. Start a Ray cluster:

    2. Configure your feature_store.yaml:

    3. For multiple worker nodes:

    Data Source Validation

    The Ray offline store validates data sources to ensure compatibility:

    Limitations

    The Ray offline store has the following limitations:

    1. File Sources Only: Currently supports only FileSource data sources

    2. No Direct SQL: Does not support SQL query interfaces

    3. No Online Writes: Cannot write directly to online stores

    4. No Complex Transformations: complex feature computation is handled by the Ray Compute Engine

    Integration with Ray Compute Engine

    For complex feature processing operations, use the Ray offline store in combination with the Ray Compute Engine. See the Ray Offline Store + Compute Engine configuration example in the Configuration section above for a complete setup.

    For more advanced troubleshooting, refer to the Ray documentation.

    Quick Reference

    Configuration Templates

    Basic Ray Offline Store (local development):

    Ray Offline Store + Compute Engine (distributed processing):

    Key Commands

    For complete examples, see the Configuration section above.

    Note: The Ray offline store focuses on data I/O operations. For complex feature transformations (aggregations, joins, custom UDFs), use the Ray Compute Engine instead.

    The configuration files and code snippets referenced in the sections above follow below.
    project: my_project
    registry: data/registry.db
    provider: local
    offline_store:
        type: ray
        storage_path: data/ray_storage        # Optional: Path for storing datasets
        ray_address: localhost:10001          # Optional: Ray cluster address
    project: my_project
    registry: data/registry.db
    provider: local
    
    # Ray offline store for data I/O operations
    offline_store:
        type: ray
        storage_path: s3://my-bucket/feast-data    # Optional: Path for storing datasets
        ray_address: localhost:10001               # Optional: Ray cluster address
    
    # Ray compute engine for distributed feature processing
    batch_engine:
        type: ray.engine
        
        # Resource configuration
        max_workers: 8                             # Maximum number of Ray workers
        max_parallelism_multiplier: 2              # Parallelism as multiple of CPU cores
        
        # Performance optimization
        enable_optimization: true                  # Enable performance optimizations
        broadcast_join_threshold_mb: 100           # Broadcast join threshold (MB)
        target_partition_size_mb: 64               # Target partition size (MB)
        
        # Distributed join configuration
        window_size_for_joins: "1H"                # Time window for distributed joins
        enable_distributed_joins: true            # Enable distributed joins
        
        # Ray cluster configuration (optional)
        ray_address: localhost:10001               # Ray cluster address
        staging_location: s3://my-bucket/staging   # Remote staging location
    project: my_local_project
    registry: data/registry.db
    provider: local
    
    offline_store:
        type: ray
        storage_path: ./data/ray_storage
        # Conservative settings for local development
        broadcast_join_threshold_mb: 25
        max_parallelism_multiplier: 1
        target_partition_size_mb: 16
        enable_ray_logging: false
        # Memory constraints to prevent OOM in test/development environments
        ray_conf:
            num_cpus: 1
            object_store_memory: 104857600  # 100MB
            _memory: 524288000              # 500MB
    
    batch_engine:
        type: ray.engine
        max_workers: 2
        enable_optimization: false
    project: my_production_project
    registry: s3://my-bucket/registry.db
    provider: local
    
    offline_store:
        type: ray
        storage_path: s3://my-production-bucket/feast-data
        ray_address: "ray://production-head-node:10001"
    
    batch_engine:
        type: ray.engine
        max_workers: 32
        max_parallelism_multiplier: 4
        enable_optimization: true
        broadcast_join_threshold_mb: 50
        target_partition_size_mb: 128
        window_size_for_joins: "30min"
        ray_address: "ray://production-head-node:10001"
        staging_location: s3://my-production-bucket/staging
    offline_store:
        type: ray
        storage_path: ./data/ray_storage
        # Resource optimization settings
        broadcast_join_threshold_mb: 25        # Smaller datasets for broadcast joins
        max_parallelism_multiplier: 1          # Reduced parallelism  
        target_partition_size_mb: 16           # Smaller partition sizes
        enable_ray_logging: false              # Disable verbose logging
        # Memory constraints to prevent OOM in test environments
        ray_conf:
            num_cpus: 1
            object_store_memory: 104857600      # 100MB
            _memory: 524288000                  # 500MB
    offline_store:
        type: ray
        storage_path: s3://my-bucket/feast-data
        ray_address: "ray://production-cluster:10001"
        # Optimized for production workloads
        broadcast_join_threshold_mb: 100
        max_parallelism_multiplier: 2
        target_partition_size_mb: 64
        enable_ray_logging: true
    # feature_store.yaml
    offline_store:
        type: ray
        broadcast_join_threshold_mb: 25
        max_parallelism_multiplier: 1
        target_partition_size_mb: 16
    # feature_store.yaml  
    offline_store:
        type: ray
        ray_address: "ray://cluster-head:10001"
        broadcast_join_threshold_mb: 200
        max_parallelism_multiplier: 4
    from feast import FeatureStore, FeatureView, FileSource
    from feast.types import Float32, Int64
    from datetime import timedelta
    
    # Define a feature view
    driver_stats = FeatureView(
        name="driver_stats",
        entities=["driver_id"],
        ttl=timedelta(days=1),
        source=FileSource(
            path="data/driver_stats.parquet",
            timestamp_field="event_timestamp",
        ),
        schema=[
            ("driver_id", Int64),
            ("avg_daily_trips", Float32),
        ],
    )
    
    # Initialize feature store
    store = FeatureStore("feature_store.yaml")
    
    # The Ray offline store handles data I/O operations
    # For complex feature computation, use Ray Compute Engine
    from feast.infra.offline_stores.contrib.ray_offline_store.ray import RayOfflineStore
    from datetime import datetime, timedelta
    
    # Pull latest data from a table
    job = RayOfflineStore.pull_latest_from_table_or_query(
        config=store.config,
        data_source=driver_stats.source,
        join_key_columns=["driver_id"],
        feature_name_columns=["avg_daily_trips"],
        timestamp_field="event_timestamp",
        created_timestamp_column=None,
        start_date=datetime.now() - timedelta(days=7),
        end_date=datetime.now(),
    )
    
    # Convert to pandas DataFrame
    df = job.to_df()
    print(f"Retrieved {len(df)} rows")
    
    # Convert to Arrow Table
    arrow_table = job.to_arrow()
    
    # Get Ray dataset directly
    ray_dataset = job.to_ray_dataset()
    import pyarrow as pa
    from feast import FeatureView
    
    # Create sample data
    data = pa.table({
        "driver_id": [1, 2, 3, 4, 5],
        "avg_daily_trips": [10.5, 15.2, 8.7, 12.3, 9.8],
        "event_timestamp": [datetime.now()] * 5
    })
    
    # Write batch data
    RayOfflineStore.offline_write_batch(
        config=store.config,
        feature_view=driver_stats,
        table=data,
        progress=lambda x: print(f"Wrote {x} rows")
    )
    from feast.infra.offline_stores.file_source import SavedDatasetFileStorage
    
    # Create storage destination
    storage = SavedDatasetFileStorage(path="data/training_dataset.parquet")
    
    # Persist the dataset
    job.persist(storage, allow_overwrite=False)
    
    # Create a saved dataset in the registry
    saved_dataset = store.create_saved_dataset(
        from_=job,
        name="driver_training_dataset",
        storage=storage,
        tags={"purpose": "data_access", "version": "v1"}
    )
    
    print(f"Saved dataset created: {saved_dataset.name}")
    # S3 storage
    s3_storage = SavedDatasetFileStorage(path="s3://my-bucket/datasets/driver_features.parquet")
    job.persist(s3_storage, allow_overwrite=True)
    
    # Google Cloud Storage
    gcs_storage = SavedDatasetFileStorage(path="gs://my-project-bucket/datasets/driver_features.parquet")
    job.persist(gcs_storage, allow_overwrite=True)
    
    # HDFS
    hdfs_storage = SavedDatasetFileStorage(path="hdfs://namenode:8020/datasets/driver_features.parquet")
    job.persist(hdfs_storage, allow_overwrite=True)
    ray start --head --port=10001
    offline_store:
        type: ray
        ray_address: localhost:10001
        storage_path: s3://my-bucket/features
    # On worker nodes
    ray start --address='head-node-ip:10001'
    from feast.infra.offline_stores.contrib.ray_offline_store.ray import RayOfflineStore
    
    # Validate a data source
    try:
        RayOfflineStore.validate_data_source(store.config, driver_stats.source)
        print("Data source is valid")
    except Exception as e:
        print(f"Data source validation failed: {e}")
    offline_store:
        type: ray
        storage_path: ./data/ray_storage
        # Conservative settings for local development
        broadcast_join_threshold_mb: 25
        max_parallelism_multiplier: 1
        target_partition_size_mb: 16
        enable_ray_logging: false
    offline_store:
        type: ray
        storage_path: s3://my-bucket/feast-data
        
    batch_engine:
        type: ray.engine
        max_workers: 8
        enable_optimization: true
        broadcast_join_threshold_mb: 100
    # Initialize feature store
    store = FeatureStore("feature_store.yaml")
    
    # Get historical features (uses compute engine if configured)
    features = store.get_historical_features(entity_df=df, features=["fv:feature"])
    
    # Direct data access (uses offline store)
    job = RayOfflineStore.pull_latest_from_table_or_query(...)
    df = job.to_df()
    
    # Offline write batch (materialization)
    # Create sample data for materialization
    data = pa.table({
        "driver_id": [1, 2, 3, 4, 5],
        "avg_daily_trips": [10.5, 15.2, 8.7, 12.3, 9.8],
        "event_timestamp": [datetime.now()] * 5
    })
    
    # Write batch to offline store
    RayOfflineStore.offline_write_batch(
        config=store.config,
        feature_view=driver_stats_fv,
        table=data,
        progress=lambda rows: print(f"Processed {rows} rows")
    )