Ray (contrib)
⚠️ Contrib Plugin: The Ray offline store is a contributed plugin. It may not be as stable or fully supported as core offline stores. Use with caution in production and report issues to the Feast community.
The Ray offline store is a data I/O implementation that leverages Ray for reading and writing data from various sources. It focuses on efficient data access operations, while complex feature computation is handled by the Ray Compute Engine.
Quick Start with Ray Template
The easiest way to get started with Ray offline store is to use the built-in Ray template:
```bash
feast init -t ray my_ray_project
cd my_ray_project/feature_repo
```

This template includes:
Pre-configured Ray offline store and compute engine setup
Sample feature definitions optimized for Ray processing
Demo workflow showcasing Ray capabilities
Resource settings for local development
The template provides a complete working example with sample datasets and demonstrates both Ray offline store data I/O operations and Ray compute engine distributed processing.
Overview
The Ray offline store provides:
Ray-based data reading from file sources (Parquet, CSV, etc.)
Support for local, remote, and KubeRay (Kubernetes-managed) clusters
Integration with various storage backends (local files, S3, GCS, HDFS)
Efficient data filtering and column selection
Timestamp-based data processing with timezone awareness
Enterprise-ready KubeRay cluster support via CodeFlare SDK
Functionality Matrix
| Functionality | Supported |
| --- | --- |
| get_historical_features | Yes |
| pull_latest_from_table_or_query | Yes |
| pull_all_from_table_or_query | Yes |
| offline_write_batch | Yes |
| write_logged_features | Yes |
| export to dataframe | Yes |
| export to arrow table | Yes |
| persist results in offline store | Yes |
| local execution of ODFVs | Yes |
| preview query plan | Yes |
| read partitioned data | Yes |
⚠️ Important: Resource Management
By default, Ray will use all available system resources (CPU and memory). This can cause issues in test environments or when experimenting locally, potentially leading to system crashes or unresponsiveness.
For testing and local experimentation, we strongly recommend:
Configure resource limits in your feature_store.yaml (see the Resource Management and Testing section below).
This will limit Ray to safe resource levels for testing and development.
Architecture
The Ray offline store follows Feast's architectural separation:
Ray Offline Store: Handles data I/O operations (reading/writing data)
Ray Compute Engine: Handles complex feature computation and joins
Clear Separation: Each component has a single responsibility
For complex feature processing, historical feature retrieval, and distributed joins, use the Ray Compute Engine.
Configuration
The Ray offline store can be configured in your feature_store.yaml file. It supports three execution modes:
LOCAL: Ray runs locally on the same machine (default)
REMOTE: Connects to a remote Ray cluster via ray_address
KUBERAY: Connects to Ray clusters on Kubernetes via the CodeFlare SDK
Execution Modes
Local Mode (Default)
For simple data I/O operations without distributed processing:
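A minimal feature_store.yaml for local mode might look like the following sketch; the project name, registry path, online store, and storage_path are placeholders:

```yaml
project: my_ray_project
registry: data/registry.db
provider: local
offline_store:
  type: ray
  storage_path: data/ray_storage   # optional staging location for temporary files
online_store:
  type: sqlite
  path: data/online_store.db
```

With no ray_address, use_kuberay, or kuberay_conf set, the store runs in LOCAL mode.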
Remote Ray Cluster
Connect to an existing Ray cluster:
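A sketch of a remote-cluster configuration, assuming a Ray head node reachable at a placeholder hostname; the storage path should point at storage visible to both client and cluster:

```yaml
offline_store:
  type: ray
  ray_address: ray://ray-head.example.com:10001   # triggers REMOTE mode
  storage_path: s3://my-bucket/feast-staging      # shared staging location
```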
KubeRay Cluster (Kubernetes)
Connect to Ray clusters on Kubernetes using CodeFlare SDK:
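A hedged example based on the kuberay_conf keys listed under Configuration Options below; the cluster name, namespace, and auth values are placeholders:

```yaml
offline_store:
  type: ray
  use_kuberay: true
  kuberay_conf:
    cluster_name: my-raycluster   # required
    namespace: feast              # defaults to "default"
    # auth_token: <token>         # optional, for authenticated clusters
    # auth_server: <auth server>  # optional
    skip_tls: false
```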
Environment Variables (alternative to config file):
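Only FEAST_RAY_USE_KUBERAY and FEAST_RAY_CLUSTER_NAME are named in the mode detection precedence below; a minimal sketch using those two (values are placeholders):

```bash
export FEAST_RAY_USE_KUBERAY=true
export FEAST_RAY_CLUSTER_NAME=my-raycluster
```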
Ray Offline Store + Compute Engine
For distributed feature processing with advanced capabilities:
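A sketch combining the two components. The batch_engine type string below is an assumption, so confirm the exact identifier against the Ray Compute Engine documentation:

```yaml
offline_store:
  type: ray
  storage_path: data/ray_storage
batch_engine:
  type: ray.engine   # assumed identifier; see the Ray Compute Engine docs
```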
Local Development Configuration
For local development and testing:
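A possible local development setup using the conservative values from the resource table below; the ray_conf keys shown are standard ray.init() parameters, and the exact limits should be tuned to your machine:

```yaml
offline_store:
  type: ray
  storage_path: data/ray_storage
  broadcast_join_threshold_mb: 25
  max_parallelism_multiplier: 1
  target_partition_size_mb: 16
  enable_ray_logging: false
  ray_conf:
    num_cpus: 2                       # cap Ray's CPU usage
    object_store_memory: 1073741824   # ~1 GB object store
```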
Production Configuration
For production deployments with distributed Ray cluster:
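One way a production configuration might look, assuming a dedicated Ray cluster and shared object storage; the address, bucket, and limits are placeholders:

```yaml
offline_store:
  type: ray
  ray_address: ray://ray-head.prod.internal:10001
  storage_path: s3://feast-prod/ray-staging
  enable_distributed_joins: true
  broadcast_join_threshold_mb: 100
  max_parallelism_multiplier: 2
  target_partition_size_mb: 64
```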
Configuration Options
Ray Offline Store Options
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | Required | Must be feast.offline_stores.contrib.ray_offline_store.ray.RayOfflineStore or ray |
| storage_path | string | None | Path for storing temporary files and datasets |
| ray_address | string | None | Ray cluster address (triggers REMOTE mode, e.g., "ray://host:10001") |
| use_kuberay | boolean | None | Enable KubeRay mode (overrides ray_address) |
| kuberay_conf | dict | None | KubeRay configuration dict with keys: cluster_name (required), namespace (default: "default"), auth_token, auth_server, skip_tls (default: false) |
| enable_ray_logging | boolean | false | Enable Ray progress bars and verbose logging |
| ray_conf | dict | None | Ray initialization parameters for resource management (e.g., memory, CPU limits) |
| broadcast_join_threshold_mb | int | 100 | Size threshold for broadcast joins (MB) |
| enable_distributed_joins | boolean | true | Enable distributed joins for large datasets |
| max_parallelism_multiplier | int | 2 | Parallelism as a multiple of CPU cores |
| target_partition_size_mb | int | 64 | Target partition size (MB) |
| window_size_for_joins | string | "1H" | Time window for distributed joins |
Mode Detection Precedence
The Ray offline store automatically detects the execution mode using the following precedence:
1. Environment variables (highest priority): FEAST_RAY_USE_KUBERAY, FEAST_RAY_CLUSTER_NAME, etc.
2. kuberay_conf in config: if present → KubeRay mode
3. ray_address in config: if present → Remote mode
4. Default: Local mode (lowest priority)
Ray Compute Engine Options
For Ray compute engine configuration options, see the Ray Compute Engine documentation.
Resource Management and Testing
Overview
By default, Ray will use all available system resources (CPU and memory). This can cause issues in test environments or when experimenting locally, potentially leading to system crashes or unresponsiveness.
Resource Configuration
For custom resource control, configure limits in your feature_store.yaml:
Conservative Settings (Local Development/Testing)
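A sketch of conservative offline store settings matching the testing recommendations in the table below:

```yaml
offline_store:
  type: ray
  broadcast_join_threshold_mb: 25
  max_parallelism_multiplier: 1
  target_partition_size_mb: 16
  enable_ray_logging: false
```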
Production Settings
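And a production-oriented variant using the documented defaults (or larger values, depending on cluster size); the cluster address is a placeholder:

```yaml
offline_store:
  type: ray
  ray_address: ray://ray-head.prod.internal:10001
  broadcast_join_threshold_mb: 100
  max_parallelism_multiplier: 2
  target_partition_size_mb: 64
```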
Resource Configuration Options
| Option | Default | Description | Recommended for testing |
| --- | --- | --- | --- |
| broadcast_join_threshold_mb | 100 | Size threshold for broadcast joins (MB) | 25 |
| max_parallelism_multiplier | 2 | Parallelism as a multiple of CPU cores | 1 |
| target_partition_size_mb | 64 | Target partition size (MB) | 16 |
| enable_ray_logging | false | Enable Ray progress bars and logging | false |
Environment-Specific Recommendations
Local Development: start from the conservative settings shown above and keep Ray logging disabled.
Production Clusters: start from the production settings shown above and scale partition size and parallelism to the cluster.
Usage Examples
Basic Data Source Reading
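A minimal sketch using the standard Feast SDK against a repo configured with the Ray offline store; the entity column, feature view, and feature names are placeholders:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity dataframe with join keys and event timestamps
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [datetime(2024, 1, 1), datetime(2024, 1, 2)],
    }
)

# Point-in-time correct retrieval backed by the Ray offline store
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)
df = job.to_df()
```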
Direct Data Access
The Ray offline store provides direct access to underlying data:
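The retrieval job returned above exposes the export methods listed in the functionality matrix; a short sketch:

```python
# Export the results of a retrieval job in different formats
df = job.to_df()        # pandas DataFrame
table = job.to_arrow()  # pyarrow Table
```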
Batch Writing
The Ray offline store supports batch writing for materialization:
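A hedged example of writing a batch of rows back to the offline store through the standard Feast API; the feature view name and columns are placeholders:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

batch_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "conv_rate": [0.85],
        "event_timestamp": [datetime(2024, 1, 3)],
        "created": [datetime(2024, 1, 3)],
    }
)

# Append the batch to the feature view's offline data
store.write_to_offline_store("driver_hourly_stats", batch_df)
```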
Saved Dataset Persistence
The Ray offline store supports persisting datasets for later analysis:
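A sketch using Feast's saved dataset API with file-based storage; the dataset name, feature references, and output path are placeholders:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")
entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": [datetime(2024, 1, 1)]}
)
job = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],
)

# Persist the retrieval results so they can be reloaded later
dataset = store.create_saved_dataset(
    from_=job,
    name="driver_training_set",
    storage=SavedDatasetFileStorage(path="data/driver_training_set.parquet"),
)
```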
Remote Storage Support
The Ray offline store supports various remote storage backends:
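For example, a FileSource can point at object storage directly, and storage_path can likewise live on a remote backend. The bucket name below is a placeholder, and credentials are assumed to come from the usual cloud environment settings:

```python
from feast import FileSource

driver_stats_s3 = FileSource(
    name="driver_hourly_stats_s3",
    path="s3://my-bucket/feast/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)
```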
Using Ray Cluster
Standard Ray Cluster
To use Ray in cluster mode for distributed data access:
Start a Ray cluster:
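For example, on the head node (6379 is Ray's default GCS port):

```bash
ray start --head --port=6379
```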
Configure your
feature_store.yaml:
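A sketch pointing the offline store at the cluster's Ray Client endpoint; the host is a placeholder, and 10001 is the default Ray Client port:

```yaml
offline_store:
  type: ray
  ray_address: ray://<head-node-ip>:10001
```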
For multiple worker nodes:
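Each additional worker joins the head node with, for example:

```bash
ray start --address='<head-node-ip>:6379'
```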
KubeRay Cluster (Kubernetes)
To use Feast with Ray clusters on Kubernetes via CodeFlare SDK:
Prerequisites:
KubeRay cluster deployed on Kubernetes
CodeFlare SDK installed:
pip install codeflare-sdkAccess credentials for the Kubernetes cluster
Configuration:
Using configuration file:
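The same kuberay_conf structure as in the Execution Modes section; the cluster name and namespace are placeholders:

```yaml
offline_store:
  type: ray
  use_kuberay: true
  kuberay_conf:
    cluster_name: my-raycluster
    namespace: feast
    skip_tls: false
```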
Using environment variables:
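Again using the two variables named in the mode detection precedence (any additional variables covered by "etc." there are not listed in this document):

```bash
export FEAST_RAY_USE_KUBERAY=true
export FEAST_RAY_CLUSTER_NAME=my-raycluster
```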
Features:
The CodeFlare SDK handles cluster connection and authentication
Automatic TLS certificate management
Authentication with Kubernetes clusters
Namespace isolation
Secure communication between client and Ray cluster
Automatic cluster discovery
Data Source Validation
The Ray offline store validates data sources to ensure compatibility:
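For instance, a FileSource passes validation, while non-file sources are rejected (see Limitations below); a minimal sketch with placeholder names:

```python
from feast import FileSource

# Accepted: the Ray offline store currently works with FileSource only
valid_source = FileSource(
    name="driver_hourly_stats",
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# Other source types (e.g. SQL-backed sources) fail validation when the
# repo is applied against the Ray offline store.
```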
Limitations
The Ray offline store has the following limitations:
File Sources Only: Currently supports only FileSource data sources
No Direct SQL: Does not support SQL query interfaces
No Online Writes: Cannot write directly to online stores
No Complex Transformations: The Ray offline store focuses on data I/O operations. For complex feature transformations (aggregations, joins, custom UDFs), use the Ray Compute Engine instead
Integration with Ray Compute Engine
For complex feature processing operations, use the Ray offline store in combination with the Ray Compute Engine. See the Ray Offline Store + Compute Engine configuration example in the Configuration section above for a complete setup.
For more advanced troubleshooting, refer to the Ray documentation.
Quick Reference
Configuration Templates
Basic Ray Offline Store (local development):
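A minimal local template (the same structure as the Local Mode example above; project and paths are placeholders):

```yaml
project: my_ray_project
registry: data/registry.db
provider: local
offline_store:
  type: ray
online_store:
  type: sqlite
  path: data/online_store.db
```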
Ray Offline Store + Compute Engine (distributed processing):
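And the combined variant; as above, the batch_engine type string is an assumption, so confirm it against the Ray Compute Engine documentation:

```yaml
offline_store:
  type: ray
  storage_path: data/ray_storage
batch_engine:
  type: ray.engine   # assumed identifier
```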
Key Commands
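The usual Feast CLI workflow applies; for example:

```bash
feast init -t ray my_ray_project   # scaffold the Ray template
cd my_ray_project/feature_repo
feast apply                        # register feature definitions
feast materialize 2024-01-01T00:00:00 2024-01-02T00:00:00   # load features to the online store
```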
For complete examples, see the Configuration section above.