Bytewax

Description

The Bytewax batch materialization engine provides an execution engine for batch materialization operations (materialize and materialize-incremental).

Guide

To use the Bytewax materialization engine, you need a Kubernetes cluster running version 1.22.10 or later.
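
As a rough illustration of the version requirement, the sketch below compares a server version string (such as the one reported by kubectl version) against the 1.22.10 minimum. The helper names parse_k8s_version and meets_minimum are hypothetical and are not part of Feast or Bytewax.

```python
import re

# Minimum Kubernetes version stated in this guide.
MIN_VERSION = (1, 22, 10)


def parse_k8s_version(git_version: str) -> tuple:
    """Parse strings such as 'v1.27.3' or 'v1.24.10-eks-48e63af' into (1, 27, 3)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(git_version: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True if the version is at least the minimum (numeric comparison)."""
    return parse_k8s_version(git_version) >= minimum
```

Comparing integer tuples rather than raw strings avoids the pitfall where "1.9.0" sorts after "1.22.10" lexically.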

Kubernetes Authentication

The Bytewax materialization engine loads authentication and cluster information from the kubeconfig file. By default, kubectl looks for a file named config in the $HOME/.kube directory. You can specify other kubeconfig files by setting the KUBECONFIG environment variable.
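
The lookup order described above can be sketched as follows. This is only an approximation of kubectl's documented behavior (KUBECONFIG may hold several paths joined by the platform path separator); the function name kubeconfig_candidates is hypothetical, and how the engine itself treats multiple entries is not covered here.

```python
import os
from pathlib import Path


def kubeconfig_candidates(environ=None) -> list:
    """Return the kubeconfig paths that would be consulted, in order."""
    env = os.environ if environ is None else environ
    raw = env.get("KUBECONFIG", "")
    # KUBECONFIG can list several files separated by os.pathsep (':' on Linux/macOS).
    paths = [Path(p) for p in raw.split(os.pathsep) if p]
    if paths:
        return paths
    # Fall back to the default location: $HOME/.kube/config.
    home = env.get("HOME") or str(Path.home())
    return [Path(home) / ".kube" / "config"]
```

Calling kubeconfig_candidates() before running a materialization job is a quick way to confirm which file the current shell environment points at.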

Resource Authentication

Bytewax jobs can read Kubernetes secrets as environment variables, which lets them authenticate to the online and offline stores during job runs.

To configure secrets, first create them using kubectl:

kubectl create secret generic -n bytewax aws-credentials \
  --from-literal=aws-access-key-id='<access key id>' \
  --from-literal=aws-secret-access-key='<secret access key>'

Then configure them in the batch_engine section of feature_store.yaml:

batch_engine:
  type: bytewax
  namespace: bytewax
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-secret-access-key
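
When the configuration is assembled in Python rather than YAML, the same env entries can be generated with a small helper. This is a minimal sketch that assumes the batch_engine section is passed as a plain dictionary mirroring the YAML above; secret_env is a hypothetical helper, not a Feast function.

```python
def secret_env(var_name: str, secret_name: str, key: str) -> dict:
    """Build one env entry that maps a Kubernetes secret key to an environment variable."""
    return {
        "name": var_name,
        "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
    }


# Mirrors the batch_engine YAML section shown above.
batch_engine = {
    "type": "bytewax",
    "namespace": "bytewax",
    "env": [
        secret_env("AWS_ACCESS_KEY_ID", "aws-credentials", "aws-access-key-id"),
        secret_env("AWS_SECRET_ACCESS_KEY", "aws-credentials", "aws-secret-access-key"),
    ],
}
```

The secret name and keys must match those used in the kubectl create secret command, since the job resolves them at run time.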

Configuration

The Bytewax materialization engine is configured through the feature_store.yaml configuration file:

batch_engine:
  type: bytewax
  namespace: bytewax
  image: bytewax/bytewax-feast:latest

The namespace configuration directive specifies the Kubernetes namespace in which jobs, services, and configuration maps are created.

Building a custom Bytewax Docker image

The image configuration directive specifies which container image to use when running the materialization job. To build a custom image based on this container, run the following command from the root of the Feast repository:

DOCKER_BUILDKIT=1 docker build . -f ./sdk/python/feast/infra/materialization/contrib/bytewax/Dockerfile -t <image tag>

Once that image is built and pushed to a registry, it can be specified as a part of the batch engine configuration:

batch_engine:
  type: bytewax
  namespace: bytewax
  image: <image tag>

AWS Lambda (alpha)

Description

The AWS Lambda batch materialization engine is in alpha status. It relies on the offline store to output feature values to S3 via to_remote_storage, and then loads them into the online store.

See LambdaMaterializationEngineConfig for configuration options.

See also the Dockerfile, which builds an image that can be used as the materialization_image in the example below.

Example

feature_store.yaml
...
offline_store:
  type: snowflake.offline
...
batch_engine:
  type: lambda
  lambda_role: [your iam role]
  materialization_image: [image uri of above Docker image]

Spark (contrib)

Description

The Spark batch materialization engine is in alpha status. It relies on the offline store to output feature values to S3 via to_remote_storage, and then loads them into the online store.

See SparkMaterializationEngine for configuration options.

Example

feature_store.yaml
...
offline_store:
  type: snowflake.offline
...
batch_engine:
  type: spark.engine
  partitions: [optional num partitions to use to write to online store]

Example in Python

feature_store.py
from feast import FeatureStore, RepoConfig
from feast.repo_config import RegistryConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.infra.offline_stores.contrib.spark_offline_store.spark import SparkOfflineStoreConfig

repo_config = RepoConfig(
    registry="s3://[YOUR_BUCKET]/feast-registry.db",
    project="feast_repo",
    provider="aws",
    offline_store=SparkOfflineStoreConfig(
      spark_conf={
        "spark.ui.enabled": "false",
        "spark.eventLog.enabled": "false",
        "spark.sql.catalogImplementation": "hive",
        "spark.sql.parser.quotedRegexColumnNames": "true",
        "spark.sql.session.timeZone": "UTC"
      }
    ),
    batch_engine={
      "type": "spark.engine",
      "partitions": 10
    },
    online_store=DynamoDBOnlineStoreConfig(region="us-west-1"),
    entity_key_serialization_version=2
)

store = FeatureStore(config=repo_config)
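
Once the store is constructed, materialization is triggered through the store object and executed by the configured batch engine. The sketch below wraps FeatureStore.materialize_incremental in a hypothetical helper, run_incremental, that accepts any object exposing that method, so it can be exercised without a live cluster.

```python
from datetime import datetime, timezone


def run_incremental(store, now=None):
    """Materialize all feature views up to `now` (defaults to the current UTC time)."""
    end_date = now or datetime.now(timezone.utc)
    # Delegates to the batch engine configured in repo_config (spark.engine above).
    store.materialize_incremental(end_date=end_date)
    return end_date
```

With the store from the example above, calling run_incremental(store) would materialize everything since the last recorded run.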

Batch Materialization Engines

Please see Batch Materialization Engine for an explanation of batch materialization engines.

Snowflake, Bytewax, AWS Lambda (alpha), Spark (contrib)

Snowflake

Description

The Snowflake batch materialization engine provides a highly scalable, parallel execution engine that uses a Snowflake Warehouse for batch materialization operations (materialize and materialize-incremental) when using a SnowflakeSource.

The engine requires no configuration beyond Snowflake's standard login and context details. It uses custom Python UDFs, deployed automatically for you, to serialize your offline store data into your online serving tables.

Using all three options together (snowflake.offline, snowflake.engine, and snowflake.online) keeps offline storage, materialization, and online serving within Snowflake, so each stage benefits from its scale, performance, governance, and data security.

Example

feature_store.yaml
...
offline_store:
  type: snowflake.offline
...
batch_engine:
  type: snowflake.engine
  account: snowflake_deployment.us-east-1
  user: user_login
  password: user_password
  role: sysadmin
  warehouse: demo_wh
  database: FEAST