Dataset

Feast datasets allow for conveniently saving dataframes that include both features and entities to be subsequently used for data analysis and model training. Data Quality Monitoring was the primary motivation for creating the dataset concept.

The dataset's metadata is stored in the Feast registry, and the raw data (features, entities, additional input keys, and timestamps) is stored in the offline store.

A dataset can be created from:

  1. Results of historical retrieval

  2. [planned] Logging requests (including inputs for on demand transformations) and responses during feature serving

  3. [planned] Logging features while writing to the online store (from a batch source or stream)

Creating a Saved Dataset from Historical Retrieval

To create a saved dataset from historical features for later retrieval or analysis, a user first calls the get_historical_features method and then passes the returned retrieval job to the create_saved_dataset method. create_saved_dataset triggers the provided retrieval job (by calling .persist() on it) to store the data using the specified storage. The storage type must match the globally configured offline store (e.g., it is impossible to persist data to Redshift when BigQuery is configured). create_saved_dataset also creates a SavedDataset object with all related metadata and writes it to the registry.
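
Since the example below persists to BigQuery, the feature repository must already be configured with BigQuery as its offline store. A minimal sketch of such a feature_store.yaml, with placeholder project and registry values:

project: my_project
registry: gs://my-bucket/registry.db
provider: gcp
offline_store:
  type: bigquery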

from feast import FeatureStore
from feast.infra.offline_stores.bigquery_source import SavedDatasetBigQueryStorage

store = FeatureStore(repo_path=".")  # path to the feature repository

# Retrieve historical features; the entity dataframe supplies the
# entity keys and event timestamps for the point-in-time join.
historical_job = store.get_historical_features(
    features=["driver:avg_trip"],
    entity_df=...,
)

# Persist the retrieval job's result to BigQuery and write the
# dataset's metadata to the registry.
dataset = store.create_saved_dataset(
    from_=historical_job,
    name='my_training_dataset',
    storage=SavedDatasetBigQueryStorage(table_ref='<gcp-project>.<gcp-dataset>.my_training_dataset'),
    tags={'author': 'oleksii'}
)

# Load the saved dataset as a pandas DataFrame.
dataset.to_df()

The saved dataset can later be retrieved using the get_saved_dataset method:

dataset = store.get_saved_dataset('my_training_dataset')
dataset.to_df()
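
The retrieved dataframe can then be used for analysis or model training. A minimal sketch assuming scikit-learn, where the label column is hypothetical (it would need to have been supplied through entity_df during retrieval):

# Hypothetical training sketch: "avg_trip" is the feature retrieved above;
# "label" is an assumed target column supplied via the entity dataframe.
from sklearn.linear_model import LinearRegression

df = dataset.to_df()
model = LinearRegression()
model.fit(df[["avg_trip"]], df["label"])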

Check out our tutorial on validating historical features to see how this concept can be applied in a real-world use case.
