Statistics

Data is a first-class citizen in machine learning projects, so it is critical to have tests and validations around it. To that end, Feast exposes a range of feature statistics that give users visibility into the data that has been ingested into the system.

Overview

Feast exposes feature statistics at two points in the Feast system:

  1. Inflight feature statistics from the population job

  2. Historical feature statistics from the warehouse stores

Historical Feature Statistics

Feast supports the computation of feature statistics over data already written to warehouse stores. These feature statistics, which can be retrieved over distinct sets of historical data, are fully compatible with TFX's Data Validation (TFDV).

Retrieving Statistics

Statistics can be retrieved from Feast using the Python SDK's get_statistics method. This requires a connection to Feast Core.

Feature statistics can be retrieved for a single feature set, from a single valid warehouse store. Users can either retrieve feature statistics for a discrete subset of data by providing an ingestion_id, a unique id generated for a dataset when it is ingested into Feast:

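# Assumes `client` is a Feast client connected to Feast Core, and that
# `feature_set` and the DataFrame `df` are already defined.
# A unique ingestion id is returned for each batch ingestion
ingestion_id = client.ingest(feature_set, df)
stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    ingestion_ids=[ingestion_id])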

Or they can select data within a time range by providing a start_date and end_date (the start date is inclusive, the end date is exclusive):

from datetime import datetime

start_date = datetime(2020, 10, 1, 0, 0, 0)
end_date = datetime(2020, 10, 2, 0, 0, 0)
stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    start_date=start_date,
    end_date=end_date)

Although get_statistics accepts Python datetime objects for start_date and end_date, statistics are computed at day granularity.
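Because statistics are computed per day and the end date is exclusive, it can help to check which calendar days a given range covers before sending the request. The sketch below is illustrative only: the helper names floor_to_day and covered_days are hypothetical and not part of the Feast SDK, and flooring to midnight is an assumption about how sub-day components would be treated, not documented Feast behaviour.

```python
from datetime import date, datetime, timedelta
from typing import List


def floor_to_day(moment: datetime) -> datetime:
    """Drop the time-of-day component (assumed rounding, not confirmed by Feast)."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def covered_days(start_date: datetime, end_date: datetime) -> List[date]:
    """List the days in the half-open range [start_date, end_date)."""
    day = floor_to_day(start_date)
    end = floor_to_day(end_date)
    days = []
    while day < end:
        days.append(day.date())
        day += timedelta(days=1)
    return days


# The range used in the example above covers exactly one day.
single_day = covered_days(datetime(2020, 10, 1), datetime(2020, 10, 2))
# Sub-day components do not add a partial day under this assumption.
two_days = covered_days(datetime(2020, 10, 1, 13, 30), datetime(2020, 10, 3, 8, 0))
```

Passing datetimes that are already aligned to midnight avoids any ambiguity about how the time of day is handled.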

Note that when providing a time range, Feast will NOT filter out duplicated rows. It is therefore highly recommended to provide ingestion_ids whenever possible.
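To see why unfiltered duplicates matter, consider a row that lands in the warehouse twice within the queried time range. The following standalone sketch uses made-up rows and plain Python (no Feast calls) to show how a single duplicated row shifts the mean.

```python
import statistics

# Hypothetical (entity, feature value) rows, for illustration only.
rows = [("driver_1", 10.0), ("driver_2", 20.0), ("driver_3", 60.0)]

# Simulate the same row being written twice within the time range.
rows_with_duplicate = rows + [("driver_3", 60.0)]


def mean_of(feature_rows):
    """Mean of the feature values, ignoring the entity key."""
    return statistics.fmean(value for _, value in feature_rows)


clean_mean = mean_of(rows)
skewed_mean = mean_of(rows_with_duplicate)

# Removing exact duplicates restores the original statistic.
deduplicated = list(dict.fromkeys(rows_with_duplicate))
restored_mean = mean_of(deduplicated)
```

Here the clean mean is 30.0 while the mean over the duplicated rows is 37.5. Counts, histograms, and top-value statistics are affected in the same way.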

Feast returns the statistics in the form of the protobuf DatasetFeatureStatisticsList, which can subsequently be passed to TFDV methods to validate the dataset:

import tensorflow_data_validation as tfdv

anomalies = tfdv.validate_statistics(
    statistics=stats, schema=feature_set.export_tfx_schema())
tfdv.display_anomalies(anomalies)

Or visualise the statistics in Facets:

tfdv.visualize_statistics(stats)

Refer to the example notebook for an end-to-end example showcasing Feast's integration with TFDV and Facets.

Aggregating Statistics

Feast supports retrieval of feature statistics across multiple datasets or days.

stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    ingestion_ids=[ingestion_id_1, ingestion_id_2])

However, when querying across multiple datasets, Feast computes the statistics for each dataset independently (for caching purposes) and then aggregates the results. As a result, statistics that cannot be aggregated, such as medians, uniqueness counts, and histograms, are dropped.
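The distinction between aggregatable and non-aggregatable statistics can be illustrated with a small standalone sketch. It is not Feast's implementation: the Summary class and the summarize and merge functions are hypothetical, and the merge uses the standard pairwise formula for combining means and sums of squared deviations. Counts, means, standard deviations, minima, and maxima can be rebuilt exactly from per-dataset summaries, whereas the median cannot.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Summary:
    count: int
    mean: float
    m2: float       # sum of squared deviations from the mean
    minimum: float
    maximum: float


def summarize(values):
    """Compute the per-dataset summary that would be cached."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values)
    return Summary(len(values), mean, m2, min(values), max(values))


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries without revisiting the raw values."""
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta ** 2 * a.count * b.count / count
    return Summary(count, mean, m2,
                   min(a.minimum, b.minimum), max(a.maximum, b.maximum))


dataset_1 = [1.0, 2.0, 3.0]
dataset_2 = [10.0, 20.0, 30.0, 40.0, 50.0]
merged = merge(summarize(dataset_1), summarize(dataset_2))
merged_stdev = math.sqrt(merged.m2 / merged.count)  # population stdev

# The median of the combined data cannot be derived from per-dataset medians.
true_median = statistics.median(dataset_1 + dataset_2)
median_of_medians = statistics.median(
    [statistics.median(dataset_1), statistics.median(dataset_2)])
```

In this example the merged mean (19.5) matches the mean of the combined data, but the true median is 15.0 while the median of the two per-dataset medians is 16.0.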

Refer to the table below for the list of statistics that will be dropped.

Caching

Feast caches the results of all feature statistics requests and will, by default, retrieve and return the cached results. To recompute previously computed feature statistics, set force_refresh to True when retrieving the statistics:

stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    ingestion_ids=[ingestion_id],
    force_refresh=True)

This will force Feast to recompute the statistics, and replace any previously cached values.
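The caching behaviour described here can be modelled with a small in-memory sketch. This is only an illustration of the semantics (return cached results by default, recompute and overwrite on force_refresh); the CachedStatistics class and its cache key are hypothetical and say nothing about how Feast stores cached statistics internally.

```python
from typing import Any, Callable, Dict, Tuple


class CachedStatistics:
    """Toy cache keyed by (ingestion id, feature name)."""

    def __init__(self, compute: Callable[[str, str], Any]):
        self._compute = compute
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get(self, ingestion_id: str, feature: str, force_refresh: bool = False):
        key = (ingestion_id, feature)
        # Recompute on a cache miss, or when the caller forces a refresh.
        if force_refresh or key not in self._cache:
            self._cache[key] = self._compute(ingestion_id, feature)
        return self._cache[key]


computed = []  # records every expensive computation that actually ran


def fake_compute(ingestion_id, feature):
    computed.append((ingestion_id, feature))
    return {"run": len(computed)}


cache = CachedStatistics(fake_compute)
first = cache.get("ingestion_a", "feature_1")
second = cache.get("ingestion_a", "feature_1")            # served from cache
refreshed = cache.get("ingestion_a", "feature_1", force_refresh=True)
after_refresh = cache.get("ingestion_a", "feature_1")     # sees the new value
```

Only two computations run in this example: the initial request and the forced refresh, whose result replaces the previously cached value.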

Supported Statistics

Feast supports most, but not all, of the feature statistics defined in TFX's FeatureNameStatistics. For the definition of each statistic and information about how each one is computed, refer to the protobuf definition.

Type        | Statistic                       | Supported | Aggregateable
Common      | NumNonMissing                   |           |
            | NumMissing                      |           |
            | MinNumValues                    |           |
            | MaxNumValues                    |           |
            | AvgNumValues                    |           |
            | TotalNumValues                  |           |
            | NumValuesHist                   |           |
Numeric     | Min                             |           |
            | Max                             |           |
            | Median                          |           |
            | Mean                            |           |
            | Stdev                           |           |
            | NumZeroes                       |           |
            | Quantiles                       |           |
            | Histogram                       |           |
String      | RankHistogram                   |           |
            | TopValues                       |           |
            | Unique                          |           |
            | AvgLength                       |           |
Bytes       | MinNumBytes                     |           |
            | MaxNumBytes                     |           |
            | AvgNumBytes                     |           |
            | Unique                          |           |
Struct/List | - (uses common statistics only) | -         | -

Inflight Feature Statistics

For insight into data currently flowing into Feast through the population jobs, statsd is used to capture feature value statistics.

Inflight feature statistics are windowed (default window length is 30s) and computed at two points in the feature population pipeline:

  1. Prior to store writes, after successful validation

  2. After successful store writes

The following metrics are written at the end of each window as statsd gauges:

feast_ingestion_feature_value_min
feast_ingestion_feature_value_max
feast_ingestion_feature_value_mean
feast_ingestion_feature_value_percentile_25
feast_ingestion_feature_value_percentile_50
feast_ingestion_feature_value_percentile_90
feast_ingestion_feature_value_percentile_95
feast_ingestion_feature_value_percentile_99

The gauge metric type is used instead of the histogram type because statsd only supports positive values for histogram metrics, while numerical feature values can be any double value.
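As a rough illustration of what one window's gauges contain, the sketch below summarises a list of feature values into the eight metric names listed above. It does not reproduce the population job's code: the window_gauges function is hypothetical, and the percentile interpolation method (Python's "inclusive" quantiles) is an assumption that may differ from what Feast uses. The sample window deliberately includes negative values, which a gauge can carry.

```python
import statistics

METRIC_PREFIX = "feast_ingestion_feature_value_"
PERCENTILES = (25, 50, 90, 95, 99)


def window_gauges(values):
    """Summarise one window of numeric feature values as gauge name -> value."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    gauges = {
        METRIC_PREFIX + "min": min(values),
        METRIC_PREFIX + "max": max(values),
        METRIC_PREFIX + "mean": statistics.fmean(values),
    }
    for p in PERCENTILES:
        gauges[f"{METRIC_PREFIX}percentile_{p}"] = cut_points[p - 1]
    return gauges


# A made-up window of 101 values from -50.0 to 50.0.
window = [float(v) for v in range(-50, 51)]
gauges = window_gauges(window)
```

For this window the minimum is -50.0, the mean and 50th percentile are 0.0, and the 99th percentile is 49.0.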

The metrics are tagged with and can be aggregated by the following keys:

key                   | description
feast_store           | store the population job is writing to
feast_project_name    | feast project name
feast_featureSet_name | feature set name
feast_feature_name    | feature name
ingestion_job_name    | id of the population job writing the feature values
metrics_namespace     | either Inflight or WriteToStoreSuccess
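Aggregation by tag is normally done in the metrics backend, but the idea can be shown in a few lines. In the sketch below, the sample values and the max_by_tags helper are hypothetical. Only the tag keys and the two metrics_namespace values come from the table above. It groups gauge samples for feast_ingestion_feature_value_max by one or more tag keys and keeps the largest value per group.

```python
# Made-up gauge samples for feast_ingestion_feature_value_max.
samples = [
    {"value": 12.0, "tags": {"feast_feature_name": "feature_1",
                             "metrics_namespace": "Inflight"}},
    {"value": 15.0, "tags": {"feast_feature_name": "feature_1",
                             "metrics_namespace": "WriteToStoreSuccess"}},
    {"value": 7.0, "tags": {"feast_feature_name": "feature_2",
                            "metrics_namespace": "Inflight"}},
    {"value": 9.0, "tags": {"feast_feature_name": "feature_1",
                            "metrics_namespace": "Inflight"}},
]


def max_by_tags(gauge_samples, *keys):
    """Group samples by the given tag keys and keep the largest value per group."""
    grouped = {}
    for sample in gauge_samples:
        group = tuple(sample["tags"][key] for key in keys)
        grouped[group] = max(grouped.get(group, float("-inf")), sample["value"])
    return grouped


per_feature = max_by_tags(samples, "feast_feature_name")
per_feature_and_stage = max_by_tags(
    samples, "feast_feature_name", "metrics_namespace")
```

Grouping by metrics_namespace as well makes it possible to compare values seen before store writes (Inflight) with values seen after successful writes (WriteToStoreSuccess).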