In this tutorial, we use the public Chicago taxi trips dataset to demonstrate the data validation capabilities of Feast.
The original dataset is stored in BigQuery and consists of raw data for each taxi trip (one row per trip) since 2013.
We will generate several training datasets (historical features, in Feast terms) for different periods and evaluate expectations derived from one dataset against another.
Types of features we're ingesting and generating:
Features that aggregate raw data over daily intervals (e.g., trips per day, or average fare or speed for a specific day).
Features computed in SQL while pulling data from BigQuery (such as total trip time or total miles travelled).
Features calculated on the fly when requested, using Feast's on-demand transformations.
Our plan:
Prepare environment
Pull data from BigQuery (optional)
Declare & apply features and feature views in Feast
Generate reference dataset
Develop & test profiler function
Run validation on different dataset using reference dataset & profiler
The original notebook and datasets for this tutorial can be found on GitHub.
Install the Feast Python SDK and Great Expectations:
You can skip this step if you don't have a GCP account. In that case, use the parquet files that come with this tutorial instead.
Running some basic aggregations while pulling data from BigQuery. Grouping by taxi_id and day:
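As a rough illustration of this aggregation, the sketch below groups raw trips by taxi_id and day using pandas rather than BigQuery SQL. The sample rows and the raw column names (`trip_start_timestamp`, `trip_miles`, `trip_seconds`, `fare`) are assumptions made for this example; the output column names follow the saved dataset shown in this tutorial.

```python
import pandas as pd

# Hypothetical raw trips; the raw column names are assumptions for this sketch.
trips = pd.DataFrame(
    {
        "taxi_id": ["a", "a", "a", "b"],
        "trip_start_timestamp": pd.to_datetime(
            [
                "2019-06-01 08:00",
                "2019-06-01 17:30",
                "2019-06-02 09:15",
                "2019-06-01 12:00",
            ],
            utc=True,
        ),
        "trip_miles": [2.0, 3.5, 1.0, 10.0],
        "trip_seconds": [600, 900, 300, 1800],
        "fare": [10.0, 15.0, 6.0, 30.0],
    }
)

# Truncate each trip start to midnight so trips group by calendar day.
trips["event_timestamp"] = trips["trip_start_timestamp"].dt.floor("D")

# One output row per (taxi_id, day) with daily totals and a trip count.
daily_stats = trips.groupby(["taxi_id", "event_timestamp"], as_index=False).agg(
    total_miles_travelled=("trip_miles", "sum"),
    total_trip_seconds=("trip_seconds", "sum"),
    total_earned=("fare", "sum"),
    trip_count=("fare", "count"),
)
print(daily_stats)
```

Doing the same grouping in SQL while pulling the data keeps the transfer small, since only one row per taxi and day leaves BigQuery.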
Read more about feature views in the Feast docs.
Read more about on-demand feature views in the Feast docs.
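To make the on-demand part concrete, here is a minimal sketch of the transformation as a plain pandas function. The formulas are inferred from the sample output in this tutorial (they reproduce the first row of the saved dataset) and are not confirmed by the text. In Feast, such a function would be registered as an on-demand feature view; that registration is omitted here because its arguments differ between Feast versions.

```python
import pandas as pd


def on_demand_stats(inp: pd.DataFrame) -> pd.DataFrame:
    """Derive per-trip averages and hourly rates from daily totals."""
    out = pd.DataFrame()
    out["avg_fare"] = inp["total_earned"] / inp["trip_count"]
    # Miles per second scaled to miles per hour.
    out["avg_speed"] = 3600 * inp["total_miles_travelled"] / inp["total_trip_seconds"]
    out["avg_trip_seconds"] = inp["total_trip_seconds"] / inp["trip_count"]
    # Earnings per second of driving scaled to earnings per hour.
    out["earned_per_hour"] = 3600 * inp["total_earned"] / inp["total_trip_seconds"]
    return out


# Daily totals copied from the first row of the saved dataset in this tutorial.
totals = pd.DataFrame(
    {
        "total_earned": [68.25],
        "total_miles_travelled": [24.70],
        "trip_count": [2.0],
        "total_trip_seconds": [4540.0],
    }
)
derived = on_demand_stats(totals)
print(derived.round(6))
```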
Generating range of timestamps with daily frequency:
A cross merge (i.e., a Cartesian product) produces an entity dataframe with each taxi_id repeated for each timestamp:
156984 rows × 2 columns
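The two steps above (a daily timestamp range, then a cross merge) can be sketched with pandas as follows. The three taxi IDs are placeholders; the date range matches the one visible in the entity dataframe of this tutorial. The 156984 rows reported above are consistent with 5064 taxi IDs × 31 daily timestamps.

```python
import pandas as pd

# Hypothetical entity list; the tutorial uses the real taxi IDs instead.
taxi_ids = pd.DataFrame({"taxi_id": ["taxi_a", "taxi_b", "taxi_c"]})

# One timestamp per day, both endpoints included (31 days here).
timestamps = pd.DataFrame(
    {"event_timestamp": pd.date_range("2019-06-01", "2019-07-01", freq="D")}
)

# how="cross" (pandas >= 1.2) pairs every taxi_id with every timestamp.
entity_df = pd.merge(taxi_ids, timestamps, how="cross")
print(entity_df.shape)  # (93, 2)
```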
Retrieving historical features for the resulting entity dataframe and persisting the output as a saved dataset:
A dataset profiler is a function that accepts a dataset and generates a set of its characteristics. These characteristics are then used to evaluate (validate) subsequent datasets.
Important: datasets are not compared to each other! Feast uses a reference dataset and a profiler function to generate a reference profile. This profile is then used during validation of the tested dataset.
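The distinction between comparing datasets and validating against a profile can be illustrated with a small conceptual sketch in plain Python. It does not use the Feast or Great Expectations APIs; `mean_profiler`, `validate`, and all numbers are hypothetical and exist only for this illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple

Dataset = Dict[str, List[float]]
Profile = Dict[str, Tuple[float, float]]  # column -> allowed range for its mean


def mean_profiler(ds: Dataset, tolerance: float = 0.1) -> Profile:
    """Describe a dataset as an allowed range around each column mean.

    Assumes non-negative means, so that low <= high.
    """
    profile = {}
    for column, values in ds.items():
        m = mean(values)
        profile[column] = (m * (1 - tolerance), m * (1 + tolerance))
    return profile


def validate(ds: Dataset, profile: Profile) -> List[str]:
    """Return the columns of `ds` whose mean falls outside the profile."""
    return [
        column
        for column, (low, high) in profile.items()
        if not low <= mean(ds[column]) <= high
    ]


# Hypothetical numbers, for illustration only.
reference = {"trip_count": [10.0, 12.0, 14.0], "avg_fare": [9.0, 10.0, 11.0]}
tested = {"trip_count": [6.0, 7.0, 8.0], "avg_fare": [9.5, 10.0, 10.5]}

# The profile is built once from the reference dataset...
reference_profile = mean_profiler(reference)
# ...and the tested dataset is checked against the profile only.
print(validate(reference, reference_profile))  # []
print(validate(tested, reference_profile))  # ['trip_count']
```

Because only the profile is carried forward, the reference dataset itself is not needed at validation time once the profile has been generated.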
Loading saved dataset first and exploring the data:
156984 rows × 10 columns
Feast uses Great Expectations as its validation engine and an ExpectationSuite as a dataset's profile. Hence, we need to develop a function that generates an ExpectationSuite. This function receives an instance of PandasDataset (a wrapper around pandas.DataFrame), so we can use both the pandas DataFrame API and helper functions from PandasDataset during profiling.
Testing our profiler function:
Verify that all expectations coded in our profiler are present here. If some are missing, they failed on the reference dataset: by default, Great Expectations drops failed expectations silently.
Now we can create a validation reference from the dataset and the profiler function:
and test it against our existing retrieval job:
Validation passed, since no exception was raised.
Creating new timestamps for Dec 2020:
35448 rows × 2 columns
Execute retrieval job with validation reference:
Validation failed since several expectations didn't pass:
Trip count (mean) decreased by more than 10% (which is expected when comparing Dec 2020 vs June 2019)
Average fare increased: all quantiles are higher than expected
Earnings per hour (mean) increased by more than 10% (most likely due to the increased fare)
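The quantile failure listed above can be pictured with a small conceptual sketch in plain Python. It does not call Great Expectations; the fares, the 10% tolerance, and the choice of quartiles are assumptions for illustration. Quartile ranges are derived from a reference sample, then the quartiles of a tested sample with uniformly higher fares are checked against them.

```python
from statistics import quantiles

# Hypothetical fares, for illustration only.
reference_fares = [5.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0]
tested_fares = [fare * 1.5 for fare in reference_fares]  # uniformly higher fares

# Profile: each quartile of the reference data, widened by 10% either way.
expected_ranges = [(q * 0.9, q * 1.1) for q in quantiles(reference_fares, n=4)]

# Validation: compare quartiles of the tested data with the expected ranges.
failed = [
    (observed, allowed)
    for observed, allowed in zip(quantiles(tested_fares, n=4), expected_ranges)
    if not allowed[0] <= observed <= allowed[1]
]
print(len(failed))  # 3: every quartile is above its allowed range
```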
Entity dataframe for 2019-06-01 to 2019-07-01:

| | taxi_id | event_timestamp |
|---|---|---|
| 0 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-01 |
| 1 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-02 |
| 2 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-03 |
| 3 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-04 |
| 4 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2019-06-05 |
| ... | ... | ... |
| 156979 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-27 |
| 156980 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-28 |
| 156981 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-29 |
| 156982 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-06-30 |
| 156983 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2019-07-01 |

Saved dataset with historical features for the same period:

| | total_earned | avg_trip_seconds | taxi_id | total_miles_travelled | trip_count | earned_per_hour | event_timestamp | total_trip_seconds | avg_fare | avg_speed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.25 | 2270.000000 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 24.70 | 2.0 | 54.118943 | 2019-06-01 00:00:00+00:00 | 4540.0 | 34.125000 | 19.585903 |
| 1 | 221.00 | 560.500000 | 7a4a6162eaf27805aef407d25d5cb21fe779cd962922cb... | 54.18 | 24.0 | 59.143622 | 2019-06-01 00:00:00+00:00 | 13452.0 | 9.208333 | 14.499554 |
| 2 | 160.50 | 1010.769231 | f4c9d05b215d7cbd08eca76252dae51cdb7aca9651d4ef... | 41.30 | 13.0 | 43.972603 | 2019-06-01 00:00:00+00:00 | 13140.0 | 12.346154 | 11.315068 |
| 3 | 183.75 | 697.550000 | c1f533318f8480a59173a9728ea0248c0d3eb187f4b897... | 37.30 | 20.0 | 47.415956 | 2019-06-01 00:00:00+00:00 | 13951.0 | 9.187500 | 9.625116 |
| 4 | 217.75 | 1054.076923 | 455b6b5cae6ca5a17cddd251485f2266d13d6a2c92f07c... | 69.69 | 13.0 | 57.206451 | 2019-06-01 00:00:00+00:00 | 13703.0 | 16.750000 | 18.308692 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 156979 | 38.00 | 1980.000000 | 0cccf0ec1f46d1e0beefcfdeaf5188d67e170cdff92618... | 14.90 | 1.0 | 69.090909 | 2019-07-01 00:00:00+00:00 | 1980.0 | 38.000000 | 27.090909 |
| 156980 | 135.00 | 551.250000 | beefd3462e3f5a8e854942a2796876f6db73ebbd25b435... | 28.40 | 16.0 | 55.102041 | 2019-07-01 00:00:00+00:00 | 8820.0 | 8.437500 | 11.591837 |
| 156981 | NaN | NaN | 9a3c52aa112f46cf0d129fafbd42051b0fb9b0ff8dcb0e... | NaN | NaN | NaN | 2019-07-01 00:00:00+00:00 | NaN | NaN | NaN |
| 156982 | 63.00 | 815.000000 | 08308c31cd99f495dea73ca276d19a6258d7b4c9c88e43... | 19.96 | 4.0 | 69.570552 | 2019-07-01 00:00:00+00:00 | 3260.0 | 15.750000 | 22.041718 |
| 156983 | NaN | NaN | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | NaN | NaN | NaN | 2019-07-01 00:00:00+00:00 | NaN | NaN | NaN |

Entity dataframe for December 2020:

| | taxi_id | event_timestamp |
|---|---|---|
| 0 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-01 |
| 1 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-02 |
| 2 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-03 |
| 3 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-04 |
| 4 | 91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d... | 2020-12-05 |
| ... | ... | ... |
| 35443 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-03 |
| 35444 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-04 |
| 35445 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-05 |
| 35446 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-06 |
| 35447 | 7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf... | 2020-12-07 |