
TensorFlow Data Validation - How to return the rows with anomalies


TensorFlow Data Validation (TFDV) provides a way to find anomalies in your data.

However, I can only find a way to get a summarized version of the anomalies (using tfdv.validate_statistics and tfdv.display_anomalies).

Is there some functionality, or a parameter I can pass, so that instead of reporting a summary it returns the rows containing the anomaly along with the anomaly type?

Following the example below:

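import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto import schema_pb2

# df is assumed to be an existing pandas DataFrame with an integer column "c1" and a string column "c2".

# Compute summary statistics and infer a schema from them.
df_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=df_stats)

# Constrain c1 to the range [1, 3], then validate the statistics against the schema.
tfdv.set_domain(schema, "c1", schema_pb2.IntDomain(min=1, max=3))
anomalies = tfdv.validate_statistics(statistics=df_stats, schema=schema)
tfdv.display_anomalies(anomalies)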

Is there a way to leverage TFDV to return something like:

index  c1      c2  anomaly_type
3      100     Z   c1 Out-of-range values
4      100000  A   c1 Out-of-range values

If not, what alternative would you recommend?


Solution

  • No, you cannot. That is because TFDV validates the statistics, not the actual data. For the c1 column, tfdv.validate_statistics compares the min and max values found in the statistics with the min and max values set in the schema. This implies:

    • TFDV is unaware of any other values that are out of range (e.g. 100), since only the extreme values are compared.
    • TFDV cannot return the index of the rows where the anomaly was detected, since the statistics do not contain this information.

    See the anomalies reference for more details: https://www.tensorflow.org/tfx/data_validation/anomalies?hl=en
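
As for the alternative asked about in the question, one option is to check the rows directly in pandas rather than through the statistics. The following is a minimal sketch, not a TFDV feature: the helper rows_out_of_range and the sample DataFrame are hypothetical (only rows 3 and 4 mirror the desired output in the question), and the bounds 1 and 3 simply repeat the IntDomain from the question.

```python
import pandas as pd

# Hypothetical sample data; only rows 3 and 4 mirror the question's desired output.
df = pd.DataFrame(
    {"c1": [1, 2, 3, 100, 100000], "c2": ["A", "B", "C", "Z", "A"]}
)


def rows_out_of_range(frame, column, min_value, max_value):
    """Return the rows whose `column` value lies outside [min_value, max_value].

    Hypothetical helper, not part of TFDV. The bounds are inclusive,
    like the min/max of the IntDomain used in the question.
    """
    # between() is inclusive on both ends by default; negate it to keep the violations.
    mask = ~frame[column].between(min_value, max_value)
    flagged = frame.loc[mask].copy()
    # Label each flagged row in the style of the output requested in the question.
    flagged["anomaly_type"] = f"{column} Out-of-range values"
    return flagged


anomalous_rows = rows_out_of_range(df, "c1", min_value=1, max_value=3)
print(anomalous_rows)
```

Unlike the statistics-based check, this keeps the original index, so every offending row is reported rather than only the extremes. Note that missing values in the column are flagged as well, because between evaluates to False for NaN; filter them out first if they should be reported as a different anomaly type.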