Search code examples
google-cloud-platformgoogle-cloud-dataprep

DataPrep: access to source filename


Is there a way to create a column with filename of the source that created each row ?

Use-Case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation of the files contained in a specific GCS bucket.

I've looked at the "metadata article" on GCP but it is pretty useless for my use-case.

UPDATED: I have opened a feature request with Google.


Solution

  • While they haven't closed that issue yet, this was part of the update last week.

    There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps.

    There are some caveats, such as it not returning a value for BigQuery sources and not persisting through pivot, join, or unnest . . . but it covers the vast majority of use cases handily, and in other cases you just need to materialize it before some of those destructive transforms.

    NOTE: If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface (instead of just NULL values).

    Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148