Is there a way to create a column with filename of the source that created each row ?
Use-Case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation of the files contained in a specific GCS bucket.
I've looked at the "metadata article" on GCP but it is pretty useless for my use-case.
UPDATED: I have opened a feature request with Google.
While they haven't closed that issue yet, this was part of the update last week.
There's now a source metadata reference called $filepath
—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps.
There are some caveats, such as it not returning a value for BigQuery sources and not persisting through pivot
, join
, or unnest
. . . but it covers the vast majority of use cases handily, and in other cases you just need to materialize it before some of those destructive transforms.
NOTE: If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148