Tags: google-bigquery, google-cloud-dataflow, google-cloud-dataprep

Add dataset parameters into column to use them in BigQuery later with DataPrep


I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and storing the results in Google BigQuery tables. The structure on GCS looks something like this:

//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv

We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.

However, in the resulting BigQuery table (the output), I have no way of selecting only the rows that match a given parameter value, for instance a particular month.

How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
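For illustration, the goal is essentially to recover a month value from each source filename so it can be used as a filter column. A minimal Python sketch, assuming the date-prefixed filenames shown above (the function name is hypothetical):

```python
import re

def month_of(filename):
    """Extract the YYYY-MM prefix from a date-prefixed filename,
    or return None if the name doesn't match that pattern."""
    match = re.match(r"(\d{4}-\d{2})-\d{2}-", filename)
    return match.group(1) if match else None

# Filenames following the naming pattern shown above.
filenames = [
    "2017-01-31-file.csv",
    "2017-02-28-file.csv",
    "2017-03-31-file.csv",
]
months = [month_of(f) for f in filenames]
print(months)  # ['2017-01', '2017-02', '2017-03']
```

With a column like this in the BigQuery output, rows could then be filtered on `month` directly.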


Solution

  • While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.

    In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.

    There are a number of caveats here: it does not return a value for BigQuery sources, and it is not available if you pivot, join, or unnest. But in your scenario, you could easily bring it into a column and use it for any needed matching or dropping.

    NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).

    Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
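Once `$filepath` is materialized as a column, the month can be recovered from it downstream. A minimal Python sketch of that extraction, assuming the `{month}` directory component is formatted `YYYY-MM` (the path value and function name are illustrative):

```python
import re

def month_from_path(path):
    """Pull a YYYY-MM directory component out of a $filepath-style value,
    or return None if no such component is present."""
    match = re.search(r"/(\d{4}-\d{2})/", path)
    return match.group(1) if match else None

# Hypothetical $filepath value following the GCS layout in the question.
path = "/source/user/me/datasets/2017-01/2017-01-31-file.csv"
print(month_from_path(path))  # 2017-01
```

The same pattern could be applied inside BigQuery itself (e.g. with a regular-expression extraction over the `$filepath` column) rather than in client code.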