I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in this BigQuery table (output), I have no means of extracting only rows with for instance a parameter month
in it.
How could I therefore add these Dataset Parameters (here: {month}
) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.
In addition to SOURCEROWNUMBER()
(which can now also be expressed as $sourcerownumber
), there's now also a source metadata reference called $filepath
—which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot
, join
, or unnest
. . . but in your scenario, you could easily bring it into a column and do any needed matching or dropping using it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148