python google-cloud-platform google-bigquery dml

Add a column derived from another column using Python and spaCy


I have a BigQuery data warehouse containing all the data from a MongoDB database. The data is synced once a day.

I would like to add a column to one of my tables. The new column is a cleaned and lemmatized version of another column (of type string). I can't do that with dbt because I need to use the Python library spaCy. How can I run such a transformation on my table without pulling all the data locally and sending 10M UPDATE statements to BigQuery? Is there a GCP tool, such as Dataflow, that can run a Python function against BigQuery?

More generally, how do you transform data when tools like dbt are not enough?

Thanks for your help!


Solution

  • You can try Dataflow batch processing for this requirement. Dataflow is a fully managed service, so it can run the transformation on your table without downloading the data locally, and the spaCy library can be used inside Dataflow pipelines. Although BigQuery and Dataflow are managed services that can process large amounts of data, it is good practice to split large NLP jobs into smaller ones, as discussed here.

    Note: Since you want to add a column that is a lemmatized and cleaned version of an existing column, it is better to write the result to a new destination table than to update the existing rows in place.
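As a rough illustration of the approach described above, here is a minimal sketch of an Apache Beam pipeline (the SDK that Dataflow runs) that reads a table, adds a cleaned and lemmatized column, and writes to a new destination table. The column names (`description`, `description_lemmatized`), the cleaning rules, the schema string, and the `en_core_web_sm` model are all assumptions made for the example, not details from the question. Adapt them to your table.

```python
import re


def clean_text(text):
    """Lowercase, drop URLs and punctuation, collapse whitespace (assumed rules)."""
    text = (text or "").lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def lemmatize_row(row, nlp, source_col="description",
                  target_col="description_lemmatized"):
    """Return a copy of the row dict with an extra lemmatized column.

    `nlp` is any callable returning tokens that expose `lemma_` and
    `is_space`, such as a loaded spaCy pipeline. Column names are hypothetical.
    """
    doc = nlp(clean_text(row.get(source_col)))
    lemmas = [tok.lemma_ for tok in doc if not tok.is_space]
    return {**row, target_col: " ".join(lemmas)}


def run(source_table, dest_table, dest_schema, pipeline_args=None):
    """Build and run the pipeline. Tables use the 'project:dataset.table' form."""
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class LemmatizeFn(beam.DoFn):
        def setup(self):
            # Load the model once per worker instead of once per row.
            # Assumes spaCy and the model are installed on the workers.
            import spacy
            self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

        def process(self, row):
            yield lemmatize_row(row, self.nlp)

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Rows arrive as Python dicts keyed by column name.
            | "Read" >> beam.io.ReadFromBigQuery(table=source_table)
            | "Lemmatize" >> beam.ParDo(LemmatizeFn())
            | "Write" >> beam.io.WriteToBigQuery(
                dest_table,
                schema=dest_schema,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )
```

A call might look like `run("my-project:my_dataset.items", "my-project:my_dataset.items_lemmatized", "id:STRING,description:STRING,description_lemmatized:STRING", ["--runner=DataflowRunner", "--project=my-project", "--region=europe-west1", "--temp_location=gs://my-bucket/tmp"])`, where every project, dataset, bucket, and column name is a placeholder. On Dataflow you would likely also need `--save_main_session` (or to package the helpers as a module) and a `--requirements_file` or custom container so that the workers have spaCy and the model available.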