I have a large number of JSON files in Google Cloud Storage that I would like to load into BigQuery. The average file size is 5 MB, uncompressed. The problem is that they are not newline-delimited, so I can't load them into BigQuery as is. What's my best approach here? Should I use Cloud Functions or Dataprep, or just spin up a server to download each file, reformat it, and upload it back to Cloud Storage and then into BigQuery?
Do not compress the data before loading it into BigQuery. Also, 5 MB is small for BigQuery; since you are processing each JSON file anyway, I would look at consolidating files and possibly changing the file format at the same time.
You can use Dataprep, Dataflow, or even Dataproc. Depending on how many files you have, one of these may be the best choice: anything beyond, say, 100,000 files of 5 MB each calls for one of these distributed systems running many nodes.
Cloud Functions would take too long for anything more than a few thousand files.
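For the Dataflow route, a minimal Apache Beam sketch of the conversion could look like the following. The input pattern, table name, and schema are placeholders for illustration, and it assumes each file holds either a single JSON object or an array of objects.

```python
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values -- replace with your own bucket pattern, table, and schema.
INPUT_PATTERN = "gs://my-bucket/raw-json/*.json"
TABLE = "my-project:my_dataset.my_table"
SCHEMA = "id:STRING,payload:STRING"  # hypothetical schema

def to_rows(readable_file):
    """Parse one whole JSON file into individual row dictionaries."""
    data = json.loads(readable_file.read_utf8())
    return data if isinstance(data, list) else [data]

def run():
    # Pass --runner=DataflowRunner, --project, --region, etc. on the command line.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "MatchFiles" >> fileio.MatchFiles(INPUT_PATTERN)
            | "ReadFiles" >> fileio.ReadMatches()
            | "ParseJson" >> beam.FlatMap(to_rows)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema=SCHEMA,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```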
Another option is to write a simple Python program that preprocesses your files on Cloud Storage and loads them directly into BigQuery. This is only 20 or 30 lines of code unless you add consolidation. A 5 MB file should take about 500 ms to download, process, and write back; I am not sure about the BigQuery load time. For 50,000 files of 5 MB each, expect 12 to 24 hours for a single thread on a large Compute Engine instance (you need high network bandwidth).
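A minimal sketch of that script, assuming each file contains either a JSON array of records or a single JSON object; the bucket, prefixes, and table ID are placeholders:

```python
import json
from google.cloud import bigquery, storage

# Hypothetical names: adjust the bucket, prefixes, and table to your project.
BUCKET = "my-bucket"
SRC_PREFIX = "raw-json/"
DST_PREFIX = "ndjson/"
TABLE_ID = "my-project.my_dataset.my_table"

storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET)

# Step 1: rewrite each file as newline-delimited JSON.
for blob in storage_client.list_blobs(BUCKET, prefix=SRC_PREFIX):
    records = json.loads(blob.download_as_text())
    if isinstance(records, dict):
        records = [records]  # file holds a single object rather than an array
    ndjson = "\n".join(json.dumps(r) for r in records)
    bucket.blob(DST_PREFIX + blob.name[len(SRC_PREFIX):]).upload_from_string(
        ndjson, content_type="application/json"
    )

# Step 2: load everything with a single wildcard URI
# (avoids running one load job per file).
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = bq_client.load_table_from_uri(
    f"gs://{BUCKET}/{DST_PREFIX}*", TABLE_ID, job_config=job_config
)
load_job.result()  # wait for the load to finish
```

Doing one wildcard load at the end, instead of one load job per file, keeps you well under BigQuery's per-table load-job quota.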
Another option is to spin up multiple Compute Engine instances. One instance publishes the names of N files (something like 4 or 16) per message to Pub/Sub; the other instances subscribe to the same topic and process the files in parallel. Again, this is only about another 100 lines of code.
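A sketch of that fan-out using the Pub/Sub client library; the project, topic, subscription, and bucket names are placeholders, and `process_file` stands in for the rewrite-and-load step from the earlier sketch:

```python
import json
from google.cloud import pubsub_v1, storage

# Hypothetical names: adjust the project, topic, subscription, and bucket.
PROJECT = "my-project"
TOPIC = "json-files"
SUBSCRIPTION = "json-files-workers"
BUCKET = "my-bucket"
BATCH_SIZE = 16  # N file names per message

def publish_batches():
    """Run once on the coordinating instance: fan file names out to Pub/Sub."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)
    batch = []
    for blob in storage.Client().list_blobs(BUCKET, prefix="raw-json/"):
        batch.append(blob.name)
        if len(batch) == BATCH_SIZE:
            publisher.publish(topic_path, json.dumps(batch).encode("utf-8"))
            batch = []
    if batch:
        publisher.publish(topic_path, json.dumps(batch).encode("utf-8"))

def run_worker():
    """Run on each worker instance: pull batches and process them."""
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

    def callback(message):
        for name in json.loads(message.data.decode("utf-8")):
            process_file(name)  # hypothetical: the rewrite-and-load step above
        message.ack()

    subscriber.subscribe(sub_path, callback=callback).result()
```

Because each message carries a small batch of file names, adding more worker instances scales the throughput roughly linearly until you hit network or BigQuery limits.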
If your project consists of many millions of files, network bandwidth and compute time will be an issue unless time is not a factor.