I am brand new to GCP and Cloud Data Fusion. I see that you can use this service to integrate data across data sources into a data lake.
I have a number of SFTP providers offering files in different structured formats, e.g. CSV, JSON, Parquet, and Avro.
Ultimately I'd like this data to be available in BQ.
Before loading to BQ, my first stop was going to be Google Cloud Storage, so that I have an immutable copy of the data.
The SFTP site will have multiple files representing multiple tables:
/root/table_1
/root/table_2
/root/table_3
I'm first trying to see if I can use a Cloud Data Fusion pipeline to copy the files from SFTP to GCS. This has proven to be challenging.
Please consider that Cloud Data Fusion is an ETL (Extract-Transform-Load) tool: its pipelines operate on the data inside the files, not on the files as such, so it's difficult to define a simple pipeline that just uploads files to Cloud Storage.
To work around this, I thought of using the JavaScript transform to call the client libraries; however, it does not allow importing dependencies, so you can't use the client libraries, and authenticating against the services would be complicated.
Likewise, since the files come in different formats, you would have to define a transformation to separate and handle the files depending on their type.
On the other hand, I understand that your usage scenario is:
1. Get the files from the SFTP server.
2. Keep an immutable copy of them in Cloud Storage.
3. Load the data into BigQuery.
Based on this, I consider a better option to be an orchestration tool like Cloud Composer (managed Apache Airflow).
Airflow uses a DAG (Directed Acyclic Graph) as a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. In your case, the DAG would have these tasks (see the sketch after this list):
1. Get the files from the SFTP server and upload them to Cloud Storage.
2. Load the data from Cloud Storage into BigQuery.
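As a rough illustration, here is a minimal sketch of such a DAG for one table, assuming Airflow 2.x with the apache-airflow-providers-google package installed. The bucket, dataset, connection ID, and file pattern are placeholder values you would replace with your own:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Placeholder names -- replace with your own bucket and dataset.
GCS_BUCKET = "my-landing-bucket"
BQ_DATASET = "my_dataset"

with DAG(
    dag_id="sftp_to_gcs_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Task 1: pull the files from the SFTP server into Cloud Storage
    # (this is your immutable copy of the raw data).
    sftp_to_gcs = SFTPToGCSOperator(
        task_id="sftp_to_gcs_table_1",
        sftp_conn_id="sftp_default",        # Airflow connection to the SFTP provider (assumed name)
        source_path="/root/table_1/*.csv",  # a single wildcard is supported
        destination_bucket=GCS_BUCKET,
        destination_path="raw/table_1/",
    )

    # Task 2: load the staged files from Cloud Storage into BigQuery.
    gcs_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bq_table_1",
        bucket=GCS_BUCKET,
        source_objects=["raw/table_1/*.csv"],
        destination_project_dataset_table=f"{BQ_DATASET}.table_1",
        source_format="CSV",                # also accepts NEWLINE_DELIMITED_JSON, PARQUET, AVRO
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    sftp_to_gcs >> gcs_to_bq
```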
Another benefit is that you no longer need to worry about the file type, since all the formats you mentioned (CSV, JSON, Parquet, and Avro) are supported by BigQuery load jobs, so the tables can be created directly from the files.
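Because each table may arrive in a different format, the same two-task pattern can be generated in a loop inside the DAG above, mapping each SFTP folder to the matching BigQuery source_format. The mapping below is purely illustrative:

```python
# Illustrative mapping: SFTP folder -> (file extension, BigQuery source format).
TABLES = {
    "table_1": ("csv", "CSV"),
    "table_2": ("json", "NEWLINE_DELIMITED_JSON"),
    "table_3": ("parquet", "PARQUET"),
}

# Place this loop inside the `with DAG(...)` block from the previous sketch.
for table, (ext, bq_format) in TABLES.items():
    pull = SFTPToGCSOperator(
        task_id=f"sftp_to_gcs_{table}",
        sftp_conn_id="sftp_default",
        source_path=f"/root/{table}/*.{ext}",
        destination_bucket=GCS_BUCKET,
        destination_path=f"raw/{table}/",
    )
    load = GCSToBigQueryOperator(
        task_id=f"gcs_to_bq_{table}",
        bucket=GCS_BUCKET,
        source_objects=[f"raw/{table}/*.{ext}"],
        destination_project_dataset_table=f"{BQ_DATASET}.{table}",
        source_format=bq_format,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    pull >> load
```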