Tags: google-cloud-platform, google-cloud-data-fusion

Is it possible to use Cloud Data Fusion for SFTP -> GCS -> BQ?


I am brand new to GCP and Cloud Data Fusion. I see that you can use this service to integrate data across data sources into a data lake.

I have a number of SFTP providers offering files in different structured formats, e.g. CSV, JSON, Parquet, and Avro.

Ultimately I'd like this data to be available in BQ.

Before loading to BQ, my first stop was going to be Google Cloud Storage, so that I have an immutable copy of the data.

The SFTP site will have multiple files representing multiple tables.

/root/table_1
/root/table_2
/root/table_3

I'm first trying to see if I can use a Cloud Data Fusion pipeline to copy the files from SFTP to GCS. This has proven to be challenging.

  1. Can I use Fusion for this?
  2. Do I need to provide the schema for each file or can it be inferred?
  3. Do I need to manually enumerate every table? Ideally I'd like to copy all the files as-is from SFTP to GCS.
  4. Once in GCS I'd like to make an external data source in BigQuery for each file. Is that possible?

Solution

  • Please consider that Cloud Data Fusion is an ETL (Extract-Transform-Load) tool; its pipelines operate on the data inside the files rather than on the files themselves, so it is difficult to define a simple pipeline that just uploads files to Cloud Storage.

    To work around this, I considered using the JavaScript transform to call the client libraries; however, it does not allow importing dependencies, so you cannot use the client libraries, and the authentication step against the services would be complicated.

    Likewise, since the files have different formats, a separate transformation would have to be defined to handle each file type.

    On the other hand, I understand that your usage scenario is:

    1. Extract files from one or more SFTP servers
    2. The files have different formats (csv, json, parquet, and avro)
    3. Files need to be uploaded to Cloud Storage
    4. Each Storage file is associated with a BigQuery table as an external source

    Based on this, I consider it a better option to use an orchestration tool like Cloud Composer.

    Airflow uses a DAG (Directed Acyclic Graph) as a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. In your case, the DAG would have these tasks (see the sketch after this list):

    1. Use the sftp_operator to pull the files from the SFTP servers onto the workers.
    2. Once the files are on the workers, you can use a PythonOperator with the Cloud Storage client library to upload them to a bucket in your project.
    3. Once the files are in Cloud Storage, you can use a PythonOperator with the BigQuery client library, or a BashOperator with the bq command-line tool (bq mk --external_table_definition), to create the tables with each file as an external source.
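
    A minimal sketch of such a DAG covering tasks 1 and 2, assuming an Airflow SFTP connection named sftp_default is already configured and that the bucket name and file paths are placeholders for your own:

    ```python
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.sftp.operators.sftp import SFTPOperator
    from google.cloud import storage

    BUCKET = "my-landing-bucket"          # assumed bucket name
    LOCAL_PATH = "/tmp/table_1"           # where the worker stores the pulled file

    def upload_to_gcs(**_):
        """Upload the file fetched by the SFTP task to Cloud Storage."""
        client = storage.Client()
        blob = client.bucket(BUCKET).blob("raw/table_1")
        blob.upload_from_filename(LOCAL_PATH)

    with DAG(
        dag_id="sftp_to_gcs_to_bq",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # 1. Pull the file from the SFTP server onto the Composer worker
        fetch_file = SFTPOperator(
            task_id="fetch_table_1",
            ssh_conn_id="sftp_default",       # assumed Airflow connection
            remote_filepath="/root/table_1",
            local_filepath=LOCAL_PATH,
            operation="get",
        )

        # 2. Push the local copy to the Cloud Storage landing bucket
        upload_file = PythonOperator(
            task_id="upload_table_1_to_gcs",
            python_callable=upload_to_gcs,
        )

        fetch_file >> upload_file
    ```

    In a real pipeline you would likely generate one pair of tasks per table (or loop over a file listing) instead of hard-coding a single file as above.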

    Another benefit is that you no longer need to worry about the file type, since all the formats you mentioned (CSV, JSON, Parquet, and Avro) are currently supported for creating external tables directly from them.
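
    For task 3, here is a minimal sketch using the google-cloud-bigquery client to define an external table over a file already in Cloud Storage; the project, dataset, table, and URI names are placeholders:

    ```python
    from google.cloud import bigquery

    client = bigquery.Client()

    table_id = "my-project.my_dataset.table_1"

    # Source format can also be "PARQUET", "AVRO", or "NEWLINE_DELIMITED_JSON"
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-landing-bucket/raw/table_1"]
    external_config.autodetect = True     # let BigQuery infer the schema

    table = bigquery.Table(table_id)
    table.external_data_configuration = external_config
    client.create_table(table)            # creates the external table
    ```

    This also answers question 2 above: with autodetect enabled, BigQuery infers the schema from the file, so you do not have to provide it manually for these formats.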