
Oracle to Big Query Tables


I am new to BigQuery and learning it. I have a requirement to load close to 300 tables from an Oracle source into BigQuery staging tables. What is the recommended way to load the data? I know I can use Dataflow for that, but do I have to create 300 Dataflow tasks, or can I create a single job that iterates over the tables? Please share your experience and different ways to do it. Thanks a lot.

Regards, Venkat.


Solution

  • In my experience, when we wanted to migrate our data warehouse into BigQuery, I didn't use Dataflow or any other tool. I just exported the tables into CSV files and then used Python code to iterate over the files and upload them into BigQuery: https://cloud.google.com/bigquery/docs/loading-data-local#python

    Or you can upload them into GCS and then load them into BigQuery. If it's a daily operation, I think it's easier to maintain a single piece of code that iterates over a list of tables, extracts them, and appends them into BigQuery tables than to create 300 separate tasks.
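
    A minimal sketch of that CSV approach, assuming the tables have already been exported to one CSV file per table (the project, dataset, table names, and file paths below are placeholders), using the google-cloud-bigquery client from the linked docs:

    from google.cloud import bigquery

    client = bigquery.Client(project='project-id')

    # one exported CSV file per Oracle table; names are placeholders
    tables = ['customers', 'orders', 'invoices']

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,              # skip the CSV header row
        autodetect=True,                  # let BigQuery infer the schema
        write_disposition='WRITE_APPEND',
    )

    for table in tables:
        with open(f'{table}.csv', 'rb') as source_file:
            job = client.load_table_from_file(
                source_file, f'project-id.dataset_id.{table}', job_config=job_config)
        job.result()  # wait for the load job to finish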

    Update:

    Example code for reading data from Oracle into BigQuery using pandas-gbq:

    import cx_Oracle  # Oracle driver used by SQLAlchemy under the hood
    import pandas as pd
    from sqlalchemy import create_engine

    # use ?service_name=... instead of a trailing /SID if you connect via a service name
    engine = create_engine('oracle://user:password@host_or_scan_address:1521/ORACLE_SERVICE_NAME')

    # read the table in chunks of 5000 rows to keep memory usage low
    results = pd.read_sql('select * from table_name', engine, chunksize=5000)
    if_exists = 'append'  # or 'replace'
    schema = None  # specify a BigQuery schema here if you don't want schema autodetection
    for result in results:
        result.to_gbq(destination_table='dataset_id.table_id', project_id='project-id',
                      table_schema=schema, if_exists=if_exists)
    
    

    You can drop the chunksize argument if you want to load the data as a single chunk, but this might consume a lot of memory if the table is big:

    
    
    results = pd.read_sql('select * from table_name', engine)
    if_exists = 'append'  # or 'replace'
    schema = None  # specify a BigQuery schema here if you don't want schema autodetection
    results.to_gbq(destination_table='dataset_id.table_id', project_id='project-id',
                   table_schema=schema, if_exists=if_exists)