python-3.x, google-cloud-platform, google-bigquery, scrapy, scrapy-pipeline

Google BigQuery Update is 70x slower than Insert. How to fix?


I'm using BigQuery as my DB with a Scrapy spider. Below are two pipelines that store data into the DB: one uses the Insert method, the other Update. The Update method is 70 times slower than Insert (merely 20 updated records per minute): an Update takes 3.56 seconds while an Insert takes only 0.05 seconds. Where am I wrong, and how can I speed up the Update method?

P.S. The current table size is around 20k records; the potential size can be as large as 500,000 records. I need to update records daily.

Update method

# Define the update query
query = f"""
        UPDATE `{self.dataset_id}.{self.table_id}`
        SET `Sold Status` = '{data['Sold Status']}',
            `Amount of Views` = '{data['Amount of Views']}',
            `Amount of Likes` = '{data['Amount of Likes']}',
            `Sold Date & Time` = '{data['Sold Date & Time']}'
        WHERE `Item number` = '{data['Item number']}'
    """
start_time = time.time()
# Run the update query
job = self.client.query(query)

# Wait for the job to complete
job.result()

# Check if the query was successful
if job.state == 'DONE':
    print('Update query executed successfully.')
else:
    print('Update query failed.')
end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)

return item  

Insert method

start_time = time.time()
data = item
slug = data['slug']
if slug in self.ids_seen:
    raise DropItem("Duplicate item found: {}".format(slug))
else:
    data.pop('slug', None)
    self.ids_seen.add(slug)
    table_ref = self.client.dataset(self.dataset_id).table(self.table_id)

    # Define the rows to be inserted
    rows = [
        data
    ]

    # Insert rows into the table
    errors = self.client.insert_rows_json(table_ref, rows)

    if errors == []:
        print("Rows inserted successfully.")
    else:
        print("Encountered errors while inserting rows:", errors) 
    end_time = time.time()
    execution_time = end_time - start_time
    logging.info(execution_time)
    
    return item

Solution

  • OK, so thanks to ChatGPT I found a workaround for this issue. What used to take my code 9 hours now takes under 15 minutes, a 36-fold improvement.

    Basically, what I do is (a code sketch follows the list):

    1. Create a new temporary table.
    2. Batch-append all scraped data as JSON to this table.
    3. Run an Update command from the new table into my master table (which takes merely 5 seconds overall instead of ~5 seconds per update query. How come? Presumably because every DML statement carries a fixed job overhead, so one set-based statement amortizes that cost across all rows.)
    4. Truncate the temporary table to get it ready for the next run a couple of hours later. (FYI: you can't use a truncated table immediately; something between 2 and 15 minutes should pass before the table is ready for inserts.) So far I'm pretty happy with the results and will stick to this solution for a while.
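
    For reference, here is a minimal sketch of that flow using the google-cloud-bigquery client, in the same style as the pipelines above. The dataset and table names (my_dataset, items, items_staging) and the sample row values are placeholders, not the real schema; the column names are taken from the question, and step 3 uses a plain UPDATE ... FROM (a MERGE would work just as well).

    import logging

    from google.cloud import bigquery

    client = bigquery.Client()
    dataset_id = "my_dataset"        # placeholder
    master_table = "items"           # placeholder
    staging_table = "items_staging"  # placeholder

    # Steps 1-2: batch-append all scraped rows to the temporary table in one call
    # instead of issuing one DML statement per record. If the streaming buffer
    # interferes with the later TRUNCATE, a batch load job
    # (client.load_table_from_json) is an alternative way to append.
    rows = [
        {
            "Item number": "123456",   # placeholder values
            "Sold Status": "Sold",
            "Amount of Views": "42",
            "Amount of Likes": "7",
            "Sold Date & Time": "2023-05-01 12:00:00",
        },
        # ... every other scraped record ...
    ]
    staging_ref = client.dataset(dataset_id).table(staging_table)
    errors = client.insert_rows_json(staging_ref, rows)
    if errors:
        logging.error("Errors while appending to staging table: %s", errors)

    # Step 3: one set-based update from the temporary table into the master table.
    update_query = f"""
        UPDATE `{dataset_id}.{master_table}` m
        SET `Sold Status` = s.`Sold Status`,
            `Amount of Views` = s.`Amount of Views`,
            `Amount of Likes` = s.`Amount of Likes`,
            `Sold Date & Time` = s.`Sold Date & Time`
        FROM `{dataset_id}.{staging_table}` s
        WHERE m.`Item number` = s.`Item number`
    """
    client.query(update_query).result()

    # Step 4: empty the temporary table for the next run.
    client.query(f"TRUNCATE TABLE `{dataset_id}.{staging_table}`").result()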