I'm using BigQuery as my database with a Scrapy spider.
Below are two pipelines that store data in the database: one uses the insert method, the other the update method. The update method is roughly 70 times slower than the insert (merely 20 updated records per minute): an update takes about 3.56 seconds while an insert takes only 0.05 seconds. Where am I going wrong, and how can I speed up the update method?
P.S. The current table size is around 20k records; the potential size can be as large as 500,000 records. I need to update records daily.
Update method
data = item
# Define the update query
query = f"""
    UPDATE `{self.dataset_id}.{self.table_id}`
    SET `Sold Status` = '{data['Sold Status']}',
        `Amount of Views` = '{data['Amount of Views']}',
        `Amount of Likes` = '{data['Amount of Likes']}',
        `Sold Date & Time` = '{data['Sold Date & Time']}'
    WHERE `Item number` = '{data['Item number']}'
"""
start_time = time.time()
# Run the update query
job = self.client.query(query)
# Wait for the job to complete
job.result()
# Check if the query was successful
if job.state == 'DONE':
    print('Update query executed successfully.')
else:
    print('Update query failed.')
end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
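For reference, the same per-row update can be written with BigQuery query parameters instead of f-string interpolation. This does not remove the per-job DML overhead, so it won't by itself close the gap with inserts, but it avoids quoting and escaping issues with values like `Sold Date & Time`. A minimal sketch, assuming the same column names and a plain dict `data`; the function name, parameter names and STRING types are placeholders, not from my pipeline:

from google.cloud import bigquery

def run_update(client, dataset_id, table_id, data):
    # Same UPDATE as above, but values are passed as named parameters (@...)
    query = f"""
        UPDATE `{dataset_id}.{table_id}`
        SET `Sold Status` = @sold_status,
            `Amount of Views` = @views,
            `Amount of Likes` = @likes,
            `Sold Date & Time` = @sold_date
        WHERE `Item number` = @item_number
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("sold_status", "STRING", data['Sold Status']),
            bigquery.ScalarQueryParameter("views", "STRING", data['Amount of Views']),
            bigquery.ScalarQueryParameter("likes", "STRING", data['Amount of Likes']),
            bigquery.ScalarQueryParameter("sold_date", "STRING", data['Sold Date & Time']),
            bigquery.ScalarQueryParameter("item_number", "STRING", data['Item number']),
        ]
    )
    # Each DML statement still runs as its own BigQuery job
    client.query(query, job_config=job_config).result()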
Insert method
start_time = time.time()
data = item
slug = data['slug']
if slug in self.ids_seen:
    raise DropItem("Duplicate item found: {}".format(slug))
else:
    data.pop('slug', None)
    self.ids_seen.add(slug)
    table_ref = self.client.dataset(self.dataset_id).table(self.table_id)
    # Define the rows to be inserted
    rows = [data]
    # Insert rows into the table
    errors = self.client.insert_rows_json(table_ref, rows)
    if errors == []:
        print("Rows inserted successfully.")
    else:
        print("Encountered errors while inserting rows:", errors)
end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)
return item
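As a side note on the insert pipeline: `insert_rows_json` already accepts a list of rows, so items can be buffered and streamed in batches instead of one API call per item. A rough sketch of that idea; the class name, batch size, and `_flush` helper are illustrative placeholders, not part of the pipeline above:

import logging

from google.cloud import bigquery
from scrapy.exceptions import DropItem


class BigQueryBatchInsertPipeline:
    """Buffers scraped items and streams them to BigQuery in batches."""

    BATCH_SIZE = 500  # illustrative batch size

    def __init__(self, dataset_id="my_dataset", table_id="my_table"):
        # dataset_id / table_id are placeholders
        self.client = bigquery.Client()
        self.dataset_id = dataset_id
        self.table_id = table_id
        self.ids_seen = set()
        self.buffer = []

    def process_item(self, item, spider):
        data = dict(item)
        slug = data.pop('slug', None)
        if slug in self.ids_seen:
            raise DropItem("Duplicate item found: {}".format(slug))
        self.ids_seen.add(slug)
        self.buffer.append(data)
        if len(self.buffer) >= self.BATCH_SIZE:
            self._flush()
        return item

    def close_spider(self, spider):
        # Push whatever is left when the spider finishes
        if self.buffer:
            self._flush()

    def _flush(self):
        table_ref = self.client.dataset(self.dataset_id).table(self.table_id)
        errors = self.client.insert_rows_json(table_ref, self.buffer)
        if errors:
            logging.error("Encountered errors while inserting rows: %s", errors)
        self.buffer = []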
OK, so thanks to ChatGPT I found a workaround for this issue. What used to take my code 9 hours now takes under 15 minutes, a roughly 36-fold improvement.
Basically, what I do is: