I have a pandas dataframe with the following dtypes:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 579585 entries, 0 to 579613
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 itemName 579585 non-null object
1 itemId 579585 non-null string
2 Count 579585 non-null int32
3 Sales 579585 non-null float64
4 Date 579585 non-null datetime64[ns]
5 Unit_margin 579585 non-null float64
6 GrossProfit 579585 non-null float64
dtypes: datetime64[ns](1), float64(3), int32(1), object(1), string(1)
memory usage: 33.2+ MB
I upload it to a BigQuery table using:
df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>', chunksize=None, if_exists='append', auth_local_webserver=False, location=None, progress_bar=True)
Everything seems to work well except that the itemId column, which is a string, becomes a float, and all leading 0s (which I need) are therefore dropped wherever there are any.
I could of course define a schema for my table, but I want to avoid that. What am I missing?
The problem is with the to_gbq step. For some reason its output omits the quotes around the data field, and without quotes BigQuery interprets the value as a number.
BigQuery needs this format:
{"itemId": "12345", "mappingId":"abc123"}
You sent this format:
{"itemId": 12345, "mappingId":abc123}
One solution in this case is to cast the itemId field in pandas with the astype method (see the pandas documentation on astype for more detail). For example:
df['itemId'] = df['itemId'].astype('str')
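As a minimal sketch of the full round trip, with a hypothetical two-row frame and placeholder dataset and project names:
import pandas as pd

# hypothetical sample frame; '012345' carries a leading zero that must survive the upload
df = pd.DataFrame({'itemId': ['012345', '67890'], 'Sales': [10.0, 20.0]})

# force a plain string dtype so the values are serialized as quoted strings
df['itemId'] = df['itemId'].astype('str')

# placeholders for your dataset and project
df.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>', if_exists='append', progress_bar=True)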
Another option is to use the table_schema parameter of the to_gbq method, listing the BigQuery table fields to which the DataFrame columns should conform:
[{'name': 'col1', 'type': 'STRING'},...]
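Applied to this table, a sketch could look like the following. The schema lists the columns from the question; the type names are assumptions mapping the DataFrame dtypes to BigQuery types (Date is mapped to TIMESTAMP here, adjust to DATE if that fits the table better). The key entry is itemId as STRING, which prevents it from being inferred as a number:
schema = [
    {'name': 'itemName', 'type': 'STRING'},
    {'name': 'itemId', 'type': 'STRING'},
    {'name': 'Count', 'type': 'INTEGER'},
    {'name': 'Sales', 'type': 'FLOAT'},
    {'name': 'Date', 'type': 'TIMESTAMP'},
    {'name': 'Unit_margin', 'type': 'FLOAT'},
    {'name': 'GrossProfit', 'type': 'FLOAT'},
]

df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>',
                        if_exists='append', table_schema=schema, progress_bar=True)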
As a last option, you can switch to google-cloud-bigquery instead of pandas-gbq; see the comparison between the two libraries in the BigQuery documentation.
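A minimal sketch with google-cloud-bigquery, assuming default application credentials and the same placeholder project and table names; the explicit SchemaField pins itemId to STRING while the remaining columns are inferred from the DataFrame dtypes:
from google.cloud import bigquery

client = bigquery.Client(project='<MY PROJECT>')

# pin itemId to STRING so leading zeros are preserved
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField('itemId', 'STRING')],
    write_disposition='WRITE_APPEND',
)

job = client.load_table_from_dataframe(
    df_extended_full, '<MY DATASET>.profit', job_config=job_config
)
job.result()  # wait for the load job to finish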