I'm stuck casting Dask columns to a particular data type. For the sake of simplicity I'll provide the details for a single column PehRecID - a column of floats. I've confirmed that all the values are numeric.
Here's a summary of what I've tried:
Set the dataframe dtypes. I passed in a dict and got the results I expected. When I run print(df.dtypes) I get {'PehRecID': 'float64', 'PehAccrualCode': 'object', ...}, so I've succeeded in setting the dtypes.
Explicitly cast the column to float64 with this:
df['PehRecID'] = df['PehRecID'].astype('float64')
When I try df.to_parquet('foo.parquet', engine='pyarrow') I get ValueError: could not convert string to float: 'PehRecID'. When I try print(df.head()) I get the same ValueError.
So it seems the issue is with Dask, not Parquet.
The files I'm working with are sometimes too large for Pandas but not huge. In this case I'm actually experimenting with a fairly small file to get the basics right.
When working with dask, it's critical to know that dask evaluates tasks lazily and asynchronously, meaning that when you enter a command, dask schedules but does not carry out the command until the result is needed (because of a write, compute, print, head, or another command that requires the results to be computed at that step). That means an error can be caused by one step but not appear until a couple of commands later, at which point it may be too late to recover.
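As a quick illustration of that laziness, here is a minimal sketch using dask.delayed (the divide function is just for demonstration):

import dask

@dask.delayed
def divide(a, b):
    return a / b

result = divide(1, 0)  # no error yet - dask has only built a task graph
result.compute()       # only now does the ZeroDivisionError surface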
In your case, df['PehRecID'] = df['PehRecID'].astype('float64') appears to be the culprit. You're getting the error ValueError: could not convert string to float: 'PehRecID' at a later step, such as df.head(), because you've assigned the result of a failing command to a dataframe column, rendering that column unusable.
As a very simple example, let's create a dask dataframe with four string values, the first three of which can be converted to int, the last of which can't:
In [1]: import pandas as pd

In [2]: import dask.dataframe as ddf

In [4]: df = ddf.from_pandas(
   ...:     pd.DataFrame({'A': ['1', '2', '3', 'not a value']}),
   ...:     npartitions=2,
   ...: )
In [5]: df
Out[5]:
Dask DataFrame Structure:
                    A
npartitions=2
0              object
2                 ...
3                 ...
Dask Name: from_pandas, 2 tasks
Note that calling df.astype(int) does not raise an error - this is because you have only scheduled the operation; you haven't actually carried it out:
In [6]: df.astype(int)
Out[6]:
Dask DataFrame Structure:
                    A
npartitions=2
0               int64
2                 ...
3                 ...
Dask Name: astype, 4 tasks
Note that the dtype now shows int64, as this is the data type expected to result from the .astype(int) operation.
Computing the result does raise the expected error:
In [7]: df.astype(int).compute()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-05b64497024c> in <module>
----> 1 df.astype(int).compute()
...
ValueError: invalid literal for int() with base 10: 'not a value'
You can run into trouble with this if you assign results in-place:
In [8]: df['A'] = df['A'].astype(int)
Again, the dataframe's dtypes have been changed to reflect the expected output of .astype(int):
In [9]: df.dtypes
Out[9]:
A    int64
dtype: object
The effect is that now df cannot be computed:
In [10]: df.compute()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-9bb416d45ef6> in <module>
----> 1 df.compute()
...
ValueError: invalid literal for int() with base 10: 'not a value'
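One way to avoid painting yourself into this corner (a sketch, assuming a freshly loaded df and that you can afford one extra pass over the column) is to materialize the cast on a temporary name before committing it to the dataframe:

# trigger the cast eagerly, so a bad value fails fast
# before anything is assigned back to df
converted = df['A'].astype(int)
converted.compute()   # raises ValueError here if any value can't convert
df['A'] = converted   # only reached once the cast is known to be safe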
Note that the error can be masked if it does not occur in the requested partition. In my example, the error occurs in the second partition, so df.head(), which only uses the first partition, only triggers the astype(int) operation on the first partition and does not report an error:
In [11]: df.head()
/Users/delgadom/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/dask/dataframe/core.py:6383: UserWarning: Insufficient elements for `head`. 5 elements requested, only 2 elements available. Try passing larger `npartitions` to `head`.
warnings.warn(msg.format(n, len(r)))
Out[11]:
   A
0  1
1  2
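One way to surface such masked errors is to force every partition to evaluate, not just the first. A hedged sketch using dask's per-partition accessor df.partitions:

# compute each partition individually so no partition's errors stay hidden
for i in range(df.npartitions):
    try:
        df.partitions[i].compute()
    except ValueError as e:
        print(f'partition {i} fails: {e}')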
The in-place assignment in In [8] can be impossible to recover from without scrapping the column or dataframe entirely and re-reading the data, as you have overwritten the contents of column A with a scheduled computation that reliably raises an error.
So I think the answer to your particular issue is that your data is not clean - you do have strings in the column somewhere. In fact, the failing value in your traceback is the column name itself ('PehRecID'), which hints that a header row may have ended up in the data.
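As a practical next step, you could re-read the raw file and scan the column for values that can't be parsed as numbers before casting anything. A minimal sketch using pandas' to_numeric with errors='coerce' via map_partitions (the helper name find_bad is mine; df here is the freshly re-read dataframe):

import pandas as pd

def find_bad(part):
    # keep only the rows whose PehRecID can't be parsed as a number
    return part[pd.to_numeric(part['PehRecID'], errors='coerce').isna()]

bad_rows = df.map_partitions(find_bad).compute()
print(bad_rows)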