Tags: python, dask, parquet

Dask set column astype not working for me


I'm stuck casting Dask columns to a particular data type. For the sake of simplicity I'll provide the details for a single column PehRecID - a column of floats. I've confirmed that all the values are numeric.

Here's a summary of what I've tried:

Set the dataframe dtypes. I passed in a dict and got the result I expected. When I do print(df.dtypes) I get {'PehRecID': 'float64', 'PehAccrualCode': 'object', ...}, so I've succeeded in setting the dtypes.

Explicitly cast the column to float64 with this: df['PehRecID'] = df['PehRecID'].astype('float64')

When I try df.to_parquet('foo.parquet', engine='pyarrow') I get ValueError: could not convert string to float: 'PehRecID'

When I try print(df.head()) I also get ValueError: could not convert string to float: 'PehRecID'

So it seems the issue is with Dask, not Parquet.

The files I'm working with are sometimes too large for Pandas but not huge. In this case I'm actually experimenting with a fairly small file to get the basics right.


Solution

  • When working with dask, it's critical to know that dask evaluates tasks lazily and asynchronously: when you enter a command, dask schedules it but does not carry it out until the result is actually needed (because of a write, compute, print, head, or another command that requires the results of that step). That means an error can be caused by one step but not surface until a couple of commands later, at which point it may be too late to recover.

    In your case, df['PehRecID'] = df['PehRecID'].astype('float64') appears to be the culprit. You're getting the error ValueError: could not convert string to float: 'PehRecID' at a later step, such as df.head(), because you've assigned the result of a command with an error to a dataframe column, rendering that column unusable.

    As a very simple example, let's create a dask dataframe with four string values, the first three of which can be converted to int, the last of which can't:

    In [4]: df = ddf.from_pandas(
       ...:     pd.DataFrame({'A': ['1', '2', '3', 'not a value']}),
       ...:     npartitions=2,
       ...: )
       ...:
    
    In [5]: df
    Out[5]:
    Dask DataFrame Structure:
                        A
    npartitions=2
    0              object
    2                 ...
    3                 ...
    Dask Name: from_pandas, 2 tasks
    

    Note that calling df.astype(int) does not raise an error: you have only scheduled the operation, you haven't actually carried it out:

    In [6]: df.astype(int)
    Out[6]:
    Dask DataFrame Structure:
                       A
    npartitions=2
    0              int64
    2                ...
    3                ...
    Dask Name: astype, 4 tasks
    

    Note that dtypes now shows int64, as this is the data type the .astype(int) operation is expected to produce.

    Computing the result does raise the expected error:

    In [7]: df.astype(int).compute()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-7-05b64497024c> in <module>
    ----> 1 df.astype(int).compute()
    ...
    ValueError: invalid literal for int() with base 10: 'not a value'
    

    You can run into trouble with this if you assign results in-place:

    In [8]: df['A'] = df['A'].astype(int)
    

    Again, the dataframe's dtypes have been changed to reflect the expected output of .astype(int):

    In [9]: df.dtypes
    Out[9]:
    A    int64
    dtype: object
    

    The effect is that now df cannot be computed:

    In [10]: df.compute()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-14-9bb416d45ef6> in <module>
    ----> 1 df.compute()
    ...
    ValueError: invalid literal for int() with base 10: 'not a value'
    

    Note that this can be masked if the error does not occur in the requested partition. In my example, the error occurs in the second partition, so df.head(), which only uses the first partition, only triggers the astype(int) operation on the first partition and does not report an error:

    In [11]: df.head()
    /Users/delgadom/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/dask/dataframe/core.py:6383: UserWarning: Insufficient elements for `head`. 5 elements requested, only 2 elements available. Try passing larger `npartitions` to `head`.
      warnings.warn(msg.format(n, len(r)))
    Out[11]:
       A
    0  1
    1  2
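
    Computing each partition individually does surface the error and shows exactly where it lives. Here's a sketch, rebuilding the same toy dataframe and using get_partition:

    ```python
    import pandas as pd
    import dask.dataframe as ddf

    # Rebuild the toy dataframe: the bad value lives in the second partition
    df = ddf.from_pandas(
        pd.DataFrame({'A': ['1', '2', '3', 'not a value']}),
        npartitions=2,
    )
    df['A'] = df['A'].astype(int)

    # The first partition converts cleanly
    print(df.get_partition(0).compute())

    # The second partition raises, because it holds 'not a value'
    try:
        df.get_partition(1).compute()
    except ValueError as exc:
        print('partition 1 failed:', exc)
    ```

    Checking partitions one at a time like this can help narrow down where the unconvertible values live in a larger dataframe.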
    

    This can be impossible to recover from without scrapping the column or dataframe entirely and re-reading the data, as you have overwritten the contents of column A with a scheduled operation that reliably raises an error.

    So I think the answer to your particular issue is that your data is not clean: somewhere in that column there are strings that cannot be converted to float.
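
    To find the offending values without triggering the exception, one option is to coerce the column with pd.to_numeric via map_partitions, which turns unparseable strings into NaN instead of raising. A sketch, where the sample values stand in for your real data:

    ```python
    import pandas as pd
    import dask.dataframe as ddf

    # Toy stand-in for the real data: one value cannot be parsed as a float
    df = ddf.from_pandas(
        pd.DataFrame({'PehRecID': ['1.5', '2.0', 'oops', '4.25']}),
        npartitions=2,
    )

    # errors='coerce' maps unparseable strings to NaN instead of raising
    coerced = df['PehRecID'].map_partitions(
        pd.to_numeric, errors='coerce', meta=('PehRecID', 'float64')
    )

    # Rows that were non-null originally but failed to parse
    bad = df[coerced.isna() & df['PehRecID'].notnull()]
    print(bad.compute())
    ```

    Once you've found and cleaned (or deliberately coerced) the bad values, the astype('float64') cast and the parquet write should go through.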