python pandas unicode byte

How do I convert bytes to utf-8 without turning regular strings into NaNs?


I have a process that runs on multiple pandas dataframes. Sometimes the data comes in the form of bytes, such as:

>>> df[['x']]
['x']
b'123'
b'111'
b'110'

And other times it comes in the form of regular integers:

>>> df[['x']]
['x']
80
123
491

I want to decode the bytes as UTF-8 and leave the regular integers untouched. So far I have tried df['x'].str.decode('utf-8'): it works when the column holds bytes, but it turns every value into NaN when the column holds integers.

I want the solution to be vectorized because speed is important. I can't use a list comprehension, for example.


Solution

  • You can define a function that checks whether a value is bytes before decoding it. Something like:

    import pandas as pd
    
    # Decode bytes to str; return any other value (e.g. an int) unchanged
    def decode_if_bytes(value):
        if isinstance(value, bytes):
            return value.decode('utf-8')
        return value
    

    Decode a df with mixed bytes and integers

    df = pd.DataFrame({'x':[b'80',123,491]})
    # Apply the function to each element of column 'x'
    df['x'] = df['x'].apply(decode_if_bytes)
    
    print(df)
    

    Output:

         x
    0   80
    1  123
    2  491
    

    Decode a df that contains only bytes

    df = pd.DataFrame({'x':[b'123',b'111',b'110']})
    df['x'] = df['x'].apply(decode_if_bytes)
    
    print(df)
    

    Output:

         x
    0  123
    1  111
    2  110
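
Since the question asks for something closer to vectorized, and `apply` calls the Python function once per element, here is a minimal sketch of an alternative built on `.str.decode` plus `Series.where`. The helper name `decode_bytes_column` is hypothetical, not part of pandas. It assumes that a non-object column (such as int64) never holds bytes, and that NaN in the decoded result marks an entry that was not bytes. The `.str` methods still loop over Python objects internally, so benchmark it on your own data before assuming it is faster.

```python
import pandas as pd


def decode_bytes_column(col: pd.Series, encoding: str = "utf-8") -> pd.Series:
    """Hypothetical helper: decode bytes entries, leave everything else as-is."""
    # Assumption: numeric dtypes such as int64 cannot contain bytes.
    if col.dtype != object:
        return col
    try:
        # Entries that are not bytes come back as NaN here.
        decoded = col.str.decode(encoding)
    except AttributeError:
        # pandas rejects the .str accessor on some columns,
        # e.g. an object column that holds only integers.
        return col
    # Keep the original value where decoding gave NaN, else take the decoded str.
    return col.where(decoded.isna(), decoded)


mixed = pd.DataFrame({"x": [b"80", 123, 491]})
mixed["x"] = decode_bytes_column(mixed["x"])
print(mixed["x"].tolist())
```

The early return on dtype means an all-integer dataframe is handed back without touching any element, which is likely where most of the time is saved in a pipeline that sees both kinds of input.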