Search code examples
python-3.xpandasnumpydataframestring-conversion

Trouble converting matrix stored as string in excel to numpy array when using pandas dataframe


I am having a hard time reading in an excel file with pandas DataFrame and converting a stored matrix to a numpy array. I think part of the issue is that the matrix is improperly stored. I have no control over the spreadsheet however, this is how it was sent to me.

For instance this is the string stored in a cell

[[[ 0.        0.        0.107851]
  [ 0.        0.       -0.862809]]]

I read in the row with DataFrame, and save each cell to a variable. I then try to convert this particular variable to a np.array since those number represent two sets of x, y, z coordinates.

I have tried np.fromstring and np.asarray to no avail. It will convert the string to a numpy array but it will be a terrible mess with the brackets still inside as characters. I have tried using np.squeeze to get rid of the brackets but it says dimension is not 1.

if I use np.asarray(item._coord, dtype=float) then it fails saying it cannot convert the string to float.

ValueError: could not convert string to float: '[[[ 0. 0. 0.107851] [ 0. 0. -0.862809]]]'

There is a '\n' that shows up in the middle of it, between the two lists. I use df = df.replace(r'\n', ' ',regex=True)' to clean out the\n`'s prior to data conversion attempts.

I am stuck


Solution

  • Use custom function for convert to numpy array after read_excel:

    a= np.array([[[ 0.,        0.,        0.107851],
                  [ 0.,        0.,       -0.862809]]])
    print (a)
    [[[ 0.        0.        0.107851]
      [ 0.        0.       -0.862809]]]
    
    df = pd.DataFrame({'col':[a,a,a]})
    print (df)
                                                   col
    0  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
    1  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
    2  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
    
    df.to_excel('test.xlsx', index=False)
    

    import re
    import ast
    import numpy as np
    
    #https://stackoverflow.com/a/44323021
    def str2array(s):
        # Remove space after [
        s=re.sub('\[ +', '[', s.strip())
        # Replace commas and spaces
        s=re.sub('[,\s]+', ', ', s)
        return np.array(ast.literal_eval(s))
    
    df = pd.read_excel('test.xlsx')
    
    df['col'] = df['col'].apply(str2array)
    print (df)
                                                   col
    0  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
    1  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
    2  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]