python-3.x pandas numpy dataframe string-conversion

Trouble converting matrix stored as string in excel to numpy array when using pandas dataframe

I am having a hard time reading in an excel file with pandas DataFrame and converting a stored matrix to a numpy array. I think part of the issue is that the matrix is improperly stored. I have no control over the spreadsheet however, this is how it was sent to me.

For instance this is the string stored in a cell

[[[ 0.        0.        0.107851]
  [ 0.        0.       -0.862809]]]

I read in the row with DataFrame, and save each cell to a variable. I then try to convert this particular variable to a np.array since those number represent two sets of x, y, z coordinates.

I have tried np.fromstring and np.asarray to no avail. It will convert the string to a numpy array but it will be a terrible mess with the brackets still inside as characters. I have tried using np.squeeze to get rid of the brackets but it says dimension is not 1.

if I use np.asarray(item._coord, dtype=float) then it fails saying it cannot convert the string to float.

ValueError: could not convert string to float: '[[[ 0. 0. 0.107851] [ 0. 0. -0.862809]]]'

There is a '\n' that shows up in the middle of it, between the two lists. I use df = df.replace(r'\n', ' ',regex=True)' to clean out the\n`'s prior to data conversion attempts.

I am stuck

Solution

Use custom function for convert to numpy array after read_excel:

a= np.array([[[ 0.,        0.,        0.107851],
              [ 0.,        0.,       -0.862809]]])
print (a)
[[[ 0.        0.        0.107851]
  [ 0.        0.       -0.862809]]]

df = pd.DataFrame({'col':[a,a,a]})
print (df)
                                               col
0  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
1  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
2  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]

df.to_excel('test.xlsx', index=False)

import re
import ast
import numpy as np

#https://stackoverflow.com/a/44323021
def str2array(s):
    # Remove space after [
    s=re.sub('\[ +', '[', s.strip())
    # Replace commas and spaces
    s=re.sub('[,\s]+', ', ', s)
    return np.array(ast.literal_eval(s))

df = pd.read_excel('test.xlsx')

df['col'] = df['col'].apply(str2array)
print (df)
                                               col
0  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
1  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]
2  [[[0.0, 0.0, 0.107851], [0.0, 0.0, -0.862809]]]