python pandas dataframe numpy data-preprocessing

How to convert a pandas DataFrame column containing string objects to a NumPy array?


I am working on a project and have to do some data preprocessing. I have a dataframe that looks like this (this is just an example for simplification):

index | pixels 
0     | 10 20 30 40 
1     | 11 12 13 14

I want to convert it to a NumPy array of shape (2, 2, 2, 1). The dtype of the pixels column is object. Is there a way to do this without loops? I have a 28k-row dataframe with big images; I tried looping, but it takes too long to execute on my machine.


Solution

  • Use str.split + astype + to_numpy + reshape:

    a = (
        df['pixels'].str.split(' ', expand=True)
            .astype(int).to_numpy()
            .reshape((2, 2, 2, 1))
    )
    

    a:

    [[[[10]
       [20]]
    
      [[30]
       [40]]]
    
    
     [[[11]
       [12]]
    
      [[13]
       [14]]]]
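
    To see why the reshape lines up, the chain above can be split into its intermediate steps. This is only a sketch on the same two-row example; the names `parts` and `flat` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'pixels': ['10 20 30 40', '11 12 13 14']})

# expand=True yields one column per pixel value (still strings at this point)
parts = df['pixels'].str.split(' ', expand=True)
print(parts.shape)

# astype(int) converts the strings; to_numpy() gives a 2-D integer array
flat = parts.astype(int).to_numpy()
print(flat.shape)

# reshape regroups each row of 4 values into a 2x2 image with 1 channel
a = flat.reshape((2, 2, 2, 1))
print(a.shape)
```

    Each row of the intermediate array holds one flattened image, so the reshape only regroups values and never mixes pixels from different rows.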
    

    Complete Working Example:

    import pandas as pd
    
    df = pd.DataFrame({'pixels': ['10 20 30 40', '11 12 13 14']})
    
    a = (
        df['pixels'].str.split(' ', expand=True)
            .astype(int).to_numpy()
            .reshape((2, 2, 2, 1))
    )
    print(a)
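
    The shape (2, 2, 2, 1) is hard-coded for the two-row example. For the real 28k-row dataframe, a possible variation is to let NumPy infer the number of images with -1. The sketch below assumes every row holds a square, single-channel image (that is an assumption, since the question does not state the image size), and it uses `str.split()` without a separator so that stray leading or trailing spaces do not produce empty strings.

```python
import math

import pandas as pd

df = pd.DataFrame({'pixels': ['10 20 30 40', '11 12 13 14']})

# Splitting on any whitespace tolerates trailing spaces in the strings
flat = df['pixels'].str.split(expand=True).astype(int).to_numpy()

# Assumption: images are square, so side = sqrt(number of values per row)
side = math.isqrt(flat.shape[1])
if side * side != flat.shape[1]:
    raise ValueError("rows do not contain square images")

# -1 lets NumPy infer the number of images from the data
a = flat.reshape(-1, side, side, 1)
print(a.shape)
```

    If the images are not square, replace `side, side` with the known height and width.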