Search code examples
pythonpandasdatabasedataframedata-analysis

Create a dictionary with multiple Split of data frame from big uneven Pandas data frame


I have a messy big CSV file with lot of Nan values and I read the dataframe using pd.read_csv(file, names = range(int)). I wanted to split this data into multiple dataframe and store in dictionary using given key in data. I prepared a simple example to explain my problem.

Example rawdata: My data looks like similarly as given but with more number of columns and rows.

import pandas as pd
import numpy as np
df = pd.DataFrame(columns=([1,2,3,4]))
df.loc[0,:] =  ['Home -AA',np.nan,np.nan,np.nan]
df.loc[1,:] =  ['place/time','value1','value2','value3']
df.loc[2,:] = ['Home time1',1, 2, 3]
df.loc[3,:] = ['Home time2',4, 5, 6]
df.loc[4,:] = ['Home time3',7, 8, 9]
df.loc[5,:] = ['sum',11,np.nan , np.nan] 
df.loc[6,:] = ['agg',12,np.nan , np.nan] 
df.loc[7,:] = ['max',6,np.nan , np.nan] 
df.loc[8,:] = ['min',8,np.nan , np.nan] 
df.loc[9,:] = ['med',1,np.nan , np.nan] 
df.loc[10,:] =  ['Home -BB',np.nan,np.nan,np.nan]
df.loc[11,:] =  ['place/time','value1','value2','value3']
df.loc[12,:] = ['Home time1',11, 12, 13]
df.loc[13,:] = ['Home time2',14, 15, 16]
df.loc[14,:] = ['Home time3',17, 18, 19]
df.loc[15,:] = ['sum',101,np.nan , np.nan] 
df.loc[16,:] = ['agg',122,np.nan , np.nan] 
df.loc[17,:] = ['max',62,np.nan , np.nan] 
df.loc[18,:] = ['min',83,np.nan , np.nan] 
df.loc[19,:] = ['med',12,np.nan , np.nan] 
df.loc[20,:] =  ['Home -CC',np.nan,np.nan,np.nan]
df.loc[21,:] =  ['place/time','value1','value2','value3']
df.loc[22,:] =  ['Home -DD',np.nan,np.nan,np.nan]
df.loc[23,:] =  ['place/time','value1','value2','value3']
df.loc[24,:] =  ['Home -EE',np.nan,np.nan,np.nan]
df.loc[25,:] =  ['place/time','value1','value2','value3']
df.loc[26,:] =  ['Home -FF',np.nan,np.nan,np.nan]
df.loc[27,:] =  ['place/time','value1','value2','value3']
df.loc[28,:] = ['Home time1',211, 212, 213]
df.loc[29,:] = ['Home time1',212, 213, 214]
df.loc[30,:] = ['sum',115,np.nan , np.nan] 
df.loc[31,:] = ['agg',124,np.nan , np.nan] 
df.loc[32,:] = ['max',65,np.nan , np.nan] 
df.loc[33,:] = ['min',85,np.nan , np.nan] 
df.loc[34,:] = ['med',16,np.nan , np.nan] 

Wanted Results: I want to convert this data frame into multiple data frame with define house keys and store as in dictionary dict1. (Result example)

df1 = pd.DataFrame(columns=([1,2,3,4]))
df1.loc[1,:] =  ['place/time','value1','value2','value3']
df1.loc[2,:] = ['Home time1',1, 2, 3]
df1.loc[3,:] = ['Home time2',4, 5, 6]
df1.loc[4,:] = ['Home time3',7, 8, 9]

df2 = pd.DataFrame(columns=([1,2,3,4]))
df2.loc[11,:] =  ['place/time','value1','value2','value3']
df2.loc[12,:] = ['Home time1',11, 12, 13]
df2.loc[13,:] = ['Home time2',14, 15, 16]
df2.loc[14,:] = ['Home time3',17, 18, 19]


df3 = pd.DataFrame(columns=([1,2,3,4]))
df3.loc[21,:] =  ['place/time','value1','value2','value3']

df4 = pd.DataFrame(columns=([1,2,3,4]))
df4.loc[23,:] =  ['place/time','value1','value2','value3']

df5 = pd.DataFrame(columns=([1,2,3,4]))
df5.loc[25,:] =  ['place/time','value1','value2','value3']

df6 = pd.DataFrame(columns=([1,2,3,4]))
df6.loc[27,:] =  ['place/time','value1','value2','value3']
df6.loc[28,:] = ['Home time1',211, 212, 213]
df6.loc[29,:] = ['Home time1',212, 213, 214]

dict1 = {'House -AA':df1, 'House -BB': df2,'House -CC': df3 , 'House -DD':df4, 'House -EE':df5, 'House -FF':df6}

Prepared the code using for loop but I cannot able to split all the data frame in correct way. If I do not break the loop then I will receive an error of (List index out of range). Can you please help me to get similar results as I explained above?

Prepared code idea:

namesplit = lambda x: x.split('-')[0]
postion = 'Home '
rawname = []
for i in df[1]:
    x = namesplit(i)
    if postion == x:
        rawname.append(i)
        
test = {}
for i in range(len(rawname)):
    x = df[df[1]==rawname[i]].index.values
    y = df[df[1]==rawname[i+1]].index.values
    if y == len(df) -9:
        break
    df_1 = df.iloc[x[0]:y[0], :]
    test[rawname[i]] = df_1

Solution

  • You can simply use groupby and cumsum:

    result = {}
    
    for _, i in df.groupby(df[1].str.startswith("Home -").cumsum()):
        name, d = i[1].iat[0], i.iloc[1:]
        result[name] = d[~d[1].isin(["sum","agg","max","min","med"])]
    
    print (result)
    
    {'Home -AA':             1       2       3       4
    1  place/time  value1  value2  value3
    2  Home time1       1       2       3
    3  Home time2       4       5       6
    4  Home time3       7       8       9, 
    'Home -BB':              1       2       3       4
    11  place/time  value1  value2  value3
    12  Home time1      11      12      13
    13  Home time2      14      15      16
    14  Home time3      17      18      19, 
    'Home -CC':              1       2       3       4
    21  place/time  value1  value2  value3, 
    'Home -DD':              1       2       3       4
    23  place/time  value1  value2  value3, 
    'Home -EE':              1       2       3       4
    25  place/time  value1  value2  value3, 
    'Home -FF':              1       2       3       4
    27  place/time  value1  value2  value3
    28  Home time1     211     212     213
    29  Home time1     212     213     214}