python, pandas, dataframe, nan, xlsx

Grouping and splitting data from a single column of a Pandas DataFrame based on datatype (float, NaN, string)


I tried to create a small reproducible example of my specific issue; it is attached at the end of the question.

I am currently reading 27 Excel files.

All of them are formatted in the same way (the columns appear to be consistent across files).

I am appending this data into a single DataFrame of 9828 rows and three columns (a sketch of this step follows the list below).

One column, called "Target", should contain only floats or integers.

In practice, the cells that make up the Target column are read as:

  1. Floats
  2. In only one case, an empty cell is read as a string
  3. In other cases, empty cells are considered None
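
For reference, the loading and appending step looks roughly like this (the folder name and file pattern are placeholders, not my actual paths):

    import glob
    import pandas as pd

    # read the 27 workbooks and stack them into one DataFrame
    files = sorted(glob.glob("data/*.xlsx"))
    df_t = pd.concat([pd.read_excel(f) for f in files], ignore_index=True)

    df_t.info()   # 9828 rows, 3 columns; Target shows 9398 non-null entries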

In order to understand which rows of the DataFrame contain a float, None, or string in that column, I wrote a very inefficient check that splits the DataFrame based on the datatype:

    # df_t is a DataFrame with 3 columns, one of which is called Target.
    # The Target column holds 9398 non-null objects;
    # the other two columns hold 9828 non-null objects each.

    df_string=pd.DataFrame()
    df_float=pd.DataFrame()
    df_null=pd.DataFrame()

    for i in range(len(df_t.Target)):
        if type(df_t.Target.loc[i]) == str:
            df_string = df_string.append(df_t.loc[i])
        if type(df_t.Target.loc[i]) == int or type(df_t.Target.loc[i]) == float:
            df_float = df_float.append(df_t.loc[i])
        else:
            df_null = df_null.append(df_t.loc[i])

I can't understand why:

  1. There is a duplicate: df_string and df_null contain the same value.

  2. The df_float DataFrame still contains NaN values.

My expected output was:

  1. Only floats and ints in one DataFrame
  2. Only NaN in df_null
  3. Only strings in the df_string DataFrame

The most elegant way to extract NaN values that I found was in this answer, but I think my issue is slightly different:

df.loc[~df.index.isin(df.dropna().index)]
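
For a single column, I could also mask directly on that column (same df_t / Target names as above), but that still does not separate the strings from the floats:

    df_null = df_t[df_t['Target'].isna()]        # rows where Target is NaN
    df_notnull = df_t[df_t['Target'].notna()]    # all remaining rows (floats and strings mixed)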

I also tried to find resources and read the documentation to understand why pd.read_excel() behaves this way, but I didn't find anything that explains it.
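
For reference, the options that control how pd.read_excel() treats blank cells are na_values and keep_default_na; the file name below is just a placeholder:

    df = pd.read_excel(
        "report_01.xlsx",        # placeholder file name
        keep_default_na=True,    # blank cells come back as NaN (the default)
        na_values=["", " "],     # also treat empty / whitespace-only strings as NaN
    )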

Here is the code I tried to write for a reproducible example:

    import numpy as np
    import pandas as pd

    # floats, NaN and one string value, all ending up in a single "Target" column
    a = pd.Series(np.random.uniform(1.00, 100.00, 9000), name="Target")
    b = pd.Series(np.full(400, np.nan), name="Target")
    c = pd.Series(["None"], name="Target")   # the one cell that comes back as a string
    start_dict = {"Target": [0, 2, 3.5]}
    df_t = pd.DataFrame(start_dict)
    df_t = pd.concat([df_t, a.to_frame()], axis=0, ignore_index=True)
    df_t = pd.concat([df_t, b.to_frame()], axis=0, ignore_index=True)
    df_t = pd.concat([df_t, c.to_frame()], axis=0, ignore_index=True)
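
With this synthetic frame (or the real one), the mix of Python types in the column can be inspected quickly:

    print(df_t['Target'].dtype)                        # object
    print(df_t['Target'].apply(type).value_counts())   # how many float vs. str entries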

Here is my actual output when I print df_string and df_null:

[screenshot of the df_string and df_null output]

Here, highlighted in yellow, is the reference cell in the xlsx file:

[screenshot of the xlsx file with the relevant cell highlighted]


Solution

  • type() checks are not reliable here. Since the column has mixed types, pandas stores it with the object dtype, and NaN is itself a float, so it passes your float check and ends up in df_float; in addition, the second if in your loop is not an elif, so every string row falls into the else branch as well and is appended to df_null, which explains the duplicate. However, you can do the split in another way.

    First create the empty DataFrames:

    df_string = pd.DataFrame()
    df_float = pd.DataFrame()
    df_null = pd.DataFrame()
    

    Now you need a function that categorizes the values of your original DataFrame. A slightly modified version of this question should do:

    def categorize(x):
        # the three DataFrames live at module level, so declare them global;
        # pd.concat returns a new DataFrame, so assign the result back
        global df_string, df_float, df_null
        if pd.isna(x):
            df_null = pd.concat([df_null, pd.DataFrame([x])], ignore_index=True)
            return
        try:
            float(x)
        except ValueError:
            df_string = pd.concat([df_string, pd.DataFrame([x])], ignore_index=True)
        else:
            df_float = pd.concat([df_float, pd.DataFrame([x])], ignore_index=True)
    

    Now all you have to do is apply the function to the Target column.

    df_t['Target'].apply(categorize)
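
    As a side note, the same split can also be done without apply, using vectorized masks (a sketch reusing the df_t / Target names from the question; pd.to_numeric counts anything convertible to a number as "float", just like the float(x) check above):

    is_null = df_t['Target'].isna()
    is_float = pd.to_numeric(df_t['Target'], errors='coerce').notna()
    is_string = ~is_null & ~is_float

    df_null = df_t[is_null]
    df_float = df_t[is_float]
    df_string = df_t[is_string]

    Note that this variant keeps the full rows of df_t rather than just the Target values.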