Search code examples
pythonpandasdataframekeyerror

KeyError: "None of [['', '']] are in the [columns]" (Pandas Dataframe)


I am trying to write a function that takes in a dataframe, that has some columns that are within the same genre, and some columns are not. An example of the columns would be:

['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 
 't_energy1', 't_energy2']

I am tryiong to generate two new dataframes, one with the columns without duplicates, and one with only the duplicate columns with the code below:

# Function that takes in a dataframe and returns new dataframes with all the sub-dataframes

def sub_dataframes(dataframe):

    copy = dataframe.copy()                  # To avoid SettingWithCopyWarning

    # Iterate through all the columns of the df
    for (col_name, col_data) in copy.iteritems():

        temp = str(col_name)
        rest = copy.iloc[:, 1:]
        new_df = [[]]

        # If it's not a duplicate, we just add it to the new df
        if len(temp) < 6:
            new_df[temp] = copy[col_data]

        # If the length of the column name is greater than or equal to 6, we know it's a duplicate
        if len(temp) >= 6:
            stripped = temp.rstrip(temp[2:])

            # Second for-loop to check the next column
            for (col_name2, col_data2) in rest.iteritems():
                temp2 = str(col_name2)
                rest2 = rest.iloc[:, 1:]
                only_dups = [[]]

                if len(temp2) >= 6:
                    stripped2 = temp2.rstrip(temp2[2:])

                    # Compare the two column names (without the integer 0,1, or 2)
                    if stripped[:-1] == stripped2[:-1]:

                        # Create new df of the two columns
                        only_dups[stripped] = col_data
                        only_dups[stripped2] = col_data2

                        # Third for-loop to check the remaining columns
                        for (col_name3, col_data3) in rest2.iteritems():
                            temp3 = str(col_name3)

                            if len(temp3) >= 6:
                                stripped3 = temp3.rstrip(temp3[2:])

                                # Compare the two column names (without the integer 0,1, or 2)
                                if stripped2[:-1] == stripped3[:-1]:
                                    only_dups[stripped3] = col_data3

    print("Original:\n{}\nWithout duplicates:\n{}\nDuplicates:\n{}".format(copy, new_df, only_dups))


sub_dataframes(df)

When I run this code, I get this error:

KeyError: "None of [Int64Index([ 22352, 106534,  23608,   8655,  49670, 101988,   9136, 
141284,\n             28564,  14262,\n            ...\n             76690, 150965, 
143106, 142370,  68004,  33980, 110832,  14491,\n            123511,   6207],\n           
dtype='int64', length=2833)] are in the [columns]"

I tried looking at other questions here on StackOverflow, to see if I can fix the problem, but all I've understood so far, is that I cannot add columns the way I have now, with new_df[temp] = copy[col_data] or only_dups[stripped] = col_data, but I can't seem to figure out how to properly create new columns. How do I add a new column based on the variables I have now? Is it possible, or do I have to re-write the code so that it doesn't have so many for-loops?

EDIT

An example of the output I want would be:

Original:
        id    t_dur0    t_dur1    t_dur2    ... 
0      22352  292720  293760.0  292733.0  
1     106534  213760  181000.0  245973.0 
2      23608  157124  130446.0  152450.0  
3       8655  127896  176351.0  166968.0  
4      49670  210320  226253.0  211880.0  
...      ...     ...       ...       ...

Without duplicates:
        id  
0      22352
1     106534
2      23608
3       8655
4      49670
...      ..

Duplicates: 
      t_dur0  t_dur1    t_dur2   
0     292720  293760.0  292733.0  
1     213760  181000.0  245973.0 
2     157124  130446.0  152450.0  
3     127896  176351.0  166968.0  
4     210320  226253.0  211880.0  
...      ...     ...       ... 

Solution

  • IIUC:

    def sub_dataframes(dataframe):
      # extract common prefix -> remove trailing digits
      cols = dataframe.columns.str.replace(r'\d*$', '', regex=True) \
                      .to_series().value_counts()
    
      # split columns
      unq_cols = cols[cols == 1].index
      dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
    
      return (dataframe[unq_cols], dataframe[dup_cols])
    
    df1, df2 = sub_dataframes(df)
    

    Output:

    >>> df1
           id
    0   22352
    1  106534
    2   23608
    3    8655
    4   49670
    
    >>> df2
       t_dur0    t_dur1    t_dur2
    0  292720  293760.0  292733.0
    1  213760  181000.0  245973.0
    2  157124  130446.0  152450.0
    3  127896  176351.0  166968.0
    4  210320  226253.0  211880.0