Search code examples
pythonpandas

python pandas remove duplicate columns


What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

import pandas as pd

df=pd.read_table(fname)

The column names are:

Time, Time Relative, N2, Time, Time Relative, H2, etc...

All the Time and Time Relative columns contain the same data. I want:

Time, Time Relative, N2, H2

All my attempts at dropping, deleting, etc such as:

df=df.T.drop_duplicates().T

Result in uniquely valued index errors:

Reindexing only valid with uniquely valued index objects

Sorry for being a Pandas noob. Any Suggestions would be appreciated.


Additional Details

Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)

data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):

Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005    
2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005    
2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005    
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005    
2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005

Solution

  • Here's a one line solution to remove columns based on duplicate column names:

    df = df.loc[:,~df.columns.duplicated()].copy()
    

    How it works:

    Suppose the columns of the data frame are ['alpha','beta','alpha']

    df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False then the column name is unique up to that point, if it is True then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True].

    Pandas allows one to index using boolean values whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True])

    Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

    The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.

    Note: the above only checks columns names, not column values.

    To remove duplicated indexes

    Since it is similar enough, do the same thing on the index:

    df = df.loc[~df.index.duplicated(),:].copy()
    

    To remove duplicates by checking values without transposing

    Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.

    df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
    

    This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:

    # create a large(ish) dataframe
    ldf = pd.DataFrame(np.random.randint(0,100,size= (736334,1312))) 
    
    
    #to see size in gigs
    #ldf.memory_usage().sum()/1e9 #it's about 3 gigs
    
    # duplicate a column
    ldf.loc[:,'dup'] = ldf.loc[:,101]
    
    # take out duplicated columns by values
    ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()