
Removing rows with a duplicate column value from a pandas dataframe (Python)


I have a CSV file, which I read into a pandas dataframe. The dataframe looks like this:

description     title
lorem ipsum       A
ipsum lorem       A
dolor sit amet    C
amet sit dolor    B

It has 1034 rows and 2 columns.

Now I want to remove all the rows with duplicate titles from the dataframe and have the dataframe like this:

description     title
lorem ipsum       A
dolor sit amet    C
amet sit dolor    B

I found a solution which says to remove duplicates using drop_duplicates(). In my scenario I did:

df.drop_duplicates('title', inplace = True)

When I print df it still shows 1034 rows, but at the end it displays [967 x 2], which means it has 967 rows and the duplicates were removed. df.shape tells me the same thing. But when I print the dataframe or iterate over it, the removal seems not to have worked. In fact, even the length of a single column is 967: len(df['title']) gives me 967. Is it just that the dataframe indices are still numbered as before? Or does it really still have 1034 rows? What could be the issue?

I am attaching my code:

import pandas as pd

df = pd.read_csv('latestdata.csv', sep='\t')
df.drop_duplicates('title', inplace=True)
print(df)

Solution

  • drop_duplicates works fine, and so does your code. Here is what's happening: when you create a pandas dataframe without specifying an index, pandas labels the rows on its own with a simple increasing integer, starting at 0.

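    When you drop the duplicates, the dropped rows take their index labels with them, and the remaining rows keep their original labels. The printed index therefore has gaps and can still run up to 1033 even though only 967 rows are left. Do the following if you want to reset your index: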

    df.reset_index(drop=True, inplace=True)  # drop=True discards the old index instead of keeping it as a column
    

    Your dataframe will then be re-indexed from 0, so the last index you see when you print df is 966 (967 rows in total).
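
To see the effect on a small scale, here is a minimal sketch that rebuilds the four-row sample from the question in memory instead of reading latestdata.csv. The variable names deduped and renumbered are only illustrative, and the sketch returns new dataframes rather than using inplace=True so that both states can be compared.

```python
import pandas as pd

# Small stand-in for the CSV in the question (the real file has 1034 rows).
df = pd.DataFrame(
    {
        "description": ["lorem ipsum", "ipsum lorem", "dolor sit amet", "amet sit dolor"],
        "title": ["A", "A", "C", "B"],
    }
)

# Keep the first row for each title; surviving rows keep their original labels.
deduped = df.drop_duplicates(subset="title")
print(deduped.index.tolist())     # labels 0, 2, 3: label 1 is gone, leaving a gap
print(len(deduped))               # 3 rows remain

# Renumber from 0; drop=True discards the old labels instead of storing them in a column.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # labels 0, 1, 2
```

The row count and the index labels are independent: len() and df.shape report how many rows exist, while the numbers printed on the left are only labels carried over from the original dataframe.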