I have a csv, which I read using pandas and created a dataframe. The dataframe looks like this:
   description     title
   lorem ipsum     A
   ipsum lorem     A
   dolor sit amet  C
   amet sit dolor  B
It has 1034 rows and 2 columns.
Now I want to remove all rows with duplicate titles from the dataframe, keeping only the first occurrence of each title, so that it looks like this:
   description     title
   lorem ipsum     A
   dolor sit amet  C
   amet sit dolor  B
I found a solution which says to remove duplicates using drop_duplicates(). In my scenario I did:
df.drop_duplicates('title', inplace = True)
When I print df, the index labels still run up to 1033 (as if there were 1034 rows), but the footer displays [967 rows x 2 columns], which suggests it did remove the duplicates. df.shape tells me the same thing, and printing the length of a particular column also gives me 967, e.g. print len(df['title'])
gives me 967.
Is it just that the remaining rows keep their original index labels? Or does it really still have 1034 rows?
What could be the issue?
I am attaching my code:
df = pd.read_csv('latestdata.csv', sep='\t')
df.drop_duplicates('title', inplace=True)
print df
drop_duplicates() works fine, and your code is correct. Here is what is happening: when you create a pandas dataframe and do not specify an index, pandas indexes the rows on its own with a simple increasing integer (0, 1, 2, ...).
When you drop the duplicate rows, their index labels are dropped along with them, leaving gaps; the remaining rows keep their original labels, so the last label can still be 1033 even though only 967 rows are left. Do the following if you want to reset your index:
df.reset_index(drop=True, inplace=True)
Passing drop=True discards the old index instead of keeping it as a new column. Your dataframe will get re-indexed from 0 to 966, and that is what you will see when you print your df.
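To see the behavior end to end, here is a minimal sketch using a small made-up dataframe in place of your CSV (the data matches the four-row example from the question, not your actual file):

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV.
df = pd.DataFrame({
    "description": ["lorem ipsum", "ipsum lorem", "dolor sit amet", "amet sit dolor"],
    "title": ["A", "A", "C", "B"],
})

# Drop rows with duplicate titles; the first occurrence is kept by default.
df.drop_duplicates("title", inplace=True)
print(df.index.tolist())   # [0, 2, 3] -- label 1 is gone, leaving a gap

# Rebuild a consecutive index; drop=True discards the old labels
# instead of keeping them as a new column.
df.reset_index(drop=True, inplace=True)
print(df.index.tolist())   # [0, 1, 2]
```

The gap in the index after drop_duplicates() is exactly why the printed labels can run higher than the row count.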