pandas drop-duplicates

Why does pandas DataFrame.duplicated() return True when row values are duplicated?


To my knowledge, "ValueError: cannot reindex on an axis with duplicate labels" means that two or more index labels (or column labels) share the same name, so pandas cannot decide which rows or columns to use.

However, when I create a DataFrame and assign the same values to every row, with unique labels, rows still seem to be reported as duplicates.

import numpy as np
import pandas as pd
test = pd.DataFrame(data=np.arange(12).reshape(4, 3), index=np.arange(4), columns=np.arange(3))
test.duplicated()

returns False for all indices,

while

test=pd.DataFrame(data=np.zeros(12).reshape(4,3),index=np.arange(4),columns=np.arange(3))
test.duplicated()

returns True for every index except the first.

What am I misunderstanding about the behavior of pandas DataFrames?

Thanks.

I want to know my misunderstanding ^_^


Solution

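  • duplicated() compares row values, not index or column labels, so it is unrelated to the "cannot reindex on an axis with duplicate labels" error. It raises nothing and simply returns a boolean Series. By default (keep='first'), the first occurrence of each set of identical rows is marked False and every later occurrence is marked True. In other words, the first occurrence is treated as the original and the rest as its duplicates.

    It returns False for every row in the first example because no row is repeated. In the second example every row is all zeros, so the first row is treated as the original (hence False) and all the others as duplicates (hence True).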
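
As a minimal sketch of the behavior described above, the snippet below rebuilds the all-zeros frame from the question and compares the three values of the keep parameter. It also checks the index labels separately, since label duplicates are what the reindex error is about. The variable names (df, first, last, every, label_dupes, deduped) are chosen here for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Four identical rows of zeros, each with a unique index label (0..3).
df = pd.DataFrame(np.zeros((4, 3)), index=np.arange(4), columns=np.arange(3))

# keep="first" (the default) leaves the first occurrence unmarked.
first = df.duplicated(keep="first")
# keep="last" leaves the last occurrence unmarked instead.
last = df.duplicated(keep="last")
# keep=False marks every member of a duplicate group.
every = df.duplicated(keep=False)

# Label duplicates are a separate check; the labels here are all unique.
label_dupes = df.index.duplicated()

print(first.tolist())        # [False, True, True, True]
print(last.tolist())         # [True, True, True, False]
print(every.tolist())        # [True, True, True, True]
print(label_dupes.tolist())  # [False, False, False, False]

# drop_duplicates() keeps only the rows that duplicated() marks False.
deduped = df.drop_duplicates()
print(len(deduped))          # 1
```

If the goal is to detect repeated labels rather than repeated row values, df.index.duplicated() (or df.columns.duplicated()) is the check to use.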