Search code examples
pythonpandasdataframeerror-handlingrow

Detecting erroneous consecutive rows without change in pandas dataframe


Let's say that I have the following DataFrame:

# Import pandas library
import pandas as pd
  
# initialize list of lists
data = [['tom', 10], ['tom', 10], ['sam', 23], ['sam', 23], ['sam', 23], ['alice', 23], ['alice', 30], ['alice', 30]]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
  
# print dataframe.
df

    Name  Age
0    tom   10
1    tom   10
2    sam   23
3    sam   23
4    sam   23
5  alice   23
6  alice   30
7  alice   30

A data error occurs in the data where the 'Age' value does not change with a new consecutive 'Name' entry. In other words, row 5 contains a data entry error. How could I go about detecting and printing the rows in which this occurs? Thank you.


Solution

  • You can compare successive values:

    # are the successive Age identical?
    m1 = df['Age'].diff().eq(0)
    # are the successive Name different?
    m2 = df['Name'].ne(df['Name'].shift())
    
    # show the rows where both conditions are True
    df[m1&m2]
    

    output:

        Name  Age
    5  alice   23
    

    To assign a value:

    df.loc[m1&m2, 'comment'] = 'ERROR'
    
        Name  Age comment
    0    tom   10     NaN
    1    tom   10     NaN
    2    sam   23     NaN
    3    sam   23     NaN
    4    sam   23     NaN
    5  alice   23   ERROR
    6  alice   30     NaN
    7  alice   30     NaN
    

    To map two values:

    import numpy as np
    df['error'] = np.where(m1&m2, 'yes', 'no')
    
        Name  Age error
    0    tom   10    no
    1    tom   10    no
    2    sam   23    no
    3    sam   23    no
    4    sam   23    no
    5  alice   23   yes
    6  alice   30    no
    7  alice   30    no