Let's say that I have the following DataFrame:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['tom', 10], ['sam', 23], ['sam', 23], ['sam', 23], ['alice', 23], ['alice', 30], ['alice', 30]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
# print dataframe.
df
Name Age
0 tom 10
1 tom 10
2 sam 23
3 sam 23
4 sam 23
5 alice 23
6 alice 30
7 alice 30
A data error occurs in the data where the 'Age' value does not change with a new consecutive 'Name' entry. In other words, row 5 contains a data entry error. How could I go about detecting and printing the rows in which this occurs? Thank you.
You can compare successive values:
# are the successive Age identical?
m1 = df['Age'].diff().eq(0)
# are the successive Name different?
m2 = df['Name'].ne(df['Name'].shift())
# show the rows where both conditions are True
df[m1&m2]
output:
Name Age
5 alice 23
To assign a value:
df.loc[m1&m2, 'comment'] = 'ERROR'
Name Age comment
0 tom 10 NaN
1 tom 10 NaN
2 sam 23 NaN
3 sam 23 NaN
4 sam 23 NaN
5 alice 23 ERROR
6 alice 30 NaN
7 alice 30 NaN
To map two values:
import numpy as np
df['error'] = np.where(m1&m2, 'yes', 'no')
Name Age error
0 tom 10 no
1 tom 10 no
2 sam 23 no
3 sam 23 no
4 sam 23 no
5 alice 23 yes
6 alice 30 no
7 alice 30 no