I am trying to detect errors in the data of a CSV file using re.search.
Rows that do not match the given patterns should be sent to one list (error), while rows that do match should be sent to another list (clean).
This is how the data looks in the CSV file:
UES9151GS5 DEN PEK
UES915*GS5 JFK FCO
WYu2010YH8 ORD CAN
HCA3158QA6 ORD ~AN
HCA3158QA6 KUL A;S
HCA3158QA6 0 LHR
HCA3158QA6 A;S ORD
HCA3158QA6 ~AN PVG
and this is my code:
    import csv
    import re

    clean = []
    error = []

    pid_pattern = '[A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]'
    dept_pattern = '[A-Z]{3}'
    arr_pattern = '[A-Z]{3}'

    with open(r"test.csv") as csvfile:
        reader = csvfile
        for i in reader:
            pid = re.search(pid_pattern, i)
            dept = re.search(dept_pattern, i)
            arr = re.search(arr_pattern, i)
            if pid != None and dept != None and arr != None:
                clean.append(i)
            elif pid == None:
                error.append(i)
            elif dept == None:
                error.append(i)
            elif arr == None:
                error.append(i)
So, after I run the code I get:
clean
['UES9151GS5,DEN,PEK\n',
'HCA3158QA6,ORD,~AN\n',
'HCA3158QA6,A;S,A;S,\n',
'HCA3158QA6,0,LHR\n',
'HCA3158QA6,A;S,ORD\n',
'HCA3158QA6,~AN,PVG\n']
error
['UES915*GS5,JFK,FCO\n',
'WYu2010YH8,ORD,CAN\n']
Apparently the code only checks the first column (pid) and ignores the rest. The expected result should be like this:
clean
['UES9151GS5,DEN,PEK\n']
error
['HCA3158QA6,ORD,~AN\n',
'HCA3158QA6,A;S,A;S,\n',
'HCA3158QA6,0,LHR\n',
'HCA3158QA6,A;S,ORD\n',
'HCA3158QA6,~AN,PVG\n',
'UES915*GS5,JFK,FCO\n',
'WYu2010YH8,ORD,CAN\n']
So far I have been unable to locate the error or find an alternative solution.
The problem is that re.search returns the first match it finds anywhere in the string. Because each line you read has the form "PID,DEPT,ARR", the pattern [A-Z]{3} already matches the three letters at the start of the PID, so dept and arr match even when the second and third columns are invalid. To prevent this, either split each line into its columns and test every column against its own pattern (I'm not sure how to do this), or change the regexes so that each one can only match in its own column.
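A quick check makes the cause visible. This is only a small sketch using one line from the question's output; the variable names `line`, `match` and `found` are chosen for illustration.

```python
import re

line = "HCA3158QA6,ORD,~AN\n"
dept_pattern = "[A-Z]{3}"

# re.search scans the whole line and returns the first match anywhere in it,
# so the "dept" pattern is satisfied by the letters at the start of the PID.
match = re.search(dept_pattern, line)
found = match.group(0)

print(found, match.start())
```

The match starts at index 0, inside the PID column, which is why the second and third columns are never really tested.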
    import re

    clean = []
    error = []

    # Each pattern is anchored with ^ and $ and uses [^,]+ for the other
    # columns, so it can only match in its own column.
    pid_pattern = r'^[A-Z]{3}[0-9]{4}[A-Z]{2}[0-9],[^,]+,[^,]+$'  # only look at the first column
    dept_pattern = r'^[^,]+,[A-Z]{3},[^,]+$'  # only look at the second column
    arr_pattern = r'^[^,]+,[^,]+,[A-Z]{3}$'  # only look at the third column

    with open(r"test.csv") as csvfile:
        for i in csvfile:
            pid = re.search(pid_pattern, i)
            dept = re.search(dept_pattern, i)
            arr = re.search(arr_pattern, i)
            if pid is not None and dept is not None and arr is not None:
                clean.append(i)
            else:
                # at least one column failed its pattern
                error.append(i)
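The other option, splitting out the columns and checking each one separately, could look roughly like the sketch below. It assumes the file is comma-separated with exactly three columns; `PATTERNS`, `split_rows` and the in-memory `sample` are hypothetical names used only for illustration. `csv.reader` does the splitting and `re.fullmatch` requires the entire column to match its pattern.

```python
import csv
import io
import re

# Per-column patterns, taken from the question.
PATTERNS = [
    r"[A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]",  # pid
    r"[A-Z]{3}",                        # dept
    r"[A-Z]{3}",                        # arr
]

def split_rows(lines):
    """Return (clean, error); each row is a list of column strings."""
    clean, error = [], []
    for row in csv.reader(lines):
        # A row is clean only if it has the right number of columns
        # and every column matches its own pattern in full.
        ok = len(row) == len(PATTERNS) and all(
            re.fullmatch(pattern, col) for pattern, col in zip(PATTERNS, row)
        )
        (clean if ok else error).append(row)
    return clean, error

# In-memory stand-in for test.csv (assumed sample, same shape as the question's data).
sample = io.StringIO(
    "UES9151GS5,DEN,PEK\n"
    "UES915*GS5,JFK,FCO\n"
    "HCA3158QA6,0,LHR\n"
    "HCA3158QA6,ORD,~AN\n"
)
clean, error = split_rows(sample)
print(clean)
print(error)
```

With a real file you would pass the open file object instead, for example `with open("test.csv", newline="") as f: clean, error = split_rows(f)`. Note that this version stores each row as a list of columns rather than the raw line.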
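Alternatively, you could combine the three patterns into a single regex for the whole row: [A-Z]{3}[0-9]{4}[A-Z]{2}[0-9],[A-Z]{3},[A-Z]{3} (anchor it with ^ and $, or use re.fullmatch, so that extra characters are rejected).
If you want to use match with capturing groups to pull out the individual columns, the regex becomes: ([A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]),([A-Z]{3}),([A-Z]{3})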
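As a minimal sketch of the single-regex variant with capturing groups (the names `ROW_PATTERN` and `parse_row` are made up for this example), the whole line is validated in one step and the three columns come back as a tuple:

```python
import re

# One pattern for the whole row; each pair of parentheses captures one column.
ROW_PATTERN = re.compile(r"([A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]),([A-Z]{3}),([A-Z]{3})")

def parse_row(line):
    """Return (pid, dept, arr) if the whole line is valid, otherwise None."""
    # fullmatch rejects lines with anything before or after the pattern.
    m = ROW_PATTERN.fullmatch(line.rstrip("\r\n"))
    return m.groups() if m else None

good = parse_row("UES9151GS5,DEN,PEK\n")
bad = parse_row("HCA3158QA6,A;S,ORD\n")
print(good, bad)
```

A line goes to clean when `parse_row` returns a tuple and to error when it returns None.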