Tags: python, python-3.x, regex, csv, python-re

Regular expression (re.search) is unable to detect errors


I am trying to detect errors in the data of a CSV file using re.search. Rows that do not match the given patterns should go into one list (error), and rows that do match should go into another list (clean).

This is how the data looks in the CSV file:

UES9151GS5  DEN PEK
UES915*GS5  JFK FCO
WYu2010YH8  ORD CAN
HCA3158QA6  ORD ~AN
HCA3158QA6  KUL A;S
HCA3158QA6  0   LHR
HCA3158QA6  A;S ORD
HCA3158QA6  ~AN PVG

and this is my code:

import csv
import re

clean = []
error = []

pid_pattern = '[A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]'
dept_pattern = '[A-Z]{3}'
arr_pattern = '[A-Z]{3}'

with open(r"test.csv") as csvfile:
    reader = csvfile
    for i in reader:
        pid = re.search(pid_pattern, i)
        dept = re.search(dept_pattern, i)
        arr = re.search(arr_pattern, i)
        
        if pid !=None and dept != None and arr != None:
             clean.append(i)
        elif pid == None:
            error.append(i)
        elif dept == None:
            error.append(i)
        elif arr == None:
            error.append(i)

So, after I run the code I get:

clean
['UES9151GS5,DEN,PEK\n',
 'HCA3158QA6,ORD,~AN\n',
 'HCA3158QA6,A;S,A;S,\n',
 'HCA3158QA6,0,LHR\n',
 'HCA3158QA6,A;S,ORD\n',
 'HCA3158QA6,~AN,PVG\n']
error
['UES915*GS5,JFK,FCO\n',
 'WYu2010YH8,ORD,CAN\n']

Apparently the code only checks the first column (pid) and ignores the rest. The expected result should be like this:

clean
['UES9151GS5,DEN,PEK\n']
error
['HCA3158QA6,ORD,~AN\n',
 'HCA3158QA6,A;S,A;S,\n',
 'HCA3158QA6,0,LHR\n',
 'HCA3158QA6,A;S,ORD\n',
 'HCA3158QA6,~AN,PVG\n',
 'UES915*GS5,JFK,FCO\n',
 'WYu2010YH8,ORD,CAN\n']

So far I have been unable to locate the error or find an alternative solution.


Solution

  • The problem is that re.search returns the first match it finds anywhere in the string. Each line read from the file looks like "PID,DEPT,ARR", and a valid PID starts with three uppercase letters, so [A-Z]{3} already matches inside the PID. As a result dept and arr are never None, whatever the second and third columns contain. For example, re.search('[A-Z]{3}', 'HCA3158QA6,ORD,~AN') matches 'HCA'. To prevent this, either split each line into its columns and check each pattern against its own column, or change the patterns so that each one can only match in its own column:

    import csv
    import re
    
    clean = []
    error = []
    
    # Each pattern is anchored to the whole line and uses [^,]+ to skip the
    # other columns, so it can only match in its own column.
    pid_pattern = r'^[A-Z]{3}[0-9]{4}[A-Z]{2}[0-9],[^,]+,[^,]+$'  # only look at the first column
    dept_pattern = r'^[^,]+,[A-Z]{3},[^,]+$'  # only look at the second column
    arr_pattern = r'^[^,]+,[^,]+,[A-Z]{3}$'  # only look at the third column
    
    with open(r"test.csv") as csvfile:
        reader = csvfile
        for i in reader:
            pid = re.search(pid_pattern, i)
            dept = re.search(dept_pattern, i)
            arr = re.search(arr_pattern, i)
    
            # A line is clean only if all three columns match.
            if pid is not None and dept is not None and arr is not None:
                clean.append(i)
            else:
                error.append(i)
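
The other option mentioned above, splitting each row into its columns and checking every column against its own pattern, could be sketched roughly as follows. This is only an illustration under a few assumptions: the file is comma-separated as in the printed output, the helper name split_rows is made up for this example, and io.StringIO stands in for the real file. Note that it collects rows as lists of fields rather than raw lines.

```python
import csv
import io
import re

# Patterns from the question, one per column (pid, dept, arr).
PATTERNS = [
    re.compile(r'[A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]'),
    re.compile(r'[A-Z]{3}'),
    re.compile(r'[A-Z]{3}'),
]


def split_rows(lines):
    """Return (clean, error) lists of rows, validating each column on its own.

    split_rows is a hypothetical helper name, not part of the original code.
    """
    clean, error = [], []
    for row in csv.reader(lines):
        # A row is clean only if it has exactly three fields and every
        # field matches its pattern from start to end.
        ok = len(row) == len(PATTERNS) and all(
            p.fullmatch(field) for p, field in zip(PATTERNS, row)
        )
        (clean if ok else error).append(row)
    return clean, error


# A few rows copied from the question; StringIO stands in for test.csv.
sample = io.StringIO(
    "UES9151GS5,DEN,PEK\n"
    "UES915*GS5,JFK,FCO\n"
    "HCA3158QA6,ORD,~AN\n"
    "HCA3158QA6,0,LHR\n"
)
clean, error = split_rows(sample)
print(clean)
print(error)
```

Using fullmatch on each field means a pattern cannot accidentally match inside a different column, and the length check sends rows with extra or missing fields to error as well. With a real file, pass the object returned by open("test.csv", newline="") instead of the StringIO.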
    

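    Alternatively, combine the three patterns into a single expression, so that one search validates all three columns together:

    [A-Z]{3}[0-9]{4}[A-Z]{2}[0-9],[A-Z]{3},[A-Z]{3}

    If you use re.match, you can add capture groups to pull out each column as well. Keep in mind that re.match only anchors the start of the line; use re.fullmatch on the line with its newline stripped, or add a trailing $, if extra text after the third column should count as an error:

    ([A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]),([A-Z]{3}),([A-Z]{3})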
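
As a rough sketch of how the grouped pattern might be used, the snippet below matches a whole line and returns the three captured columns. The helper name parse_line is hypothetical, and re.fullmatch is used here instead of re.match so that trailing junk after the third column is rejected.

```python
import re

# Combined pattern from the answer; each pair of parentheses captures one column.
ROW_PATTERN = re.compile(r'([A-Z]{3}[0-9]{4}[A-Z]{2}[0-9]),([A-Z]{3}),([A-Z]{3})')


def parse_line(line):
    """Return (pid, dept, arr) for a valid line, or None for an invalid one.

    parse_line is a hypothetical helper name used only for illustration.
    """
    # Strip the line ending first so fullmatch sees only the data.
    m = ROW_PATTERN.fullmatch(line.rstrip('\r\n'))
    return m.groups() if m else None


print(parse_line('UES9151GS5,DEN,PEK\n'))  # ('UES9151GS5', 'DEN', 'PEK')
print(parse_line('HCA3158QA6,ORD,~AN\n'))  # None
```

A line can then be appended to clean when parse_line returns a tuple and to error when it returns None.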