Search code examples
pythonregexpycharmocr

How to extract a human name in a string python


I have a string out of an OCR'ed image, and I need to find a way to extract human names from it. here is the image required to OCR, which comes out as:

From: Al Amri, Salim <[email protected]>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <[email protected]>

Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: $4540

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500

Room: Nizwa. Available on 13/12/2022. Rate: $1200

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri

There are 4 names in total in the heading, and i am required to get the output:

names = 'Al Hajri, Malik', 'Omar, Naif', 'Al Amri, Salim', 'Al Harthy, Mohammed' #desired output

but I have no idea how to extract the names. I have tried RegEx and came up with:

names = re.findall(r'(?i)([A-Z][a-z]+[A-Z][a-z][, ] [A-Z][a-z]+)', string) #regex to find names

which searches for a Capital letter, then a comma, then another word starting with a capital letter. it is close to the desired result but it comes out as:

names = ['Amri, Salim', 'Harthi, Mohammed', 'hajri, Malik', 'Omar, Naif', 'Luban, available', 'Mazoon, available'] #acutal result

I have thought of maybe using another string that extracts the room names and excludes them from the list, but i have no idea how to implement that idea. i am new to RegEx, so any help will be appreciated. thanks in advance


Solution

  • Notwithstanding the excellent RE approach suggested by @JvdV, here's a step-by-step way in which you could achieve this:

    OCR = """From: Al Amri, Salim <[email protected]>
    
    Sent: 25 August 2021 17:20
    
    To: Al Harthi, Mohammed <[email protected]>
    
    Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>
    
    Subject: Conference Rooms Booking Details
    
    Dear Mohammed,
    
    As per our last discussion these are the available conference rooms available for booking along
    with their rates for full day:
    
    Room: Luban, available on 26/09/2021. Rate: $4540
    
    Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
    Room: Dhofar. Available on 11/11/2021. Rate: $2500
    
    Room: Nizwa. Available on 13/12/2022. Rate: $1200
    
       
    
    Please let me know which ones you are interested so we go through more details.
    Best regards,
    
    Salim Al Amri"""
    
    names = []
    for line in OCR.split('\n'):
        tokens = line.split()
        if tokens and tokens[0] in ['From:', 'To:', 'Ce:']: # Ce or Cc ???
            parts = line.split(';')
            for i, p in enumerate(parts):
                names.append(' '.join(p.split()[i==0:-1]))
    print(names)