I have a string produced by running OCR on an image, and I need to find a way to extract the human names from it. The OCR output is:
From: Al Amri, Salim <[email protected]>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <[email protected]>
Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri
There are 4 names in total in the header, and I am required to get the output:
names = 'Al Hajri, Malik', 'Omar, Naif', 'Al Amri, Salim', 'Al Harthy, Mohammed' #desired output
but I have no idea how to extract the names. I have tried RegEx and came up with:
names = re.findall(r'(?i)([A-Z][a-z]+[A-Z][a-z][, ] [A-Z][a-z]+)', string) #regex to find names
which searches for a word starting with a capital letter, then a comma, then another word starting with a capital letter. It is close to the desired result, but it comes out as:
names = ['Amri, Salim', 'Harthi, Mohammed', 'hajri, Malik', 'Omar, Naif', 'Luban, available', 'Mazoon, available'] # actual result
I have thought of maybe using another string that extracts the room names and excludes them from the list, but I have no idea how to implement that idea. I am new to RegEx, so any help will be appreciated. Thanks in advance.
Notwithstanding the excellent RE approach suggested by @JvdV, here's a step-by-step way in which you could achieve this:
OCR = """From: Al Amri, Salim <[email protected]>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <[email protected]>
Ce: Al hajri, Malik <[email protected]>; Omar, Naif <[email protected]>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri"""
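names = []
for line in OCR.split('\n'):
    tokens = line.split()
    # Only the address header lines carry names ('Ce:' is presumably OCR for 'Cc:')
    if tokens and tokens[0] in ['From:', 'To:', 'Ce:']:
        # Multiple recipients on one line are separated by ';'
        parts = line.split(';')
        for i, p in enumerate(parts):
            # The first part starts with the header label, so skip its first token
            # (i==0 is True, i.e. 1); the last token is the <address>, so drop it too
            names.append(' '.join(p.split()[i==0:-1]))
print(names)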
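A variation on the same idea, in case the address part is not always a single whitespace-free token: keep the restriction to the header lines, but take everything in front of each `<...>` as the name. This is only a sketch; the helper name `extract_header_names` and the `example.com` addresses are made up for illustration and are not from the question.

```python
import re

def extract_header_names(text):
    """Return the display names found on From/To/Cc header lines."""
    names = []
    for line in text.splitlines():
        # Only address headers carry names; 'Ce' covers the OCR misread of 'Cc'.
        header = re.match(r'\s*(From|To|Cc|Ce):\s*(.*)', line, flags=re.IGNORECASE)
        if not header:
            continue
        # Each recipient looks like "Surname, Given <address>";
        # capture the text that precedes each "<...>".
        for name in re.findall(r'([^;<>]+?)\s*<[^>]*>', header.group(2)):
            names.append(name.strip())
    return names

# Hypothetical sample: placeholder addresses, shortened body.
sample = """From: Al Amri, Salim <salim@example.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohammed@example.com>
Ce: Al hajri, Malik <malik@example.com>; Omar, Naif <naif@example.com>
Subject: Conference Rooms Booking Details
Room: Luban, available on 26/09/2021. Rate: $4540"""

print(extract_header_names(sample))
```

Because the room lines never start with one of the header labels, 'Luban, available' and 'Mazoon, available' are never considered, so no exclusion list is needed.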