Tags: python, regex, python-re

How to extract names from unordered data string using regular expression in python?


I need to extract the names of the people from the following sentence.


Input: BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM CITATION: 1953 AIR 28 1953 SCR 197

Output: MEHR CHAND MAHAJAN, MEHR CHAND MAHAJAN, SUDHI RANJAN DAS, VIVIAN BOSE, GHULAM HASAN


For extracting the name from the first part of the sentence, I used the following code.

import re

bench = re.search(r'BENCH: (.*?) BENCH', contents)
if bench:
    bench = bench.group(1)
    # Split on ', ' (comma plus space) so the reordered name has no leading space
    bench = ' '.join(reversed(bench.split(', ')))
    print(bench)

Output: MEHR CHAND MAHAJAN


Solution

  • You could use this regex to match the names in your input data:

    ((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)
    

    This matches a surname (\w+), followed by a comma and a space, then one or two given names separated by a space (\w+(?: \w+)?). A positive lookahead then asserts that the name is followed by one of BENCH:, CITATION:, or another word and a comma (\w+,), which is the start of the next name. The lookahead does not consume that text, so the next match can begin there.

    names = re.findall(r'((?:\w+), (?:\w+(?: \w+)?))(?= BENCH:| CITATION:| \w+,)', contents)
    

    For your sample data, this yields:

    ['MAHAJAN, MEHR CHAND', 'MAHAJAN, MEHR CHAND', 'DAS, SUDHI RANJAN', 'BOSE, VIVIAN', 'HASAN, GHULAM']
    

    This list can then be reformatted as you desire:

    names = ', '.join(' '.join(reversed(n.split(', '))) for n in names)
    

    Output:

    'MEHR CHAND MAHAJAN, MEHR CHAND MAHAJAN, SUDHI RANJAN DAS, VIVIAN BOSE, GHULAM HASAN'
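
Putting the steps above together, here is a minimal self-contained sketch. The function name `reorder_names` and the constant `NAME_PATTERN` are hypothetical, chosen only for illustration. It uses the same pattern as above but with two capture groups, so that `re.findall` returns (surname, given names) tuples directly, and it assumes each person has at most two given names, as in the sample data.

```python
import re

# Same pattern as above, but with separate capture groups for the surname
# and the given names. Assumes at most two given names per person.
NAME_PATTERN = re.compile(r'(\w+), (\w+(?: \w+)?)(?= BENCH:| CITATION:| \w+,)')


def reorder_names(contents):
    """Return the names in 'GIVEN SURNAME' order, joined by ', '."""
    # With two groups, findall yields (surname, given) tuples.
    pairs = NAME_PATTERN.findall(contents)
    return ', '.join(f'{given} {surname}' for surname, given in pairs)


contents = ('BENCH: MAHAJAN, MEHR CHAND BENCH: MAHAJAN, MEHR CHAND '
            'DAS, SUDHI RANJAN BOSE, VIVIAN HASAN, GHULAM '
            'CITATION: 1953 AIR 28 1953 SCR 197')

result = reorder_names(contents)
print(result)
```

For the sample input this prints the same string as the output shown above. A person with three or more given names would not be matched correctly by this pattern, so the optional group would need to be widened for such data.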