Tags: python, algorithm, logic, extract, text-extraction

Extract data between two lines from text file


Say I have hundreds of text files like this example:

NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO 

THIS is
 a PRETTY
 long sentence

 without ANY structure 

HOBBIES 
//..etc..

NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.

I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.

This is what I came up with:

with open("test.txt") as f:
    lines = f.readlines()

bio = ""  # must exist before lines are appended to it
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1]  # In all files, the name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2]  # The date is also always two lines after the key.
    elif "BIO" in line:
        # The length of the other data varies, so scan ahead (without running past the end of the file).
        for x in range(line_number + 1, min(line_number + 20, len(lines))):
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        pass  # ...

This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.

I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, BIO would store everything until HOBBIES, and so on. I intend to clean up the values and remove extra blank lines later.

Is it possible?

Edit: While I was reading through the answers, I realized I forgot a really significant detail: the keys will sometimes be repeated (in the same order).

That is, a single text file can contain more than one person, so a list of persons should be created. The key NAME signals the start of a new person.


Solution

  • I did it by storing everything in a dictionary; see the code below.

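    # The dictionary keys are the header lines with all whitespace removed.
    dict_text = {"NAME": [], "DATEOFBIRTH": [], "BIO": []}
    location = None  # key of the section currently being read

    with open("test.txt") as f:
        for line in f:
            if "NAME" in line or "DATE OF BIRTH" in line or "BIO" in line:
                # Header line: switch sections, e.g. "DATE OF BIRTH" -> "DATEOFBIRTH".
                location = "".join(line.split())
            elif location is not None:
                # Content line: store it under the current section.
                dict_text[location].append(line.rstrip("\n"))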
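
    To cover the edit in the question (several persons in one file, each starting with NAME), the dictionary idea can be extended to build a list with one dictionary per person. This is only a sketch under some assumptions: parse_people and KEYS are hypothetical names, the tuple of keys is assumed to be complete, and a header is assumed to be a line that consists of the key alone (ignoring surrounding whitespace).

```python
# Assumed full list of header lines; extend it with the other keys.
KEYS = ("NAME", "DATE OF BIRTH", "BIO", "HOBBIES")


def parse_people(lines, keys=KEYS, first_key="NAME"):
    """Group lines under each key; first_key starts a new person."""
    people = []
    current = None  # key whose section is currently being read
    for raw in lines:
        text = raw.strip()
        if text in keys:
            # A header must equal a key exactly, so a bio that merely
            # mentions "NAME" is not mistaken for a header.
            if text == first_key:
                people.append({key: [] for key in keys})
            # Ignore headers that appear before the first person starts.
            current = text if people else None
        elif text and current is not None:
            # Non-blank content line: store it under the current key.
            people[-1][current].append(text)
    # Join each section's lines into a single string per key.
    return [{k: " ".join(v) for k, v in p.items()} for p in people]


sample = """NAME
John Doe

DATE OF BIRTH

1992-02-16

BIO

THIS is
 a PRETTY
 long sentence
"""
people = parse_people(sample.splitlines())
print(people[0]["BIO"])  # THIS is a PRETTY long sentence
```

    Matching whole lines instead of using "NAME" in line avoids false hits inside the free text, and skipping blank lines while reading means no separate cleanup pass is needed. If the line breaks inside a value matter, join with "\n" instead of " ".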