
How to keep only strings that follow a specific form in a list (Python)


I have a text corpus extracted from a PDF file, stored in the list below:

list=["7.1 PLAN COST MANAGEMENT",'Plan Cost Management is the process of defining how the project costs will be estimated','7.1.1 PLAN COST MANAGEMENT: INPUTS','Described in Section 4.2.3.1. The project charter provides the preapproved financial ','7.1.1.1 PROJECT CHARTER']

However, I want to extract only the titles in this list, which have a specific form, as in the example: (d.d.d.d + upper-case title), (d.d.d + upper-case title), or (d.d + upper-case title), and get rid of the rest. I don't really know how to approach this properly. Any help is appreciated.


Solution

  • This is a perfect use case for regular expressions. Here's some code to do what you're asking:

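    import re
    
    # Named `lines` rather than `list` to avoid shadowing the built-in list type.
    lines = ["7.1 PLAN COST MANAGEMENT",
             'Plan Cost Management is the process of defining how the project costs will be estimated',
             '7.1.1 PLAN COST MANAGEMENT: INPUTS',
             'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
             '7.1.1.1 PROJECT CHARTER']
    
    # A number, one to three ".number" groups, spaces, then upper-case letters, spaces, or ':'.
    exp = re.compile(r"(\d+(\.\d+){1,3}) +([A-Z :]+)")
    
    for x in lines:
        # match() anchors at the start of the string, so lines that do not
        # begin with a section number are skipped.
        m = exp.match(x)
        if m:
            print(m.group(0))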
    

    Result:

    7.1 PLAN COST MANAGEMENT
    7.1.1 PLAN COST MANAGEMENT: INPUTS
    7.1.1.1 PROJECT CHARTER
    

    You weren't clear about what constitutes a valid "upper case title". This solution assumes that the ':' character and the space character are valid in a title. You can adjust what's inside the square brackets in the expression to change which characters count as valid in a title.
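
    If you want the matching titles collected into a new list rather than printed, one possible variation is sketched below. It is not part of the code above: the names TITLE_RE, WIDER_RE and keep_titles are made up for illustration, and it swaps match() for fullmatch() so that a line such as "7.1 Plan cost" is rejected as a whole instead of matching only its "7.1 P" prefix. WIDER_RE shows one assumed way to widen the character class (digits, '&', ',' and '-'); adjust it to whatever your titles actually contain.

```python
import re

# Hypothetical names for illustration; not from the original answer.
# A number, one to three ".number" groups, spaces, then an upper-case title.
TITLE_RE = re.compile(r"\d+(?:\.\d+){1,3} +[A-Z :]+")

# Assumed wider character class: also allows digits, '&', ',' and '-'.
WIDER_RE = re.compile(r"\d+(?:\.\d+){1,3} +[A-Z0-9 :&,-]+")


def keep_titles(lines, pattern=TITLE_RE):
    """Return only the strings that match the pattern from start to end."""
    # strip() is applied only for the test, so stray surrounding whitespace
    # does not cause a rejection; the original string is what gets kept.
    return [line for line in lines if pattern.fullmatch(line.strip())]


corpus = ["7.1 PLAN COST MANAGEMENT",
          'Plan Cost Management is the process of defining how the project costs will be estimated',
          '7.1.1 PLAN COST MANAGEMENT: INPUTS',
          'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
          '7.1.1.1 PROJECT CHARTER']

titles = keep_titles(corpus)
print(titles)
```

    Passing a different pattern, for example keep_titles(corpus, WIDER_RE), changes which characters are accepted without touching the filtering logic.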