I have a text corpus extracted from a PDF file, stored in the list below:
list=["7.1 PLAN COST MANAGEMENT",'Plan Cost Management is the process of defining how the project costs will be estimated','7.1.1 PLAN COST MANAGEMENT: INPUTS','Described in Section 4.2.3.1. The project charter provides the preapproved financial ','7.1.1.1 PROJECT CHARTER']
However, I want to extract only the titles in this list, which have a specific form, as in these examples: (d.d.d.d + upper-case title), (d.d.d + upper-case title), or (d.d + upper-case title),
and get rid of the rest. I don't really know how to approach this properly.
Any help is appreciated.
This is a perfect use case for regular expressions. Here's some code to do what you're asking:
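import re

# Named "lines" rather than "list" to avoid shadowing the built-in type.
lines = ["7.1 PLAN COST MANAGEMENT",
         'Plan Cost Management is the process of defining how the project costs will be estimated',
         '7.1.1 PLAN COST MANAGEMENT: INPUTS',
         'Described in Section 4.2.3.1. The project charter provides the preapproved financial ',
         '7.1.1.1 PROJECT CHARTER']

# A section number with 2 to 4 dot-separated parts, then spaces,
# then a title made of upper-case letters, spaces, and colons.
exp = re.compile(r"(\d+(\.\d+){1,3}) +([A-Z :]+)")

for x in lines:
    m = exp.match(x)  # match() only anchors at the start of the string
    if m:
        print(m.group(0))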
Result:
7.1 PLAN COST MANAGEMENT
7.1.1 PLAN COST MANAGEMENT: INPUTS
7.1.1.1 PROJECT CHARTER
You weren't clear about what constitutes a valid "upper case title". This solution assumes that the ':' character and whitespace are valid characters in a title. You can adjust what's inside the square brackets in the expression to change which characters count as valid in a title.
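As a rough sketch of that kind of tweak, the variant below widens the character class and uses `fullmatch` so that a line only counts as a title if the whole line fits the pattern. The extra characters allowed (digits, `&`, `/`, `-`, parentheses), the names `TITLE_RE` and `extract_titles`, and the last sample line are assumptions made up for illustration, not something taken from your data.

```python
import re

# Assumed character set for titles: upper-case letters, digits, spaces,
# and a few punctuation marks. Adjust the class to suit your corpus.
# The title must start with an upper-case letter.
TITLE_RE = re.compile(r"(\d+(?:\.\d+){1,3}) +([A-Z][A-Z0-9 :&/()\-]*)")


def extract_titles(lines):
    """Return the lines that are, in full, a section number plus an upper-case title."""
    # fullmatch() requires the entire line to fit the pattern, so a
    # mixed-case line that merely starts like a title is rejected.
    return [line for line in lines if TITLE_RE.fullmatch(line)]


corpus = [
    "7.1 PLAN COST MANAGEMENT",
    "Plan Cost Management is the process of defining how the project costs will be estimated",
    "7.1.1 PLAN COST MANAGEMENT: INPUTS",
    "Described in Section 4.2.3.1. The project charter provides the preapproved financial ",
    "7.1.1.1 PROJECT CHARTER",
    "7.2 Estimate costs for the project",  # made-up line: mixed case, so it is rejected
]

for title in extract_titles(corpus):
    print(title)
```

With `match` instead of `fullmatch`, the made-up last line would be accepted as the partial title "7.2 E", which is probably not what you want if your corpus contains numbered sentences.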