Tags: python, python-3.x, pandas, regex, pdfminer

Using regex to export data from a PDF file to Excel


I am using regex to extract certain strings from a PDF file and write them to an Excel file. The content of my PDF file is as follows:

Task 1: Question 1? answer1
Task 2: Question 2? (Format:****) answer2
Task 3: Question 3? answer3
Task 4: Question 4? (Format:*****) answer4

I want to ignore the parts that say (Format:****). For the other lines the regex works fine. How can I do that? The Excel file should look like this:

[Image: Excel]

Here is my code:

import re
import pandas as pd
from pdfminer.high_level import extract_pages, extract_text

text = extract_text("file.pdf")

pattern1 = re.compile(r":\s*(.*\?)")
pattern2 = re.compile(r".*\?\s*(.*)")
matches1 = pattern1.findall(text)
matches2 = pattern2.findall(text)
df = pd.DataFrame({'Soru-TR': matches1})
df['Cevap'] = matches2
# The context manager closes the file on exit; ExcelWriter.save() was removed in pandas 2.0.
with pd.ExcelWriter('Questions.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)

Solution

  • You can use a single pattern with 2 capture groups, and optionally match a part between parentheses after matching the question mark.

    ^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)
    

    Explanation

    • ^ Start of the string (or of each line with re.MULTILINE)
    • [^:]*: Match any chars except : and then match :
    • \s* Match optional whitespace chars
    • ([^?]+\?) Capture group 1, match 1+ chars other than ? and then match ?
    • \s+ Match 1+ whitespace chars
    • (?:\([^()]*\)?\s)? Optionally match a part from an opening ( till a closing ) followed by a whitespace char; note that the closing ) is itself optional in this pattern
    • (.*) Capture group 2, match the rest of the line
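
To make the parts listed above easier to read next to the pattern itself, here is a sketch of the same expression compiled with re.VERBOSE. The name PATTERN and the two sample lines are only for illustration; the regex is unchanged from the one in this answer.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE
# so that each part can carry its own comment.
PATTERN = re.compile(
    r"""
    ^[^:]*:              # everything up to and including the first colon
    \s*                  # optional whitespace
    ([^?]+\?)            # group 1: the question, up to and including the ?
    \s+                  # whitespace after the question
    (?:\([^()]*\)?\s)?   # optional parenthesized part, not captured
    (.*)                 # group 2: the rest of the line (the answer)
    """,
    re.VERBOSE | re.MULTILINE,
)

# Illustrative lines: one with the (Format:****) part and one without.
for line in ("Task 1: Question 1? answer1",
             "Task 2: Question 2? (Format:****) answer2"):
    match = PATTERN.match(line)
    print(match.group(1), "|", match.group(2))
```

Both lines give the question in group 1 and only the answer in group 2, because the parenthesized part is consumed by the optional non-capturing group.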


    Example code

    import re
    
    pattern = r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)"
    
    s = (
        "Task 1: Question 1? answer1\n"
        "Task 2: Question 2? (Format:****) answer2\n"
        "Task 3: Question 3? answer3\n"
        "Task 4: Question 4? (Format:*****) answer4"
    )
    
    matches1 = []  # questions (capture group 1)
    matches2 = []  # answers (capture group 2)
    # re.MULTILINE lets ^ match at the start of every line, not only of the string.
    for match in re.finditer(pattern, s, re.MULTILINE):
        matches1.append(match.group(1))
        matches2.append(match.group(2))
    
    print(matches1)
    print(matches2)
    

    Output

    ['Question 1?', 'Question 2?', 'Question 3?', 'Question 4?']
    ['answer1', 'answer2', 'answer3', 'answer4']
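
To connect this back to the Excel export in the question, one possible sketch is below. The helper names extract_rows and write_excel are hypothetical, the column names are taken from the question's code, and it assumes an Excel engine such as openpyxl is installed for pandas. In real use the text would come from extract_text("file.pdf") as in the question; text extracted by pdfminer may contain extra blank lines or different line breaks, so the pattern may need adjusting.

```python
import re

PATTERN = re.compile(
    r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)?\s)?(.*)", re.MULTILINE
)


def extract_rows(text):
    """Return a list of (question, answer) tuples, one per matching line."""
    # With two capture groups, findall returns one tuple per match.
    return PATTERN.findall(text)


def write_excel(rows, path="Questions.xlsx"):
    """Write the rows to an Excel file (hypothetical helper)."""
    # pandas is imported here so the parsing step above runs without it.
    import pandas as pd

    df = pd.DataFrame(rows, columns=["Soru-TR", "Cevap"])
    # Passing a path lets to_excel open and close the file itself,
    # so no explicit ExcelWriter.save() call is needed.
    df.to_excel(path, sheet_name="Sheet1", index=False)


sample = (
    "Task 1: Question 1? answer1\n"
    "Task 2: Question 2? (Format:****) answer2\n"
    "Task 3: Question 3? answer3\n"
    "Task 4: Question 4? (Format:*****) answer4"
)

rows = extract_rows(sample)
print(rows)
# write_excel(rows)  # uncomment to create Questions.xlsx
```

Collecting both columns from one pattern keeps each question paired with its own answer, which two separate findall calls do not guarantee if one of them misses a line.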