
How can I extract separated content from the questions in a PDF of the ENEM (Brazilian exam)?


I want to extract the questions of an exam for building a dataset. Here we have an example page of the ENEM, the specific exam I am working with:

Page 4 - ENEM 2022 (Day 1 / Blue)

This is page 4 of the 2022 edition, available here at "microdados_enem_2022/PROVAS E GABARTIOS/ENEM_2022_P1_CAD_01_DIA_1_AZUL.pdf".

This is a typical page of the exam. To keep things simple, I selected a page whose questions contain no images and fit entirely on one page. The desired content is color-coded to show what is what. The objective is to generate a dataset with a list of questions, each one with the following features:

  1. The text (in yellow)
  2. The command or statement (in green)
  3. The alternatives (in blue)

How can I extract these features to generate a dataset from this exam?

I'm trying to use the PyPDF library for Python, but I'm having some difficulty working out how to process the extracted text to generate the dataset. Here is the code at the moment:

from PyPDF2 import PdfReader

# Open reader
reader = PdfReader("ENEM_2022_P1_CAD_01_DIA_1_AZUL.pdf")
        
parts = []
        
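# Visitor function: keep only text whose y coordinate lies between the page
# footer and header (tm[5] is the vertical offset of the text matrix)
def visitor_question(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if 50 < y < 720:
        parts.append(text)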

# Selecting page
page_index = 3  # page x has index x - 1
page = reader.pages[page_index]

# Extracting text
page.extract_text(visitor_text=visitor_question)

# Printing text
text_body = "".join(parts)
print(text_body)

Solution

  • The file structure is good:

    curl -o 2022-p-cad1-blue.pdf https://download.inep.gov.br/enem/provas_e_gabaritos/2022_PV_impresso_D1_CD1.pdf#page=4

    So why not simply export the file as text and parse that in any language?

    xpdf-tools-win-4.04\bin32>pdftotext -enc UTF-8 -f 4 -l 4 2022-p-cad1-blue.pdf -
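Once the page is available as plain text, the parsing step could look roughly like the Python sketch below. It rests on assumptions that are not confirmed by the PDF itself: that every question starts with a header line such as "QUESTÃO 01", that each alternative starts on its own line with its letter (A to E), and that the command is the last blank-line-separated paragraph before the alternatives. The function names are hypothetical, and the heuristics will likely need tuning against the real output.

```python
import re

# Assumption: each question starts with a header line such as "QUESTÃO 01".
QUESTION_HEADER = re.compile(r"^QUEST[ÃA]O\s+(\d+)[ \t]*$", re.MULTILINE | re.IGNORECASE)


def find_alternative_starts(lines):
    """Locate the lines where alternatives A to E begin, scanning from the end.

    Scanning backwards avoids treating an earlier sentence that starts with
    the Portuguese article "A" as the first alternative.
    """
    starts, bound = [], len(lines)
    for letter in "EDCBA":
        for idx in range(bound - 1, -1, -1):
            if re.match(rf"{letter}\s+\S", lines[idx]):
                starts.append(idx)
                bound = idx
                break
        else:
            return []  # Could not find all five alternatives.
    return starts[::-1]


def parse_question(number, body):
    """Turn the text of one question into a dataset record."""
    lines = [line.strip() for line in body.splitlines()]
    starts = find_alternative_starts(lines)
    first = starts[0] if starts else len(lines)
    alternatives = []
    for pos, start in enumerate(starts):
        stop = starts[pos + 1] if pos + 1 < len(starts) else len(lines)
        chunk = " ".join(part for part in lines[start:stop] if part)
        alternatives.append({"letter": chunk[0], "text": chunk[1:].strip()})
    # Heuristic: the last paragraph before the alternatives is the command;
    # everything earlier is the supporting text.
    blocks = re.split(r"\n\s*\n", "\n".join(lines[:first]))
    paragraphs = [block.strip() for block in blocks if block.strip()]
    command = paragraphs.pop() if paragraphs else ""
    return {"number": number, "text": "\n\n".join(paragraphs),
            "command": command, "alternatives": alternatives}


def split_questions(page_text):
    """Split exported page text into one record per question header."""
    matches = list(QUESTION_HEADER.finditer(page_text))
    questions = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_text)
        questions.append(parse_question(int(match.group(1)), page_text[match.end():end]))
    return questions
```

Each record returned by split_questions maps onto the three features asked for in the question (text, command, alternatives), so the list can be written out with json.dump or loaded into a DataFrame.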

    By using -nopgbrk and adding -margint and -marginb, you can remove most of the extra chatter. Then just avoid the centre watermark, either with a regex or by pulling the left and right halves in two passes per page.
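For the regex route, a minimal Python sketch of such a filter is shown here. The patterns are hypothetical placeholders (the real watermark and header strings have to be read off the actual pdftotext output), and drop_noise is an illustrative name, not part of any library.

```python
import re

# Hypothetical patterns for lines to discard; adjust them after inspecting
# the real exported text for the exam being processed.
NOISE_PATTERNS = [
    re.compile(r"ENEM\s*2022"),      # running header/footer or watermark text
    re.compile(r"^\s*\d{1,2}\s*$"),  # a line that holds only a page number
]


def drop_noise(text, patterns=NOISE_PATTERNS):
    """Return the text without the lines that match any noise pattern."""
    kept = [line for line in text.splitlines()
            if not any(pattern.search(line) for pattern in patterns)]
    return "\n".join(kept)
```

Filtering whole lines keeps the sketch simple; if the watermark ends up interleaved inside real sentences, the two-pass column approach is probably the safer choice.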

    To join multiple pages, simply select the range, for example -f 2 -l 31, with exclusions to avoid the vertical text:

    pdftotext -nopgbrk -raw -enc UTF-8 -x 20 -y 50 -W 700 -H 700 -f 2 -l 31 2022-p-cad1-blue.pdf -|findstr /V /R "ENEM 2022" >page2-31.txt
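The "two passes per page" idea can be driven from Python as well. The sketch below reuses the crop flags from the command above (-x, -y, -W, -H) and assumes the installed pdftotext build accepts them. The coordinates are placeholders rather than measured values, and column_commands and extract_columns are hypothetical helper names.

```python
import subprocess


def column_commands(pdf_path, page, page_width=700, top=50, height=700, margin=20):
    """Build one pdftotext command per column: left half first, then right half."""
    half = (page_width - margin) // 2
    commands = []
    for x in (margin, margin + half):
        commands.append([
            "pdftotext", "-nopgbrk", "-raw", "-enc", "UTF-8",
            "-x", str(x), "-y", str(top), "-W", str(half), "-H", str(height),
            "-f", str(page), "-l", str(page), pdf_path, "-",
        ])
    return commands


def extract_columns(pdf_path, page):
    """Run both passes and return the left-column text followed by the right."""
    texts = []
    for command in column_commands(pdf_path, page):
        result = subprocess.run(command, capture_output=True, check=True)
        texts.append(result.stdout.decode("utf-8"))
    return "\n".join(texts)
```

Reading the left column before the right one keeps the questions in page order, and a crop box that stops short of the centre line would leave the vertical watermark out of both passes.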