python, encoding, utf-8, nlp

How can encoded text be converted back to the original text (without the special characters created by encoding)?


I am extracting text from a series of PDF files to do topic modeling. After extracting the text, I want to save the text of each PDF file in a .txt or .doc file. When I tried to write it to a .txt file, I got an error that led me to add .encode('utf-8'), so I added txt = str(txt.encode('utf-8')). The problem comes when I read the .txt files back: they contain special characters introduced by the UTF-8 encoding, and I don't know how to get the original text without those characters. I tried decoding, but it didn't work.

I also tried another approach that avoids the .txt format altogether: saving the extracted text in a data frame. However, I found that only the first few pages were saved in the data frame!

I would appreciate it if you could share how to read from the .txt file without the characters introduced by the encoding ('utf-8'), and how to save the extracted text in a data frame.

import pdfplumber
import pandas as pd
import codecs

txt = ''

with pdfplumber.open(r'C:\Users\thmag\3rdPaperLDA\A1.pdf') as pdf:
    for pg in pdf.pages:
        txt += pg.extract_text()

print(txt)

data = {'text': [txt]}
df = pd.DataFrame(data)


####write in .txt file
text_file = open("Test.txt", "wt")
txt = str(txt.encode('utf-8'))
n = text_file.write(txt)
text_file.close()

####read from .txt file
with codecs.open('Test.txt', 'r', 'utf-8') as f:
    for line in f:
        print (line)

Solution

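  • You are writing the file incorrectly. txt.encode('utf-8') returns a bytes object, and calling str() on it produces its printable representation (b'...' with \x.. escape sequences) rather than the text itself, which is where the special characters come from. Rather than encoding the text yourself, declare an encoding when you open the file and write the text without encoding; Python will automatically encode it.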

    It should be:

    
    ####write in .txt file
    with open("Test.txt", "wt", encoding='utf-8') as text_file:
        n = text_file.write(txt)
    

    Unless you are using Python 2, you don't need codecs to open encoded files; again, you can declare the encoding in the open function:

    with open("Test.txt", "rt", encoding='utf-8') as f:
        for line in f:
            print(line, end='')  # each line read from the file already ends with a newline
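
To make the difference concrete, here is a minimal sketch. The sample string and the temporary file path are made up for illustration and are not taken from the question's PDFs. It shows what str(txt.encode('utf-8')) actually produces, a round trip that relies only on the encoding argument of open, and one possible way to recover text that was already written the wrong way, assuming the file holds exactly one b'...' literal as the question's code would produce.

```python
import ast
import os
import tempfile

# Hypothetical sample standing in for text extracted from a PDF.
text = "naïve café, résumé\nsecond line"

# What the question's code writes: the repr of a bytes object, not the text.
broken = str(text.encode("utf-8"))
print(broken[:2])  # the string literally starts with b'

# Correct round trip: let open() do the encoding and decoding.
path = os.path.join(tempfile.mkdtemp(), "Test.txt")
with open(path, "wt", encoding="utf-8") as text_file:
    text_file.write(text)
with open(path, "rt", encoding="utf-8") as f:
    restored = f.read()

# Possible recovery for content that was already saved as a b'...' literal:
# parse the literal back into bytes, then decode those bytes as UTF-8.
recovered = ast.literal_eval(broken).decode("utf-8")

print(restored == text, recovered == text)
```

ast.literal_eval only parses Python literals, so it is a safer choice here than eval. If a file was produced some other way, this recovery step will not apply.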