I am extracting text from a series of PDF files for topic modeling. After extracting the text from each PDF, I want to save it to a .txt or .doc file. When I tried this, I got an error telling me to add .encode('utf-8') before saving the extracted text in a .txt file, so I added txt = str(txt.encode('utf-8')). The problem is reading the .txt files back: the text contains special characters caused by the UTF-8 encoding, and I don't know how to get the original text without those characters. I tried decoding, but it didn't work.
I also tried another approach to avoid the .txt format: saving the extracted text in a data frame. However, I found that only the first few pages were saved in the data frame!
I would appreciate any solutions for reading the .txt file without the encoding-related ('utf-8') characters, and for saving the extracted text in a data frame.
import pdfplumber
import pandas as pd
import codecs
txt = ''
with pdfplumber.open(r'C:\Users\thmag\3rdPaperLDA\A1.pdf') as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        txt += pages[i].extract_text()
print (txt)
data = {'text': [txt]}
df = pd.DataFrame(data)
####write in .txt file
text_file = open("Test.txt", "wt")
txt = str(txt.encode('utf-8'))
n = text_file.write(txt)
text_file.close()
####read from .txt file
with codecs.open('Test.txt', 'r', 'utf-8') as f:
    for line in f:
        print(line)
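You are writing the file incorrectly. Calling str() on the bytes returned by encode() gives you their printed representation (b'...' with \x escapes), and that is what ends up in your file. Rather than encoding the text yourself, declare an encoding when you open the file and write the text as-is; Python will encode it automatically.
It should be: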
####write in .txt file
with open("Test.txt", "wt", encoding='utf-8') as text_file:
    n = text_file.write(txt)
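To see where the stray characters come from, here is a minimal sketch comparing the two ways of writing. The sample string and the temporary file path are made up for illustration and stand in for your extracted PDF text and Test.txt:

```python
import os
import tempfile

# Hypothetical sample standing in for text extracted from a PDF.
txt = "Café résumé\nsecond line"

# What the question's code does: str() of a bytes object is its repr,
# so the result starts with b' and contains \x escape sequences.
broken = str(txt.encode("utf-8"))
print(broken)

# Instead, let open() handle the encoding and write the str unchanged.
path = os.path.join(tempfile.mkdtemp(), "Test.txt")
with open(path, "wt", encoding="utf-8") as text_file:
    text_file.write(txt)

# Reading back with the same encoding returns the original text.
with open(path, "rt", encoding="utf-8") as f:
    restored = f.read()

print(restored == txt)
```

The first print shows the kind of debris you are seeing in your file; the round trip through open(..., encoding='utf-8') should give back the text unchanged.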
Unless you are using Python 2, you don't need codecs to open encoded files; again, you can declare the encoding in the open function:
with open("Test.txt", "rt", encoding='utf-8') as f:
    for line in f:
        print(line)
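If you already have .txt files written the old way and cannot easily re-extract the PDFs, the text can probably still be recovered. This is a sketch under the assumption that each file contains exactly one b'...' representation, as your original code would have written it; recover_text is a hypothetical helper name, not part of any library. ast.literal_eval only parses Python literals, so it does not execute anything in the file:

```python
import ast


def recover_text(file_content):
    """Turn the "b'...'" text produced by str(bytes) back into the original str."""
    raw = ast.literal_eval(file_content.strip())
    if not isinstance(raw, bytes):
        raise ValueError("content is not a bytes literal")
    return raw.decode("utf-8")


# Simulate what the question's code wrote to disk (hypothetical sample text).
original = "Café résumé\nsecond line"
file_content = str(original.encode("utf-8"))

print(recover_text(file_content) == original)
```

For a real file, you would read its whole content with f.read() and pass that to recover_text, rather than iterating line by line.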