What the code does
I am trying to read each HTML file in a given folder and extract some lines from it using the bs4 (BeautifulSoup) package in Python.
Reading the files fails with an error saying that some Unicode characters cannot be decoded.
Error
    Traceback (most recent call last):
      File "C:-----\check.py", line 25, in <module>
        soup=BeautifulSoup(text.read(), 'html.parser')
      File "C:\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3565: character maps to <undefined>
    from bs4 import BeautifulSoup
    from termcolor import colored
    import re, os
    import requests

    path = "./brian-work/"
    freddys_library = os.listdir(path)

    def getfiles():
        files = []  # initialisation was missing from the posted code
        for r, d, f in os.walk(path):
            for file in f:
                if '.html' in file:
                    files.append(os.path.join(r, file))
        return files

    count = 0  # initialisation was missing from the posted code
    for book in getfiles():
        print("file is printed")
        print(book)
        text = open(book, "r")
        soup=BeautifulSoup(text.read(), 'html.parser')
        h1 = soup.select('h1')[0].text.strip()
        print(h1)
        if soup.find('h1'):
            h1 = soup.select('h1')[0].text.strip()
        else:
            print("no h1")
            continue
        filename1=book.split("/")[-1]
        filename1=filename1.split(".")[0]
        print(h1.split(' ', 1)[0])
        print(filename1)
        if h1.split(' ', 1)[0].lower() == filename1.split('-',1)[0]:
            print('+++++++++++++++++++++++++++++++++++++++++++++')
            print('same\n')
        else:
            print('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
            print('not')
            count=count+1
Please help: what should I correct here?
Thanks.
The problem is that the file is opened without specifying its encoding. In text mode, text = open(book, "r") falls back to the default encoding, which per the open() documentation is the value returned by locale.getpreferredencoding(False). On your system that is cp1252. The file is in some other encoding, so decoding it fails.
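To see the failure in isolation, here is a small sketch. It assumes the offending byte comes from a UTF-8 curly quote, whose last byte happens to be 0x9d; that is a common cause but only a guess about your files. The helper can_decode is a name made up for this example.

```python
import locale

# The encoding open() uses in text mode when none is given.
# On the asker's machine this is cp1252, per the traceback.
print(locale.getpreferredencoding(False))


def can_decode(data, encoding):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Assumption: a right double quotation mark (U+201D) stored as UTF-8.
raw = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

# Byte 0x9d has no mapping in Python's cp1252 codec, so decoding fails.
print(can_decode(raw, "cp1252"))

# The same bytes are valid UTF-8.
print(can_decode(raw, "utf-8"))
```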
Use text = open(book, "rb") (binary mode) and let BeautifulSoup work out the encoding. HTML files usually declare their encoding.
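A minimal sketch of the binary-mode approach follows. The sample file and its contents are invented for the example, and the file is assumed to be UTF-8 with a matching meta charset declaration; your real files may differ.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample file standing in for one of the HTML files in the folder.
html = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><h1>Caf\u00e9 \u201cquoted\u201d</h1></body></html>"
)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(sample_path, "wb") as f:
    f.write(html.encode("utf-8"))

# Binary mode: read() returns bytes, so open() performs no decoding
# and the locale's default encoding never comes into play.
with open(sample_path, "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# BeautifulSoup records the encoding it settled on for the bytes.
print(soup.original_encoding)

heading = soup.find("h1").get_text(strip=True)
```

Using a with block also closes the file, which the original loop never does.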
If you already know the correct encoding, you can instead pass it explicitly, for example text = open(book, encoding='utf8').
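As a rough illustration of the explicit-encoding route, assuming the files really are UTF-8 (the sample file below is made up). The errors="replace" part is an optional extra that is not needed if the encoding is right; it only shows what happens when the chosen encoding does not match the bytes.

```python
import os
import tempfile

# Hypothetical UTF-8 file containing a curly apostrophe (U+2019).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(sample_path, "wb") as f:
    f.write("<h1>It\u2019s here</h1>".encode("utf-8"))

# Naming the encoding overrides the locale default (cp1252 on the asker's system).
with open(sample_path, encoding="utf-8") as f:
    text = f.read()

# With a mismatched encoding, errors="replace" substitutes U+FFFD for each
# undecodable byte instead of raising UnicodeDecodeError. Data is lost,
# so treat this as a fallback rather than a fix.
with open(sample_path, encoding="ascii", errors="replace") as f:
    lossy = f.read()
```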