python beautifulsoup encoding

Decoding non-latin characters when parsing a site with BeautifulSoup


I am scraping a table from this page: http://www.bvbf.ru/food_paper.html

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.bvbf.ru/food_paper.html'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('table', id='table1')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
rows = []
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    # Rows with fewer cells than headers are left-aligned; missing cells become NaN
    rows.append(pd.Series(row, index=headers[:len(row)]))
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
mydata = pd.DataFrame(rows, columns=headers)

I get the DataFrame, but it contains garbled text such as делий, бÑÐ. The same characters appear in the HTML source of the page when I print it. Is there a way to convert them to what is shown on the site (Cyrillic letters)?


Solution

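  • Pass the raw bytes (page.content) to BeautifulSoup instead of the already decoded string (page.text), so that BeautifulSoup can detect the encoding itself. Just change this line: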

    soup = BeautifulSoup(page.text, 'lxml')
    

    to this:

    soup = BeautifulSoup(page.content, 'lxml')
    

    You should get something like this:

    +----+---------------------------------+---------------------+-----------------------------------+----------------------+
    |    | Наименование продукции          | Плотность (гр/м2)   | Формат (мм)                       | Единица измерения    |
    |----+---------------------------------+---------------------+-----------------------------------+----------------------|
    |  0 | Оберточная бумага марки «Д»     | 40-60               | 840                               | тн                   |
    |  1 | 80                              | 840                 | тн                                | nan                  |
    |  2 | 40-60                           | 1050                | тн                                | nan                  |
    |  3 | 80-120                          | 1050                | тн                                | nan                  |
    |  4 | Бумаг оберточная / влагостойкая | 80                  | 103, 107, 121, 125, 118, 120, 123 | тн                   |
    +----+---------------------------------+---------------------+-----------------------------------+----------------------+
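
To see why page.text produces the garbled characters, here is a minimal sketch using only the standard library. It assumes the page is served as UTF-8 without a charset in the Content-Type header, in which case requests falls back to ISO-8859-1 when building page.text for text/* responses. The variable raw is a hypothetical stand-in for page.content, not the real response from the site.

```python
# Hypothetical stand-in for page.content: UTF-8 encoded bytes of a Cyrillic header cell.
raw = 'Формат (мм)'.encode('utf-8')

# Assumed cause of the problem: the bytes are decoded as ISO-8859-1,
# so every 2-byte Cyrillic letter turns into two Latin-1 characters.
garbled = raw.decode('iso-8859-1')

# Decoding the same bytes with the right codec gives the expected text.
correct = raw.decode('utf-8')

# A string that was already mis-decoded this way can be repaired by
# reversing the wrong step and then decoding properly.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled)              # mojibake starting with 'Ð'
print(correct)
print(repaired == correct)  # True
```

If you would rather keep using page.text, setting page.encoding = 'utf-8' (or page.encoding = page.apparent_encoding) before reading page.text should have the same effect, again assuming the page really is UTF-8.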