I am scraping a table from this site: http://www.bvbf.ru/food_paper.html
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.bvbf.ru/food_paper.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('table', id='table1')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
mydata = pd.DataFrame(columns = headers)
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    new_row = pd.DataFrame([row], columns=mydata.columns[:len(row)])
    mydata = pd.concat([mydata, new_row], ignore_index=True)
I get the DataFrame, but it contains garbled characters such as делий, бÑÐ. The same characters appear in the HTML source of the page. Is there a way to convert them to the way they appear on the site (Cyrillic letters)?
Just change this line:
soup = BeautifulSoup(page.text, 'lxml')
to this:
soup = BeautifulSoup(page.content, 'lxml')
You should get something like this:
+----+---------------------------------+---------------------+-----------------------------------+----------------------+
|    | Наименование продукции          | Плотность (гр/м2)   | Формат (мм)                       | Единица измерения    |
|----+---------------------------------+---------------------+-----------------------------------+----------------------|
|  0 | Оберточная бумага марки «Д»     | 40-60               | 840                               | тн                   |
|  1 | 80                              | 840                 | тн                                | nan                  |
|  2 | 40-60                           | 1050                | тн                                | nan                  |
|  3 | 80-120                          | 1050                | тн                                | nan                  |
|  4 | Бумаг оберточная / влагостойкая | 80                  | 103, 107, 121, 125, 118, 120, 123 | тн                   |
+----+---------------------------------+---------------------+-----------------------------------+----------------------+
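As for why this works: the garbled text looks like UTF-8 bytes decoded as ISO-8859-1. When the server's Content-Type header does not name a charset, requests most likely falls back to ISO-8859-1 when building page.text, whereas page.content is the raw bytes, and BeautifulSoup detects the encoding of bytes input on its own. Here is a minimal sketch of the effect that needs no network access; the sample string and the variable names are illustrative assumptions, not taken from the page's actual response:

```python
# Stand-in for the raw bytes of a UTF-8 page (what page.content would hold).
original = 'Формат (мм)'
raw_bytes = original.encode('utf-8')

# Decoding those bytes with the wrong codec reproduces the garbled text
# (this is roughly what page.text does when it guesses ISO-8859-1).
garbled = raw_bytes.decode('iso-8859-1')

# Undoing the wrong decode and applying the right one restores the text.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(repaired == original)
```

If you would rather keep using page.text, setting page.encoding = 'utf-8' (or page.encoding = page.apparent_encoding) before reading it should give the same result, assuming the page really is UTF-8.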