I'm trying to scrape data from a table on a website. However, I keep running into "ValueError: cannot set a row with mismatched columns".
The set-up is:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('div', id='content')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)  # collect each header cell's text
my_data = pd.DataFrame(columns=headers)
my_data = my_data.iloc[:, :-4]  # drop the repeated columns at the end
Here, I was able to make an empty dataframe with the same headers as the table (I used iloc because there were some repeated columns at the end).
Next, I wanted to fill in the empty dataframe with:
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(my_data)
    my_data.loc[length] = row
However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" in this line: my_data.loc[length] = row. I would really appreciate any help to solve this problem and to fill in the empty dataframe.
Thanks in advance.
Rather than trying to fill an empty DataFrame, it would be simpler to use pd.read_html, which returns a list of DataFrames after parsing every table tag within the HTML.
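As a rough illustration of that behaviour, here is a minimal sketch that runs read_html on a small, made-up HTML snippet instead of the real page. The markup and channel names are hypothetical, and it assumes an HTML parser such as lxml is installed, as in the question's set-up.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, made up for illustration only.
html = """
<table>
  <tr><th>rank</th><th>Youtuber</th></tr>
  <tr><td>1</td><td>Channel A</td></tr>
  <tr><td>2</td><td>Channel B</td></tr>
</table>
<table>
  <tr><th>rank</th><th>started</th></tr>
  <tr><td>1</td><td>2006</td></tr>
</table>
"""

# read_html accepts a file-like object, so the string is wrapped in StringIO.
tables = pd.read_html(StringIO(html))

print(len(tables))                  # one DataFrame per table tag
print(tables[0].columns.tolist())   # header cells become column names
```

Each table tag becomes its own DataFrame, so there is no need to build the header list and rows by hand.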
Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), three DataFrames are returned because the second table is split into two table tags between rows 12 and 13 for some reason; they can still all be combined into a single DataFrame.
dfList = pd.read_html(url) # OR
# dfList = pd.read_html(page.text) # OR
# dfList = pd.read_html(soup.prettify())
allTime = dfList[0].set_index(['rank', 'Youtuber'])
# (header row in 1st half so 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])
columns_ordered = [
    'started', 'category', 'subscribers', 'subscribers/year',
    'video views', 'Video views/Year', 'video count', 'Video count/Year'
]  # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]
If the [columns_ordered] part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'.
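To see why the default order comes out that way, here is a small sketch with made-up stand-ins for the scraped tables. The frames, values, and variable names are hypothetical and only mirror the shape of the real data: one frame with a header, one headerless continuation, and one all-time frame.

```python
import pandas as pd

# Made-up numbers standing in for the scraped tables.
all_time = pd.DataFrame({
    'rank': [1, 2],
    'Youtuber': ['Channel A', 'Channel B'],
    'subscribers': [100, 90],
})
first_half = pd.DataFrame({
    'rank': [1],
    'Youtuber': ['Channel A'],
    'subscribers/year': [10],
})
# The second half has no header row, so its columns are numbered 0, 1, 2.
second_half = pd.DataFrame([[2, 'Channel B', 9]])

# Borrow the header from the first half before stacking the two halves.
second_half.columns = first_half.columns
per_year = pd.concat([first_half, second_half]).set_index(['rank', 'Youtuber'])
all_time = all_time.set_index(['rank', 'Youtuber'])

# axis='columns' aligns rows on the shared (rank, Youtuber) index and
# places the columns of the first frame before those of the second.
combined = pd.concat([all_time, per_year], axis='columns')
print(combined.columns.tolist())

# Selecting with a list of names re-orders the columns as preferred.
reordered = combined[['subscribers/year', 'subscribers']]
print(reordered.columns.tolist())
```

Without the final selection, the columns simply follow the order of the frames passed to concat, which is why the all-time columns come first in the real result.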