Tags: python · excel · pandas · dataframe · data-mining

Cannot export to ".csv" file - pandas.DataFrame


I would like to seek help with my Google Colaboratory notebook. The error occurs in the fourth cell.

Context:
We're web-scraping BTC's historical data.

Here's my code:

First cell (Executed successfully)

# importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

Second cell (Executed successfully)

#sample url
url = "https://www.bitrates.com/coin/BTC/historical-data/USD?period=allData&limit=500"
#request the page
page = requests.get(url)
#creating a soup object and the parser
soup = BeautifulSoup(page.text, 'lxml')

#creating a table body to pass on the soup to find the table
table_body = soup.find('table')
#creating an empty list to store information
row_data = []

#creating a table 
for row in table_body.find_all('tr'):
  col = row.find_all('td')
  col = [ele.text.strip() for ele in col ] # stripping the whitespaces
  row_data.append(col) #append the column

# extracting all data on table entries
df = pd.DataFrame(row_data)
df

Third cell (Executed successfully)

headers = []
for i in soup.find_all('th'):
  col_name = i.text.strip().lower().replace(" ", "_")
  headers.append(col_name)
headers

Fourth cell (Execution failed)

df = pd.DataFrame(row_data, columns=headers)
df
#into a file 
df.to_csv('/content/file.csv')

The error! :(

AssertionError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
    563     try:
--> 564         columns = _validate_or_indexify_columns(content, columns)
    565         result = _convert_object_array(content, dtype=dtype, coerce_float=coerce_float)
AssertionError: 13 columns passed, passed data had 7 columns

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
    565         result = _convert_object_array(content, dtype=dtype, coerce_float=coerce_float)
    566     except AssertionError as e:
--> 567         raise ValueError(e) from e
    568     return result, columns
    569 

ValueError: 13 columns passed, passed data had 7 columns
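The numbers in the message point at the cause: soup.find_all('th') in the third cell collects header cells from every table on the page (13 names), while each entry in row_data holds only the 7 <td> cells of a row from the first table; the header row itself, having no <td> cells, even contributes an empty list. A minimal reproduction of the same mismatch, with made-up column names:

```python
import pandas as pd

# 7 values per row, but 13 header names -> same mismatch as in the traceback
row_data = [["x"] * 7]
headers = [f"col{i}" for i in range(13)]

try:
    pd.DataFrame(row_data, columns=headers)
except ValueError as e:
    print(e)  # e.g. "13 columns passed, passed data had 7 columns"
```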

Solution

  • To load the table you can simply use pd.read_html(). For example:

    import pandas as pd
    
    url = "https://www.bitrates.com/coin/BTC/historical-data/USD?period=allData&limit=500"
    
    df = pd.read_html(url)[0]
    print(df)
    df.to_csv("data.csv")
    

    Creates data.csv (screenshot from LibreOffice omitted).
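pd.read_html() parses every <table> it finds and returns a list of DataFrames, which is why the [0] index is needed above. A small sketch with an inline HTML string (the StringIO wrapper avoids a deprecation warning in newer pandas, which no longer accepts literal HTML strings directly):

```python
import pandas as pd
from io import StringIO

html = """
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table><tr><th>b</th></tr><tr><td>2</td></tr></table>
"""

# one DataFrame per <table> element, in document order
tables = pd.read_html(StringIO(html))
print(len(tables))                 # 2
print(tables[0].columns.tolist())  # ['a']
print(tables[1].columns.tolist())  # ['b']
```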


    To correct your example, take the headers from the same table (table_body.select("th") instead of soup.find_all('th')) and skip rows that contain no <td> cells:

    # importing libraries
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    
    # sample url
    url = "https://www.bitrates.com/coin/BTC/historical-data/USD?period=allData&limit=500"
    # request the page
    page = requests.get(url)
    # creating a soup object and the parser
    soup = BeautifulSoup(page.text, "lxml")
    
    # creating a table body to pass on the soup to find the table
    table_body = soup.find("table")
    # creating an empty list to store information
    row_data = []
    
    # creating a table
    for row in table_body.select("tr:has(td)"):
        col = row.find_all("td")
        col = [ele.text.strip() for ele in col]  # stripping the whitespaces
        row_data.append(col)  # append the column
    
    
    headers = []
    for i in table_body.select("th"):
        col_name = i.text.strip().lower().replace(" ", "_")
        headers.append(col_name)
    
    df = pd.DataFrame(row_data, columns=headers)
    print(df)
    df.to_csv("/content/file.csv")
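The key change is the tr:has(td) selector, which skips rows that contain only <th> cells; plain find_all('tr') would append an empty list for the header row. A toy table showing the difference (the :has() selector needs soupsieve, which is bundled with bs4 4.7+):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>date</th><th>price</th></tr>
  <tr><td>2021-01-01</td><td>29000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# find_all('tr') also visits the header row, which has no <td> cells,
# so it contributes an empty list
all_rows = [[td.text for td in tr.find_all("td")] for tr in table.find_all("tr")]
print(all_rows)  # [[], ['2021-01-01', '29000']]

# tr:has(td) only matches rows that actually contain <td> cells
data_rows = [[td.text for td in tr.find_all("td")] for tr in table.select("tr:has(td)")]
print(data_rows)  # [['2021-01-01', '29000']]
```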