python, pandas, web-scraping, multiple-columns, web-inspector

Trying to scrape from pages with Python and put this info into a csv, getting only the results for the last element of the list


I'm trying to scrape from multiple Ballotpedia pages with Python and put this info into a csv, but am only getting the results for the last element of the list. Here is my code:

import pandas as pd

list = ['https://ballotpedia.org/Alaska_Supreme_Court', 
'https://ballotpedia.org/Utah_Supreme_Court']

for page in list:
    frame = pd.read_html(page, attrs={"class": "wikitable sortable jquery-tablesorter"})[0]

    frame.drop("Appointed By", axis=1, inplace=True)

frame.to_csv("18-TEST.csv", index=False)

I've been playing around with adding and deleting parts of the last line of the code, but the issue remains. The first element of the list must be getting added to the csv but then gets replaced by the second element. How can I get both to show up in the csv at the same time? Thank you very much!


Solution

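  • Every iteration rebinds your frame variable, so the previous page's table is thrown away and only the last one reaches to_csv. You'll have to accumulate the entries from every page in one dataframe to save them all as one csv. Also, like piterbarg mentioned, list is the name of a Python built-in type (not a reserved word). Reusing it as a variable name doesn't break your code here, but it shadows the built-in and is bad practice ;).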

    import pandas as pd
    
    # better variable name "pages"
    pages = ['https://ballotpedia.org/Alaska_Supreme_Court',
             'https://ballotpedia.org/Utah_Supreme_Court']
    
    # list outside the loop to collect every page's table in
    frames = []
    
    for page in pages:
        frame = pd.read_html(page, attrs={'class': 'wikitable sortable jquery-tablesorter'})[0]
        frame.drop('Appointed By', axis=1, inplace=True)
        # keep this particular page's data
        frames.append(frame)
    
    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat;
    # ignore_index discards the indices from the frames we're combining,
    # so the indices in the judges frame are continuous
    judges = pd.concat(frames, ignore_index=True)
    
    # after the loop, save the complete dataframe to a csv
    judges.to_csv('18-TEST.csv', index=False)
    

    This will save it all in one csv. Give that a try!
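
If you want to check the accumulate-then-combine pattern without hitting the network, here is a minimal sketch. The helper name combine_frames and the two small dataframes are hypothetical stand-ins for what pd.read_html would return, not real Ballotpedia data. It assumes a pandas version that has pd.concat, which is the usual replacement for the removed DataFrame.append.

```python
import pandas as pd


def combine_frames(frames, drop_column=None):
    """Concatenate per-page tables into one DataFrame with a continuous index."""
    cleaned = []
    for frame in frames:
        if drop_column is not None:
            # errors="ignore" skips tables that lack the column instead of raising
            frame = frame.drop(columns=drop_column, errors="ignore")
        cleaned.append(frame)
    # Combine once, after the loop; ignore_index renumbers rows 0..n-1
    return pd.concat(cleaned, ignore_index=True)


# Hypothetical stand-ins for the tables pd.read_html would return
alaska = pd.DataFrame({"Judge": ["A. Example", "B. Example"],
                       "Appointed By": ["Gov. X", "Gov. Y"]})
utah = pd.DataFrame({"Judge": ["C. Example"],
                     "Appointed By": ["Gov. Z"]})

judges = combine_frames([alaska, utah], drop_column="Appointed By")
print(judges)
```

Collecting the frames in a list and calling pd.concat once is usually cheaper than growing a dataframe inside the loop, since each in-loop concatenation copies everything accumulated so far.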