
I am trying to extract data from a website in Python using the newspaper library.


import time

import pandas as pd
from newspaper import Article

l = []  # one dict per article; url_list holds the 589 article URLs

def convert():
    for url in url_list:
        news = Article(url)
        news.download()
        # wait until the download is reported as finished (state 2)
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )

convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2'+'.csv', encoding='utf-8', index=False)

The function convert() iterates over url_list and processes each URL; each URL points to an article. I fetch the important attributes of each article (title, text, author, publish date), store them in a data frame, and then write the data frame to a CSV file. The script ran for about 5 hours, since there are 589 URLs in url_list, but I still could not get the CSV file. Can somebody spot where I am going wrong?


Solution

  • Your function probably gets stuck here:

        while news.download_state != 2:
            time.sleep(1)
    

    It is waiting for the download state to change, but download() has already finished by that point; if the download failed, the state never becomes 2, so the loop spins forever. Your function should also return the list it builds.

    Something like this should work:

    def convert():
        l = []  # collect one dict per article
        for url in url_list:
            news = Article(url)
            news.download()
            news.parse()
            l.append(
                {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
            )
        return l

    l = convert()
    df = pd.DataFrame.from_dict(l)
    df.to_csv('Amazon_try2'+'.csv', encoding='utf-8', index=False)
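
    As an optional refinement, here is a minimal sketch (assuming the same newspaper3k Article API, imports and url_list as above, with the URL list passed in as a parameter for clarity) that wraps each URL in a try/except so one bad link does not abort the whole five-hour run, and prints rough progress as it goes:

    def convert(urls):
        rows = []
        for i, url in enumerate(urls, start=1):
            try:
                news = Article(url)
                news.download()
                news.parse()  # raises if the download did not succeed
            except Exception as exc:
                # skip this article instead of aborting the whole run
                print(f'{i}/{len(urls)}: skipped {url} ({exc})')
                continue
            rows.append(
                {'Title': news.title,
                 'Text': news.text.replace('\n', ' '),
                 'Date': news.publish_date,
                 'Author': news.authors}
            )
            print(f'{i}/{len(urls)}: ok')
        return rows

    df = pd.DataFrame.from_dict(convert(url_list))
    df.to_csv('Amazon_try2'+'.csv', encoding='utf-8', index=False)

    Catching the failure per URL means one dead link costs you a single row, not the entire CSV.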