Tags: python, web-scraping, readlines, python-newspaper

How to input a list of URLs saved in a .txt to a Python program?


I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url, to which I apply methods from the newspaper3k Python library. The program extracts the URL content, the authors of the article, a summary of the text, etc., then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?

I am just starting out with Python; in fact, this is my first script. I first tried simply writing url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. I then tried calling read() and readlines() on it, but that fails with the error 'str' object has no attribute 'read' (or 'readlines'). What should I use to read those URLs, each on its own line of the .txt file, as the input to my simple script? Should I convert the string to something else?
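The error described above can be reproduced in a few lines. This is only a minimal sketch: the temporary file and the example.com URLs are made up so the snippet runs on its own. It suggests the likely cause, namely that a file name is just a string, and only the object returned by open() has read() and readlines().

```python
import os
import tempfile

# Hypothetical file with one URL per line, created here so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("https://example.com/a\nhttps://example.com/b\n")

filename = path  # a str: it only holds the file's name, not its contents
try:
    filename.readlines()
except AttributeError as err:
    error_message = str(err)  # the same kind of error as in the question

# open() turns the name into a file object, which does have readlines()
with open(filename) as url_file:
    lines = url_file.readlines()  # each item still ends with "\n"
```

Note that every line returned by readlines() keeps its trailing newline, so it usually needs a strip() before being used as a URL.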

Extract from the code, lines 1-18:

from newspaper import Article
from newspaper import fulltext
import requests


url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary

Later in the script I built some functions to display the info in a desired format and save it to a new .txt. I know this is a very basic question, but I am honestly stuck. I have read other similar questions here, but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which other methods are then applied to extract its content?

This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.


Solution

  • This could help you:

    with open('myfile.txt', 'r') as url_file:
        for line in url_file:
            url = line.strip()  # remove the trailing newline
            print(url)
    

    You can apply it to your code as follows:

    from newspaper import Article
    from newspaper import fulltext
    import requests
    
    # The with block closes the file automatically, even if an error occurs
    with open('myfile.txt', 'r') as url_file:
        for line in url_file:
            url = line.strip()  # remove the trailing newline
            if not url:
                continue  # skip blank lines
            a = Article(url, language='pt')
            html = requests.get(url).text
            text = fulltext(html)
            download = a.download()
            parse = a.parse()
            nlp = a.nlp()
            title = a.title
            publish_date = a.publish_date
            authors = a.authors
            keywords = a.keywords
            summary = a.summary
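
With thousands of URLs, it may also help to separate reading the file from processing each URL, so that one bad line or failed download does not stop the whole run. The following is a minimal sketch under that assumption, not part of the original answer: read_urls, process_all, and fake_handler are hypothetical names, and the demo uses a temporary file and a stand-in handler so it runs without network access.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file, with surrounding whitespace removed."""
    with open(path, encoding="utf-8") as url_file:
        return [line.strip() for line in url_file if line.strip()]


def process_all(urls, handler):
    """Call handler(url) for each URL and collect failures instead of stopping."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(handler(url))
        except Exception as exc:  # broad on purpose: one bad URL should not end the run
            failures.append((url, exc))
    return results, failures


# Demo with a temporary file and a stand-in handler (no network access needed)
demo_path = os.path.join(tempfile.mkdtemp(), "urls.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("https://example.com/a\n\n  https://example.com/b  \nbad\n")


def fake_handler(url):
    # Stand-in for the real per-article work; rejects anything that is not a URL
    if not url.startswith("http"):
        raise ValueError("not a URL: " + url)
    return url.upper()


urls = read_urls(demo_path)
results, failures = process_all(urls, fake_handler)
```

In the real script, the handler would be a function that wraps the Article calls shown above and returns whatever should be written to the output file; the failures list can then be logged or retried afterwards.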