
Python Flask app returns different (crawled) string than python directly


I have found something odd in a Flask app I am working on. The Flask API is meant to receive a news article URL, crawl it (using the newspaper library), and predict a category for the crawled text.

However, when I run the crawler directly in Python (Spyder), it returns the article text as expected.

from newspaper import Article

url='https://www.dev-insider.de/index.cfm?pid=15010&pk=676039'
article = Article(str(url) , browser_user_agent = 'Chrome', http_success_only=False)
article.download()
article.parse()
print(article.text)

This works like a charm. If I now run the same code inside the Flask app, it returns a different string that belongs to the navigation of the crawled URL:

from flask import Flask, request
from newspaper import Article

app = Flask(__name__)
app.config['JSON_AS_ASCII'] = False
app.config['MAX_CONTENT_LENGTH'] = 1000000

#url='https://www.dev-insider.de/index.cfm?pid=15010&pk=676039'
@app.route('/test')
def bla():
    url = request.args.get('url')    
    article = Article(str(url) , browser_user_agent = 'Chrome', http_success_only=False)
    article.download()
    article.parse()
    text_raw = article.text
    return text_raw

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Basically, the first snippet returns the complete article text, while the second snippet returns:

Sie befinden sich hier: DevOps > Configuration-Management Sie sind noch nicht angemeldet Login | Registrierung | Newsletter

I hope I made the problem clear enough. Let me know if not.

Any ideas what's going on?


Solution

  • If you are passing the URL as a query string, you need to make sure it is properly percent-encoded before it is sent and decoded again in your code. That means you'd call the app with:

    http://localhost:5000/test?url=https%3A%2F%2Fwww.dev-insider.de%2Findex.cfm%3Fpid%3D15010%26pk%3D676039
    

    As far as I know, Flask already decodes query strings for you, so you shouldn't need to decode the value yourself.

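    The specification for URLs describes how a URL should be formatted. If you paste a URL into the query string without encoding it, you break that format: the & in the article URL is read as a parameter separator of the outer request, so request.args.get('url') only receives https://www.dev-insider.de/index.cfm?pid=15010, and pk=676039 becomes a separate parameter of the Flask request. The crawler then fetches a different page, which would explain the navigation text you are seeing.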

    RFC 1738 says:

    An HTTP URL takes the form:

     http://<host>:<port>/<path>?<searchpart>
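
To make the effect of the missing encoding concrete, here is a minimal sketch using only the standard library. It assumes the app from the question is reachable at localhost:5000, and it uses urllib.parse.parse_qs as a stand-in for Flask's own query parsing (an assumption: both should split the query string on & in the same way). The variable names are only for illustration.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Article URL from the question.
target = "https://www.dev-insider.de/index.cfm?pid=15010&pk=676039"

# Unencoded: "&pk=..." is treated as a second parameter of the outer
# request, so the "url" parameter loses its tail.
raw_request = "http://localhost:5000/test?url=" + target
raw_params = parse_qs(urlsplit(raw_request).query)
print(raw_params["url"][0])  # https://www.dev-insider.de/index.cfm?pid=15010
print(raw_params["pk"][0])   # 676039

# Encoded: safe="" also escapes "/", so every reserved character is
# percent-encoded and the target survives as a single value.
encoded_request = "http://localhost:5000/test?url=" + quote(target, safe="")
encoded_params = parse_qs(urlsplit(encoded_request).query)
print(encoded_request)
print(encoded_params["url"][0] == target)  # True
```

On the client side, passing the article URL through quote (or letting an HTTP client build the query string from a dict of parameters) should be enough; the Flask view from the question can stay as it is.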