python-3.x python-newspaper

Parse HTML String from MySQL in Newspaper3k


I have a MySQL table full of crawled news article HTML. I would like to extract the article texts with the newspaper3k module, which I have done many times before.

The only difference now is that I am not downloading a URL and parsing the result with Newspaper; instead I pull raw HTML strings from a MySQL DB.

Somehow Newspaper (or Goose) doesn't like the string from the DB: the returned article.text is always ''.

However, when I fetch a URL with requests.get and feed the raw HTML to Newspaper, it works. So I'm guessing that the data from MySQL is formatted or encoded differently, so that Newspaper does not recognize it as HTML?!

When I print data from the DB it looks like:

<!DOCTYPE html>\n<html lang="de">\n<head>\n\n<...

While the html via requests.get looks like:

<!DOCTYPE html>
<html lang="de">
<head>

<meta charset="utf-8">
<!-- 
    This website is powered by TYPO3 - inspiring people to share!
    TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.
    TYPO3 is copyright 1998-2016 of Kasper Skaarhoj. Extensions are copyright of their respective owners.
    Information and contribution at http://typo3.org/
--> ...

Solution

  • What you are getting is the header of a TYPO3 page, possibly the default 404 page. Get the complete HTML to see which page it is.

    If the request should be served by something other than TYPO3, the (.htaccess) configuration for that is missing: by default, TYPO3 answers every request unless a static file exists at the requested URL path.

    Or do you expect the TYPO3 server to answer with something other than a complete page, for example an AJAX response (an HTML snippet or JSON)?
    In that case TYPO3 is probably not configured to omit the page header.

    Since TYPO3 is involved, you might also tag your question with TYPO3.
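Separately from the TYPO3 point, the printed DB value in the question shows `\n` as visible characters, which may mean the line breaks were stored as literal backslash sequences (or that the column comes back as bytes). The following is only a minimal sketch under that assumption: `normalize_db_html` and `extract_text` are hypothetical helper names, `http://localhost` is a placeholder URL, and UTF-8 is assumed as the stored encoding. It uses newspaper3k's `Article.download(input_html=...)` so that nothing is fetched over the network.

```python
def normalize_db_html(value):
    """Return the stored HTML as a str with real line breaks.

    Assumption: the DB value is either bytes (UTF-8) or a str in which
    newlines were saved as the two characters backslash + 'n'.
    """
    # Some drivers return BLOB/BINARY columns as bytes rather than str.
    if isinstance(value, (bytes, bytearray)):
        value = bytes(value).decode("utf-8", errors="replace")

    # Only unescape when the text has no real newline at all; otherwise a
    # literal "\n" is more likely legitimate content (e.g. inline JavaScript).
    if "\\n" in value and "\n" not in value:
        value = (
            value.replace("\\r\\n", "\n")
            .replace("\\n", "\n")
            .replace("\\t", "\t")
        )
    return value


def extract_text(raw_html, url="http://localhost"):
    """Parse an HTML string with newspaper3k without downloading anything."""
    # Imported here so the helper above can be used without newspaper3k.
    from newspaper import Article

    article = Article(url)  # placeholder URL, not requested
    article.download(input_html=normalize_db_html(raw_html))
    article.parse()
    return article.text
```

Printing `type(value)` and `repr(value[:80])` for one DB row is a quick way to check which of these cases, if any, actually applies before relying on the unescaping step.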