I have a MySQL table full of crawled news article HTML. I would like to extract the article texts with the newspaper3k module, which I have done many times before.
The only difference now is that I am not fetching a URL and parsing the result with Newspaper; instead, I pull raw HTML strings from a MySQL DB.
Somehow Newspaper (or Goose) doesn't like the string from the DB, as the returned article.text is always ''.
However, when I use a URL with requests.get and feed the raw HTML to Newspaper, it works. So I'm guessing that the data from MySQL is formatted or encoded differently, so that Newspaper does not recognize it as HTML?
When I print data from the DB it looks like:
<!DOCTYPE html>\n<html lang="de">\n<head>\n\n<...
The HTML fetched via requests.get, by contrast, looks like:
<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="utf-8">
<!--
This website is powered by TYPO3 - inspiring people to share!
TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.
TYPO3 is copyright 1998-2016 of Kasper Skaarhoj. Extensions are copyright of their respective owners.
Information and contribution at http://typo3.org/
--> ...
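The visible `\n` sequences in the printed DB value suggest one way to test the encoding guess: check whether the stored string contains literal backslash-n characters (or is a stringified bytes object) instead of real line breaks. The following is only a minimal sketch under that assumption. `normalize_db_html` is a hypothetical helper name, and the sample string is made up to resemble the output above; the actual cause in the DB may be different.

```python
import ast


def normalize_db_html(value):
    """Hypothetical helper: return HTML as str with real line breaks."""
    # Some drivers return BLOB/VARBINARY columns as bytes, not str.
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode("utf-8", errors="replace")

    # Assumption: the crawler may have stored str(response.content),
    # i.e. text that looks like "b'<!DOCTYPE html>\\n...'".
    if value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            parsed = None
        if isinstance(parsed, bytes):
            return parsed.decode("utf-8", errors="replace")

    # Assumption: literal backslash-n with no real newline means the
    # text was stored in escaped form.
    if "\\n" in value and "\n" not in value:
        value = (
            value.replace("\\r\\n", "\n")
            .replace("\\n", "\n")
            .replace("\\t", "\t")
        )
    return value


# Made-up sample resembling the printed DB value (literal backslash-n).
raw = '<!DOCTYPE html>\\n<html lang="de">\\n<head>\\n\\n</head>\\n</html>'
print(repr(raw[:40]))          # shows doubled backslashes if escaped
print(normalize_db_html(raw))  # should now span several lines
```

Comparing `repr()` of the DB value with `repr()` of the text from requests.get should show whether the two strings really differ; if they do, the normalized string can be handed to Newspaper the same way as the fetched HTML.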
You are getting the header of a TYPO3 page, maybe the default 404 page. (Get the complete HTML to check.)
If your request should be served by anything other than TYPO3, you are missing the (htaccess) configuration. By default, TYPO3 answers every request as long as there is no static file at the requested URL path.
Or do you expect a TYPO3 server to answer with something other than a complete page (AJAX: an HTML snippet or JSON)?
Then you probably do not have the correct configuration in TYPO3 to omit the headers.
As TYPO3 is involved, you might also tag your question with TYPO3.
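To see what kind of page actually came back, it can help to look at the complete HTML and not only its first lines. A rough sketch, assuming the TYPO3 banner comment quoted in the question is present and that the page title hints at an error page (neither is guaranteed); `describe_page` and the sample HTML are hypothetical:

```python
import re

# Banner text copied from the HTML comment shown in the question.
TYPO3_BANNER = "This website is powered by TYPO3"


def describe_page(html):
    """Hypothetical helper: report the TYPO3 banner and the page title."""
    match = re.search(
        r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL
    )
    return {
        "is_typo3": TYPO3_BANNER in html,
        "title": match.group(1).strip() if match else None,
    }


# Made-up sample page for illustration only.
sample = (
    "<!DOCTYPE html>\n<html lang=\"de\">\n<head>\n"
    "<!--\nThis website is powered by TYPO3 - inspiring people to share!\n-->\n"
    "<title> Seite nicht gefunden </title>\n</head>\n<body></body>\n</html>"
)
print(describe_page(sample))
```

When the page is fetched with requests, the response's status_code is also worth checking alongside the title.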