In Pentaho Kettle, I configured the RSS Input step with some URLs. When I run the transformation, it works most of the time, but occasionally it fails with the following error:
2016/06/29 13:10:48 - RSS Input.0 - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : Unexpected Exception : it.sauronsoftware.feed4j.FeedXMLParseException: org.dom4j.DocumentException: Error on line -1 of document : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : it.sauronsoftware.feed4j.FeedXMLParseException: org.dom4j.DocumentException: Error on line -1 of document : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 - at it.sauronsoftware.feed4j.FeedParser.parse(FeedParser.java:53)
2016/06/29 13:10:48 - RSS Input.0 - at org.pentaho.di.trans.steps.rssinput.RssInput.readNextUrl(RssInput.java:168)
2016/06/29 13:10:48 - RSS Input.0 - at org.pentaho.di.trans.steps.rssinput.RssInput.getOneRow(RssInput.java:198)
2016/06/29 13:10:48 - RSS Input.0 - at org.pentaho.di.trans.steps.rssinput.RssInput.processRow(RssInput.java:312)
2016/06/29 13:10:48 - RSS Input.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2016/06/29 13:10:48 - RSS Input.0 - at java.lang.Thread.run(Thread.java:745)
2016/06/29 13:10:48 - RSS Input.0 - Caused by: org.dom4j.DocumentException: Error on line -1 of document : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 - at org.dom4j.io.SAXReader.read(SAXReader.java:482)
2016/06/29 13:10:48 - RSS Input.0 - at org.dom4j.io.SAXReader.read(SAXReader.java:291)
2016/06/29 13:10:48 - RSS Input.0 - at it.sauronsoftware.feed4j.FeedParser.parse(FeedParser.java:37)
2016/06/29 13:10:48 - RSS Input.0 - ... 5 more
I have used the default RSS Input step that comes with kettle, and here is the screenshot:
And the links that I have configured in RSS feed are:
How can I resolve this issue? Even when I run the RSS Input step on just one of the links, the same error shows up occasionally. Is there a problem with this plugin?
If you really need a fix, you can patch the source code yourself.
Get the source of feed4j. It is quite old, so there is only a single version.
Open it.sauronsoftware.feed4j.FeedParser.java in an editor.
It has a single method, parse:
public static Feed parse(URL url){
SAXReader saxReader = new SAXReader();
Document document = saxReader.read(url);
...
Fortunately, SAXReader has several overloaded read methods, and one of them is what you need:
saxReader.read(InputStream in)
Instead of passing the URL to read, write code that fetches the data from the URL using HttpClient (the good news is that it is bundled with Kettle/PDI; to check the version, look at $KETTLE-HOME/lib/commons-httpclient-x.x.jar).
Then wrap the data that HttpClient received from the server in a ByteArrayInputStream and pass it to SAXReader.
Build the library and replace feed4j-1.0.jar with your build.
And you are done.
The code will look something like this:
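// These classes come from Apache HttpClient 4.3+ (org.apache.http.*), not commons-httpclient 3.x
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public static Feed parse(URL url){
SAXReader saxReader = new SAXReader();
CloseableHttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet(url.toString());
CloseableHttpResponse response = client.execute(get);
HttpEntity entity = response.getEntity();
// Read the whole body: a single read() may return fewer bytes, and getContentLength() can be -1
byte[] b = EntityUtils.toByteArray(entity);
InputStream is = new ByteArrayInputStream(b);
Document document = saxReader.read(is);
...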
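If you want to try the "download everything first, then parse from memory" idea without patching feed4j, here is a minimal sketch that uses only the JDK. It is an illustration under assumptions, not the feed4j or Kettle code: the class name BufferedFeedFetch and its methods are hypothetical, the JDK's DOM parser stands in for dom4j's SAXReader, and the empty-body check is my own addition so that an empty response produces a clearer message than "Premature end of file".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of feed4j or Kettle.
class BufferedFeedFetch {

    // Read the stream until EOF; a single read() call may return only part of the data.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Parse an already-buffered body; reject an empty one with a clearer message.
    static Document parseBuffered(byte[] body)
            throws IOException, SAXException, ParserConfigurationException {
        if (body.length == 0) {
            throw new IOException("Feed response body is empty");
        }
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(body));
    }

    // Download the feed completely, then parse it from memory.
    static Document fetch(URL url)
            throws IOException, SAXException, ParserConfigurationException {
        try (InputStream in = url.openStream()) {
            return parseBuffered(readFully(in));
        }
    }
}
```

Calling BufferedFeedFetch.fetch(new URL(...)) with one of your feed URLs should tell you whether the server sometimes sends back an empty body, which is one way the parser can end up with nothing to read.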