web-scraping, html-content-extraction, boilerpipe

How to run boilerpipe's ArticleExtractor and get document stats from it?


There's something I'm not quite understanding about the use of boilerpipe's ArticleExtractor class. Admittedly, I am also very new to Java, so perhaps my basic knowledge of this environment is at fault.

Anyhow, I'm trying to use boilerpipe to extract the main article from some raw HTML source I have collected. The source is stored in a java.lang.String variable (let's call it htmlstr) that holds the raw HTML contents of a webpage.

I know how to run boilerpipe to print the extracted text to the output window as follows:

java.lang.String htmlstr = "<!DOCTYPE.... ****html source**** ... </html>";

java.lang.String article = ArticleExtractor.INSTANCE.getText(htmlstr);
System.out.println(article);

However, I'm not sure how to run boilerpipe (BP) by first obtaining an instance of the ArticleExtractor class and then calling it with the 'TextDocument' input datatype. The TextDocument datatype is itself somehow constructed from BP's 'TextBlock' datatype, and perhaps I am not doing this correctly...

What is the proper way to construct a TextDocument type variable from my htmlstr string variable?

So my problem is in using the process method of BP's ArticleExtractor class, as opposed to calling the ArticleExtractor getText method as in the example above. In other words, I'm not sure how to use the

ArticleExtractor.process(TextDocument doc);

method.

It is my understanding that one must run this ArticleExtractor process method first in order to then use the same "TextDocument doc" variable for getting document stats, using BP's

TextDocumentStatistics(TextDocument doc, boolean contentOnly) 

constructor? I would like to use the stats to determine how good the filtering was estimated to be.

Any code examples someone could help me out with?


Solution

  • Code written in Jython (conversion to Java should be easy)

    1) How to get a TextDocument from an HTML string:

    import org.xml.sax.InputSource as InputSource
    import de.l3s.boilerpipe.sax.HTMLDocument as HTMLDocument
    import de.l3s.boilerpipe.document.TextDocument as TextDocument
    import de.l3s.boilerpipe.sax.BoilerpipeSAXInput as BoilerpipeSAXInput
    import de.l3s.boilerpipe.extractors.ArticleExtractor as ArticleExtractor
    import de.l3s.boilerpipe.estimators.SimpleEstimator as SimpleEstimator
    import de.l3s.boilerpipe.document.TextDocumentStatistics as TextDocumentStatistics
    import de.l3s.boilerpipe.document.TextBlock as TextBlock
    
    htmlDoc = HTMLDocument(rawHtmlString)
    inputSource = htmlDoc.toInputSource() 
    boilerpipeSaxInput = BoilerpipeSAXInput(inputSource)
    textDocument = boilerpipeSaxInput.getTextDocument()
    

    2) How to process the TextDocument using ArticleExtractor (continued from above):

    content = ArticleExtractor.INSTANCE.getText(textDocument)  
    

    3) How to get TextDocumentStatistics (continued from above):

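    content_list = []  # in Java, use an ArrayList<TextBlock> instead of a Python list
    content_list.append(TextBlock(content))  # in Java: arrayList.add(new TextBlock(content))
    content_td = TextDocument(content_list)
    content_stats = TextDocumentStatistics(content_td, True)  # True: count content blocks only
    # Alternatively, getText() in step 2 has already processed textDocument in place,
    # so TextDocumentStatistics(textDocument, True) can be computed on it directly.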
    

    Note: The Javadocs that accompany the boilerpipe 1.2 jar should be useful for future reference.
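
    For anyone who wants the Java version the question asked for, here is a rough sketch of the three steps above. It assumes the boilerpipe 1.2 jar (and its parser dependencies) is on the classpath. The class name ArticleStatsSketch, the helper toTextDocument and the sample HTML are made up for illustration. It calls process() directly and then builds the statistics on the same TextDocument, on the understanding that process() classifies the blocks in place. The "before" statistics are taken prior to processing for that reason.

```java
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.document.TextDocumentStatistics;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not an official boilerpipe example.
public class ArticleStatsSketch {

    // Step 1: parse a raw HTML string into boilerpipe's TextDocument.
    public static TextDocument toTextDocument(String html) throws Exception {
        InputSource source = new HTMLDocument(html).toInputSource();
        return new BoilerpipeSAXInput(source).getTextDocument();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder input; substitute the real page source (htmlstr in the question).
        String htmlstr = "<html><head><title>Example</title></head><body>"
                + "<div><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>"
                + "<p>This paragraph stands in for the main article text of the page.</p>"
                + "</body></html>";

        TextDocument doc = toTextDocument(htmlstr);

        // Statistics over all blocks, taken before processing because
        // process() modifies the document in place.
        TextDocumentStatistics before = new TextDocumentStatistics(doc, false);

        // Step 2: run the extractor on the TextDocument.
        ArticleExtractor.INSTANCE.process(doc);

        // Step 3: statistics over the blocks now marked as content.
        TextDocumentStatistics after = new TextDocumentStatistics(doc, true);

        System.out.println(doc.getContent());
        System.out.println("words before: " + before.getNumWords()
                + ", avg per block: " + before.avgNumWords());
        System.out.println("content words after: " + after.getNumWords()
                + ", avg per block: " + after.avgNumWords());
    }
}
```

    Comparing the two sets of numbers gives a rough idea of how much of the page was kept as article content. The imported but unused SimpleEstimator in step 1 takes such a before/after pair, so it may be worth a look in the Javadocs.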