Tags: java, rss, boilerpipe

Not able to parse New York Times article using boilerpipe


I am trying to extract a news article from a New York Times URL with boilerpipe, but it produces no output. The same code gives output for other newspapers. I want to know whether something is wrong with my code or whether boilerpipe is unable to fetch this page. Also, the output is sometimes not English text but appears as Unicode characters, mainly for the Daily News. I would like to know the reason for that as well.

import java.io.InputStream;
import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

class ExtractData
{
    public static void main(final String[] args) throws Exception 
    {
        URL url;
        url = new URL(
                "http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");

        // NOTE We ignore HTTP-based character encoding in this demo...
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);
        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();

        // You have the choice between different Extractors

        //System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}

Solution

  • nytimes.com has a paywall and returns HTTP 303 for your request. You could try to handle the redirect and cookies yourself. Trying a different User-Agent string might also work.
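
A minimal sketch of that suggestion, using only the JDK: it installs a cookie store, sends a browser-like User-Agent, and follows redirects by hand so that cookies set on one hop are sent on the next. The class name `RedirectAwareFetcher`, the User-Agent value, and the redirect limit are assumptions made for illustration, and there is no guarantee that this gets past the paywall. The charset helper is an extra that may help with garbled output if the cause is an encoding mismatch, which is only a guess.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class RedirectAwareFetcher {
    // Hypothetical values, chosen only for this sketch.
    static final String USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";
    static final int MAX_REDIRECTS = 10;

    // True for the HTTP status codes that point to another location.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Reads the charset from a Content-Type header value; falls back to UTF-8.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Opens the address, following redirects manually while keeping cookies.
    static HttpURLConnection open(String address) throws IOException {
        // HttpURLConnection consults the default CookieHandler on every request.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        URL url = new URL(address);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", USER_AGENT);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header");
            }
            url = new URL(url, location); // resolves relative Location values
        }
        throw new IOException("Too many redirects");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RedirectAwareFetcher <url>");
            return;
        }
        HttpURLConnection conn = open(args[0]);
        System.out.println("status:  " + conn.getResponseCode());
        System.out.println("charset: " + charsetFromContentType(conn.getContentType()));
        conn.disconnect();
    }
}
```

To use this with the code from the question, one option is to wrap `conn.getInputStream()` in an `InputStreamReader` built with the detected charset and pass that reader to `new InputSource(...)` before constructing `BoilerpipeSAXInput`.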