
Android - Jsoup HTML scraping


I have scoured this site for the last two days trying to work out why this does not function properly. I am trying to get the workout of the day from crossfit.com. When I run the program, it displays a white screen for a while and then crashes. Please advise me on what is wrong here!

public class MainActivity extends Activity {
/** Called when the activity is first created. */
@Override
public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);

    TextView tv = (TextView) findViewById(R.id.textView1);

    Document doc = null;
    try {
        doc = Jsoup.connect("http://www.crossfit.com").get();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    Element content = doc.getElementsByClass("blogbody").first();
    System.out.println(content.text());
    tv.setText(content.text());
}
}

Logcat:

08-04 00:14:08.105: E/AndroidRuntime(339): FATAL EXCEPTION: main
08-04 00:14:08.105: E/AndroidRuntime(339): java.lang.OutOfMemoryError
08-04 00:14:08.105: E/AndroidRuntime(339):  at java.util.ArrayList.add(ArrayList.java:123)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.nodes.Node.addChildren(Node.java:411)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.nodes.Element.appendChild(Element.java:267)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilder.insertNode(HtmlTreeBuilder.java:204)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilder.insertEmpty(HtmlTreeBuilder.java:173)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilderState$7.process(HtmlTreeBuilderState.java:443)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilder.process(HtmlTreeBuilder.java:89)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilderState$15.anythingElse(HtmlTreeBuilderState.java:1197)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilderState$15.process(HtmlTreeBuilderState.java:1191)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilder.process(HtmlTreeBuilder.java:84)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:48)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:41)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:37)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.parser.Parser.parseInput(Parser.java:30)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:101)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:469)
08-04 00:14:08.105: E/AndroidRuntime(339):  at org.jsoup.helper.HttpConnection.get(HttpConnection.java:147)
08-04 00:14:08.105: E/AndroidRuntime(339):  at com.example.lookingfor.MainActivity.onCreate(MainActivity.java:36)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1047)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:1611)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:1663)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.app.ActivityThread.access$1500(ActivityThread.java:117)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.app.ActivityThread$H.handleMessage(ActivityThread.java:931)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.os.Handler.dispatchMessage(Handler.java:99)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.os.Looper.loop(Looper.java:123)
08-04 00:14:08.105: E/AndroidRuntime(339):  at android.app.ActivityThread.main(ActivityThread.java:3683)
08-04 00:14:08.105: E/AndroidRuntime(339):  at java.lang.reflect.Method.invokeNative(Native Method)
08-04 00:14:08.105: E/AndroidRuntime(339):  at java.lang.reflect.Method.invoke(Method.java:507)
08-04 00:14:08.105: E/AndroidRuntime(339):  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:839)
08-04 00:14:08.105: E/AndroidRuntime(339):  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:597)
08-04 00:14:08.105: E/AndroidRuntime(339):  at dalvik.system.NativeStart.main(Native Method)

Solution

  • Well, Jsoup is not flawless. While its one-size-fits-all download call seems appealing, you're sometimes better off downloading the stream yourself.

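    // Needs java.net.URL, java.io.InputStream and java.io.ByteArrayOutputStream.
    URL url = new URL("http://www.crossfit.com");
    InputStream is = url.openStream();
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    try {
        byte[] b = new byte[8192];
        int count;
        // Copy the response into memory in 8 KB chunks.
        while ((count = is.read(b)) != -1) {
            os.write(b, 0, count);
        }
    } finally {
        // Release the connection even if the read fails.
        is.close();
    }
    // Parse the buffered HTML; this assumes the page is UTF-8 encoded.
    doc = Jsoup.parse(os.toString("UTF-8"));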
    

    It's a pretty huge page. Pushing about 70k of HTML per request through a DOM-like model is not going to sit well with any device that has a small memory footprint. Loading it yourself could work on most devices; otherwise, look into a streaming solution using TagSoup instead.
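
If memory is the worry, one option is to put a ceiling on how much of the page gets buffered before it is handed to the parser. The following is only a rough sketch of that idea using plain java.io; the class name `BoundedReader`, the method `readUtf8` and the `maxBytes` cap are made up for illustration and are not part of Jsoup or of the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: buffers a stream into a String, but refuses to grow past a cap.
final class BoundedReader {

    /**
     * Reads the whole stream as UTF-8 text and closes it.
     * Throws IOException once more than maxBytes would be buffered.
     * maxBytes is an assumed safety limit; pick a value that suits the device.
     */
    static String readUtf8(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int count;
        try {
            while ((count = in.read(buf)) != -1) {
                // Bail out before the buffer exceeds the cap.
                if (out.size() + count > maxBytes) {
                    throw new IOException("Page larger than " + maxBytes + " bytes");
                }
                out.write(buf, 0, count);
            }
        } finally {
            in.close();
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the network response here.
        InputStream fake = new ByteArrayInputStream(
                "<div class=\"blogbody\">WOD</div>".getBytes("UTF-8"));
        System.out.println(readUtf8(fake, 1024));
    }
}
```

In the activity you would pass `url.openStream()` instead of the in-memory stream and give the returned string to `Jsoup.parse(...)`. The cap only turns an OutOfMemoryError into an IOException you can catch; it does not make the DOM itself any smaller.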