Tags: android, jsoup, web-scraping

How do I extract data from multiple related web pages in Android using Jsoup?


I have been working on an app that displays news headlines and their content from the site http://www.myagdikali.com

I can extract the data from 'myagdikali.com/category/news/national-news/', but that page holds only 10 posts. The rest are on numbered pages (1, 2, 3, ...) such as myagdikali.com/category/news/national-news/page/2.

What I need to know is how to extract the news from every page under /national-news. Is that even possible with Jsoup?

So far, my code to extract data from a single page is:

public View onCreateView(LayoutInflater inflater, ViewGroup container,
                         Bundle savedInstanceState) {
    View rootView = inflater.inflate(R.layout.fragment_all, container, false);
    int i = getArguments().getInt(NEWS);
    String topics = getResources().getStringArray(R.array.topics)[i];

    switch (i) {
        case 0:
            url = "http://myagdikali.com/category/news/national-news";
            new NewsExtractor().execute();

            break;
            .....


[EDIT]
private class NewsExtractor extends AsyncTask<Void, Void, Void> {
    String title;

    @Override
    protected Void doInBackground(Void... params) {
        while (status == OK) {
            currentURL = url + String.valueOf(page);
            try {
                response = Jsoup.connect(currentURL).execute();
                status = response.statusCode();
                if (status == OK) {
                    Document doc = response.parse();
                    Elements urlLists = doc.select("a[rel=bookmark]");
                    for (org.jsoup.nodes.Element urlList : urlLists) {
                        String src = urlList.text();
                        myLinks.add(src);
                    }
                    title = doc.title();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            page++;
        }
        return null;
    }

EDIT: When I extract data from a single page without the loop, it works. After adding the while loop, I get an error stating "No adapter attached".

I load the extracted data into a RecyclerView, and onPostExecute looks like this:

    @Override
    protected void onPostExecute(Void aVoid) {
        layoutManager = new LinearLayoutManager(getActivity());
        recyclerView.setLayoutManager(layoutManager);

        myRecyclerViewAdapter = new MyRecyclerViewAdapter(getActivity(), myLinks);
        recyclerView.setAdapter(myRecyclerViewAdapter);


    }

Solution

  • Since you know the URLs of the pages you need, http://myagdikali.com/category/news/national-news/page/X (where X is the page number, between 2 and 446), you can loop through them. You'll also need to check Jsoup's response to make sure each page exists (446 is only the current page count; I believe it increases over time).
    The code should be something like this:

    final String URL = "http://myagdikali.com/category/news/national-news/page/";
    final int OK = 200;
    String currentURL;
    int page = 2;
    int status = OK;
    Connection.Response response = null;
    Document doc = null;

    while (status == OK) {
        currentURL = URL + String.valueOf(page);  // append the page number to the URL
        response = Jsoup.connect(currentURL)
                .userAgent("Mozilla/5.0")
                // by default execute() throws HttpStatusException on a 404
                // instead of returning the status, so the check below would never see it
                .ignoreHttpErrors(true)
                .execute();  // you may add a timeout etc. here
        status = response.statusCode();
        if (status == OK) {
            doc = response.parse();
            // extract the info you need
        }
        page++;
    }
    

    This is of course not fully working code: you'll have to add try-catch blocks, but the compiler will help you with that. Hope this helps.

    EDIT:
    1. I've edited the code: I had to send a userAgent string in order to get a response from the server.
    2. The code runs on my machine; it prints lots of ????, because I don't have the proper fonts installed.
    3. The error you're getting comes from the Android part and has something to do with your views. You haven't posted that piece of code...
    4. Try adding the userAgent; it might solve it.
    5. Please add the error and the code you're running to the original question by editing it; it's much more readable that way.
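
To make the stop condition of the page loop easier to reason about, here is a minimal sketch of the same idea with the network call hidden behind a small interface, so it can be run without Android or a connection. The names PagedCrawler, PageFetcher, pageUrl and crawl are hypothetical, not part of Jsoup or of the code above. In a real app the fetcher would presumably wrap Jsoup.connect(url).userAgent(...).ignoreHttpErrors(true).execute() and return null when statusCode() is not 200. The sketch assumes page 1 is the category URL itself and later pages live under /page/N, as described in the question.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a page-by-page crawl loop; every name here is hypothetical. */
final class PagedCrawler {

    /** Fetches one page; returns null when the page does not exist (e.g. HTTP 404). */
    interface PageFetcher {
        List<String> fetch(String url) throws IOException;
    }

    /** Page 1 is the category URL itself; later pages live under /page/N. */
    static String pageUrl(String baseUrl, int page) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return page <= 1 ? base : base + "/page/" + page;
    }

    /** Collects headlines until a page is missing, a fetch fails, or maxPages is reached. */
    static List<String> crawl(String baseUrl, int maxPages, PageFetcher fetcher) {
        List<String> headlines = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            try {
                List<String> items = fetcher.fetch(pageUrl(baseUrl, page));
                if (items == null) {
                    break; // no such page: we are past the last one
                }
                headlines.addAll(items);
            } catch (IOException e) {
                break; // stop instead of looping forever on a failing request
            }
        }
        return headlines;
    }

    public static void main(String[] args) {
        // Fake fetcher standing in for Jsoup: only pages 1 and 2 "exist".
        PageFetcher fake = url -> {
            if (url.endsWith("/national-news")) return Arrays.asList("A", "B");
            if (url.endsWith("/page/2")) return Arrays.asList("C");
            return null;
        };
        // prints [A, B, C]
        System.out.println(crawl("http://myagdikali.com/category/news/national-news", 10, fake));
    }
}
```

The two break statements are the point of the sketch: if an exception is only logged inside the loop and the status variable is never updated, the loop has no way to end once the last page is passed. The maxPages cap is an extra safeguard, chosen here arbitrarily.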