Tags: android, xml-parsing, html-parsing, android-parser

How to get and parse the required information/data from an HTTP website?


I have been stuck on this problem for the last two weeks and would appreciate some help. I want to extract some useful data from an HTTP website. The website lists accidents and incidents, along with details about each of them, and I want to use that information in my Android app. I have already asked this question but still cannot solve it. Someone told me I have to get the data as JSON, which I have not done before; if that is the only solution, how do I do it? If there is a simpler way, please suggest that instead. So far I can fetch the entire website content using:

private String DownloadText(String URL) {
    final int BUFFER_SIZE = 2000;
    InputStream in;
    try {
        in = OpenHttpConnection(URL);
    } catch (IOException e) {
        e.printStackTrace();
        return "exception in downloadText";
    }

    // Accumulate the response in a StringBuilder instead of repeated
    // String concatenation, which copies the whole string on every append.
    StringBuilder str = new StringBuilder();
    char[] inputBuffer = new char[BUFFER_SIZE];
    InputStreamReader isr = new InputStreamReader(in);
    try {
        int charRead;
        while ((charRead = isr.read(inputBuffer)) > 0) {
            str.append(inputBuffer, 0, charRead);
        }
    } catch (IOException e) {
        e.printStackTrace();
        return "";
    } finally {
        try {
            isr.close(); // also closes the underlying InputStream
        } catch (IOException ignored) {
        }
    }
    return str.toString();
}

private InputStream OpenHttpConnection(String urlString) throws IOException {
    URL url = new URL(urlString);
    URLConnection conn = url.openConnection();

    if (!(conn instanceof HttpURLConnection))
        throw new IOException("Not an HTTP connection");

    HttpURLConnection httpConn = (HttpURLConnection) conn;
    httpConn.setAllowUserInteraction(false);
    httpConn.setInstanceFollowRedirects(true);
    httpConn.setRequestMethod("GET");
    httpConn.connect();

    int response = httpConn.getResponseCode();
    if (response != HttpURLConnection.HTTP_OK) {
        // Fail loudly on a non-200 status instead of returning null.
        throw new IOException("HTTP response code: " + response);
    }
    return httpConn.getInputStream();
}
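For reference, the read loop in DownloadText can be exercised against an in-memory stream instead of a live connection. This sketch reuses the same buffer-and-append pattern (the 2000-char buffer size, the example page string, and the class/method names here are illustrative, not from the original code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadAllDemo {
    // Same accumulation pattern as DownloadText, but reading from an
    // in-memory stream so it runs without any network access.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[2000];
        try (InputStreamReader isr = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = isr.read(buf)) > 0) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>incident data</body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in)); // prints the page string unchanged
    }
}
```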

But this returns the entire page: the information I need plus all the surrounding HTML/XML markup. I only want the required information.

Another question: is it compulsory to get the website administrator's permission before extracting that data?


Solution

  • What you're looking for is called web scraping (or HTML scraping). Have a look at this SO question to get you started: Options for HTML scraping?
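On Android, a common choice for HTML scraping is a dedicated parser library such as jsoup. As a dependency-free sketch using only the JDK, the idea looks like this: parse the downloaded markup and pick out just the elements you care about. The HTML fragment, tag names, and class name below are hypothetical stand-ins for whatever the real accident page contains (note that a real page is often not well-formed XML, which is exactly why a forgiving parser like jsoup is usually preferred over DocumentBuilder):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ScrapeSketch {
    // Hypothetical fragment standing in for the page returned by
    // DownloadText(...); the real site's markup will differ.
    static final String HTML =
        "<table id=\"incidents\">"
      + "<tr><td>2012-01-05</td><td>Road accident</td></tr>"
      + "<tr><td>2012-02-11</td><td>Fire incident</td></tr>"
      + "</table>";

    // Pull one "date - description" string out of each table row.
    static List<String> extractIncidents(String html) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(html)));
        List<String> out = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("tr");
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            NodeList cells = row.getElementsByTagName("td");
            out.add(cells.item(0).getTextContent() + " - " + cells.item(1).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (String s : extractIncidents(HTML)) {
            System.out.println(s);
        }
    }
}
```

With jsoup the same extraction would be a one-liner along the lines of `Jsoup.parse(html).select("table#incidents tr")`, and it tolerates the broken markup found on most real websites.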