
How to scrape train route data from https://rbs.indianrail.gov.in/ShortPath/ShortPath.jsp


I am trying to get the list of intermediate railway stations from https://rbs.indianrail.gov.in/ShortPath/ShortPath.jsp. Given a source and a destination station, the page displays the intermediate stations in a table. However, it hides some of them behind several buttons, I think to limit the size of the table. Clicking a button pushes the hidden data into the table.

Using jsoup I can get the initial data in the table, but I don't know how to get the hidden data. On a button click, a JavaScript function requests the data with a POST to https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet, passing "route=inter,index=1,distance=goods,PageName=ShortPath" as parameters, and the response is JSON. As the parameters are not relevant to the displayed table, I cannot make a direct request to https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet.

    private void shortestPath(String source, String destination) {
        Document doc;
        try {
            doc = Jsoup.connect(url)
                    .data("srcCode", source.toUpperCase())
                    .data("destCode", destination.toUpperCase())
                    .data("guageType", "S")
                    .data("transhipmentFlag", "false")
                    .data("distance", "goods")
                    .post();
            Element table = doc.select("tbody").get(0);
            Elements rows = table.select("tr");
            stationCodeList = new String[rows.size() - 3];
            jsonPath = new JSONObject();
            for (int row = 3; row < rows.size(); row++) {
                JSONObject jsonObject = new JSONObject();
                Elements cols = rows.get(row).select("td");
                String code = cols.get(1).text();
                String name = cols.get(2).text();
                String cum_dist = cols.get(3).text();
                String inter_dist = cols.get(4).text();
                String gauge = cols.get(5).text();
                String carry_cap = cols.get(6).text();
               
                jsonObject.put("Code", code);
                jsonObject.put("Name", name);
                jsonObject.put("Cumulative Distance", cum_dist);
                jsonObject.put("inter Distance", inter_dist);
                jsonObject.put("Gauge Type", gauge);
                jsonObject.put("Carrying Capacity", carry_cap);
                jsonPath.put(code, jsonObject);
                stationCodeList[row - 3] = code;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        this.destination = new Station(stationCodeList[stationCodeList.length - 1]);
    }

Thank you in advance.


Solution

  • If you take a look at this answer, you'll see how to get the exact same request the browser has made.

    The minimal and valid POST request to the StationXmlServlet, using your example, would look something like this with curl:

    curl --request POST 'https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet' \
      -H 'Content-Type: application/x-www-form-urlencoded' \
      -H 'Cookie: JSESSIONID1=0000ob7e89cT3vUAYkBxF6oyW4w:APP2SERV1' \
      --data-raw 'route=inter&index=1&distance=goods&PageName=ShortPath'
    

    You wrote: "As the parameters are not relevant to the displayed table, I cannot make a direct request to https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet."

    I don't think that's true. The index in the body of the request is the zero-based index of rows in the master table.


    Solution

    It turns out that you simply have to follow the exact same order as you do when you use the page in a web browser. In other words, you have to first load the master table so that the site knows which table you are viewing when you want to query for details. A session cookie keeps track of this state.
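
Since a session cookie carries the state, one alternative to copying the cookie by hand is to let the HTTP client manage cookies itself. The following is a minimal sketch, assuming Java 11 or later; the class name `SessionClient` is hypothetical and not part of the site or of the code in this answer. A client built this way should store the cookie from the landing page and replay it on later requests, so the manual `Cookie` header would not be needed.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.http.HttpClient;

// Hypothetical helper, not part of the original answer.
final class SessionClient {

    // Builds an HttpClient with an in-memory cookie store. Cookies set by a
    // response are kept and sent back on later requests to the same server.
    static HttpClient create() {
        CookieManager cookies =
                new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        return HttpClient.newBuilder()
                .cookieHandler(cookies)
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = create();
        // Confirms that a cookie handler is attached to the client.
        System.out.println(client.cookieHandler().isPresent());
    }
}
```

If you use `SessionClient.create()` for all three requests, the order of the requests still matters: landing page first, then the master table, then the details.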

    First, you open the landing page and get a Cookie:

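    import java.net.URI;
    import java.net.http.*;
    import java.net.http.HttpRequest.BodyPublishers;
    import java.net.http.HttpResponse.BodyHandlers;
    // One client is shared by all three requests.
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest cookieRequest = HttpRequest.newBuilder()
        .uri(URI.create("https://rbs.indianrail.gov.in/ShortPath/ShortPath.jsp"))
        .GET()
        .build();
    HttpResponse<String> cookieResponse =
        client.send(cookieRequest, BodyHandlers.ofString());
    String cookie = cookieResponse.headers().firstValue("Set-Cookie").orElseThrow();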
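
A `Set-Cookie` header value usually carries attributes such as `Path` or `HttpOnly` after the `name=value` pair, while a `Cookie` request header is meant to hold only the pair. Whether this site tolerates the extra attributes is not something I have checked, so here is a small sketch that trims them. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of the original answer.
final class SetCookieParser {

    // Returns the leading "name=value" pair of a Set-Cookie header value and
    // drops everything from the first ';' onwards (Path, HttpOnly, Secure...).
    static String cookiePair(String setCookieValue) {
        int semicolon = setCookieValue.indexOf(';');
        String pair = semicolon < 0
                ? setCookieValue
                : setCookieValue.substring(0, semicolon);
        return pair.trim();
    }

    public static void main(String[] args) {
        // The header value below is made up for illustration.
        System.out.println(cookiePair("JSESSIONID1=0000abc:APP2SERV1; Path=/; HttpOnly"));
    }
}
```

You would then pass `SetCookieParser.cookiePair(cookie)` to `.header("Cookie", ...)` in the following requests.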
    

    Next, you load the master table, given the specified form parameters:

    HttpRequest masterRequest = HttpRequest.newBuilder()
        .uri(URI.create("https://rbs.indianrail.gov.in/ShortPath/ShortPathServlet"))
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Cookie", cookie)
        .POST(BodyPublishers.ofString("srcCode=RGDA&destCode=JSWT&findPath0.x=42&findPath0.y=13&gaugeType=S&distance=goods&PageName=ShortPath"))
        .build();
    HttpResponse<String> masterResponse =
        client.send(masterRequest, BodyHandlers.ofString());
    String masterTableHTML = masterResponse.body();
    // Document masterTablePage = Jsoup.parse(masterTableHTML);
    // ...
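
The form body above is hard-coded for one pair of stations. As a sketch of how it could be built from the `source` and `destination` arguments of your `shortestPath` method, the helper below URL-encodes a map of fields. The field names and the fixed values (`findPath0.x=42` and so on) are copied from the request above; the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
final class FormBody {

    // Encodes the fields as application/x-www-form-urlencoded, keeping the
    // iteration order of the map.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Fields of the master-table request, with the station codes upper-cased
    // as in the question's code. All other values are taken from the answer.
    static Map<String, String> masterTableFields(String source, String destination) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("srcCode", source.toUpperCase(Locale.ROOT));
        fields.put("destCode", destination.toUpperCase(Locale.ROOT));
        fields.put("findPath0.x", "42");
        fields.put("findPath0.y", "13");
        fields.put("gaugeType", "S");
        fields.put("distance", "goods");
        fields.put("PageName", "ShortPath");
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(encode(masterTableFields("rgda", "jswt")));
    }
}
```

The result of `FormBody.encode(...)` can be passed to `BodyPublishers.ofString(...)` in place of the literal string.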
    

    Finally, you can query the details for each row of the master table. In the example below, we query the details of the first row.

    HttpRequest detailsRequest = HttpRequest.newBuilder()
        .uri(URI.create("https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet"))
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Cookie", cookie)
        .POST(BodyPublishers.ofString("route=inter&index=0&distance=goods&PageName=ShortPath"))
        .build();
    HttpResponse<String> detailsResponse =
        client.send(detailsRequest, BodyHandlers.ofString());
    String jsonResponse = detailsResponse.body();
    System.out.println(jsonResponse);
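
To fetch the details for more than the first row, only the `index` value in the body changes. Below is a rough sketch that generates the bodies for a given number of rows, assuming the index really is the zero-based row index described above and that the other parameters stay the same. The class name is hypothetical, and the row count would have to come from the parsed master table.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper, not part of the original answer.
final class DetailRequests {

    // Body of the details request for one zero-based row of the master table.
    static String body(int rowIndex) {
        if (rowIndex < 0) {
            throw new IllegalArgumentException("rowIndex must not be negative");
        }
        return "route=inter&index=" + rowIndex + "&distance=goods&PageName=ShortPath";
    }

    // Bodies for rows 0 .. rowCount - 1, in order.
    static List<String> bodies(int rowCount) {
        return IntStream.range(0, rowCount)
                .mapToObj(DetailRequests::body)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        bodies(3).forEach(System.out::println);
    }
}
```

Each body would be sent with the same `detailsRequest` code as above, reusing the session cookie from the earlier steps.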