Tags: java, comments, excel-2013, crawler4j, jericho-html-parser

How to retrieve all the user comments from a site?


I want to collect all the user comments from this site: http://www.consumercomplaints.in/?search=chevrolet

The problem is that each comment is only partially displayed. To see the complete comment I have to click on the title above it, and this has to be repeated for every comment.

The other problem is that there are many pages of comments.

So I want to store all the complete comments from the site above in an Excel sheet. Is this possible? I am thinking of using crawler4j and Jericho along with Eclipse.

My code for the visit method:

    // Needs: java.io.BufferedWriter, java.io.File, java.io.FileWriter, java.io.IOException,
    //        edu.uci.ics.crawler4j.crawler.Page, edu.uci.ics.crawler4j.parser.HtmlParseData
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String html = htmlParseData.getHtml();

            // Set<WebURL> links = htmlParseData.getOutgoingUrls();
            // String text = htmlParseData.getText();

            // The original path is a directory, so the HTML goes into a file inside it.
            // The file name is a placeholder.
            File outputDir = new File("/DA Project/HTML Source/");
            File outputFile = new File(outputDir, "crawled-pages.html");
            outputDir.mkdirs(); // create the directory if it does not exist yet

            // FileWriter creates the file when it is missing; true = append.
            // try-with-resources closes the writer exactly once, so the HTML is
            // written once and nothing is written to an already closed stream.
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile, true))) {
                writer.write(html);
            } catch (IOException e) {
                System.out.println("IOException : " + e.getMessage());
                e.printStackTrace();
            }

            System.out.println("Html length: " + html.length());
        }
    }

Thanks in advance. Any help would be appreciated.


Solution

  • Yes, it is possible.

    • Start crawling at your search page (http://www.consumercomplaints.in/?search=chevrolet).
    • Override crawler4j's shouldVisit method so that the crawler follows only the links to the individual complaints and to the subsequent result pages.
    • In the visit method, take the HTML content from crawler4j and pass it to Jericho.
    • Filter out the content you want to store and write it to a .csv or .xls file (I would prefer .csv).
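    The link-following step could be sketched roughly like this, using only the JDK so it can be tried without crawler4j. The class name ComplaintLinkFilter and both URL patterns are assumptions made for illustration: I have not checked how the site actually names its result pages or complaint pages, so inspect a few real links and adjust the regular expressions. In a crawler4j crawler you would probably call shouldFollow(url.getURL()) from your shouldVisit override (its exact signature differs between crawler4j versions).

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: decides which links the crawler should follow.
class ComplaintLinkFilter {
    // Taken from the question: the crawl should stay on this site.
    static final String SITE_PREFIX = "http://www.consumercomplaints.in/";

    // Static resources never contain comments, so they are skipped.
    static final Pattern STATIC_FILE =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|ico|pdf|zip)$");

    // ASSUMPTION: further result pages keep the search term in the query string.
    static final Pattern RESULT_PAGE =
            Pattern.compile(".*[?&]search=chevrolet(&.*)?$");

    // ASSUMPTION: made-up pattern for a single complaint page; replace it
    // with whatever the real complaint links look like.
    static final Pattern COMPLAINT_PAGE =
            Pattern.compile(".*-c\\d+\\.html$");

    static boolean shouldFollow(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        if (!href.startsWith(SITE_PREFIX) || STATIC_FILE.matcher(href).matches()) {
            return false;
        }
        return RESULT_PAGE.matcher(href).matches()
                || COMPLAINT_PAGE.matcher(href).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldFollow("http://www.consumercomplaints.in/?search=chevrolet"));
        System.out.println(shouldFollow("http://www.consumercomplaints.in/style.css"));
    }
}
```

    Keeping the decision in a plain static method makes it easy to test the patterns against a handful of URLs before starting a long crawl.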

    Hope this helps.
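    For the last step, here is a minimal sketch of writing the extracted comments as CSV with only the JDK. User comments usually contain commas, quotes and line breaks, so each field is quoted in the usual CSV way (wrap in double quotes, double any embedded quote), which Excel can read. The class name CommentCsv, the column layout (title, comment) and the sample values are assumptions for illustration, not something taken from the site.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: turns extracted comments into CSV rows.
class CommentCsv {
    // Quotes a field only when it contains a comma, a quote or a line break.
    static String escape(String field) {
        String value = field == null ? "" : field;
        boolean mustQuote = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        return mustQuote ? "\"" + value.replace("\"", "\"\"") + "\"" : value;
    }

    // Joins the escaped fields of one row with commas.
    static String toRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(fields[i]));
        }
        return row.toString();
    }

    // Writes all rows to the given file as UTF-8, one CSV record per row.
    static void write(Path file, List<String[]> rows) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            for (String[] row : rows) {
                out.print(toRow(row));
                out.print("\r\n");
            }
        }
    }

    public static void main(String[] args) {
        // Sample values are invented for the demonstration.
        System.out.println(toRow("title", "comment"));
        System.out.println(toRow("Brake issue", "Dealer said \"wait\", nothing happened"));
    }
}
```

    In the crawler, you would collect one String[] per complaint (for example the title and the full text that Jericho extracted) and call write once at the end, or append row by row if the crawl is long.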