
Scraping a web page with Java and downloading a video


I'm trying to scrape this 9gag link.

I tried using Jsoup to get the HTML "video" tag so that I could take its source link and download the video directly.

I tried this code:

    public static void main(String[] args) throws IOException {
        Response response= Jsoup.connect("https://9gag.com/gag/a2ZG6Yd")
                   .ignoreContentType(true)
                   .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")  
                   .referrer("https://www.facebook.com/")
                   .timeout(12000) 
                   .followRedirects(true)
                   .execute();

        Document doc = response.parse();
        System.out.println(doc.getElementsByTag("video"));
    }

but I get nothing.

Then I tried this:

    public static void main(String[] args) throws IOException {
        Response response= Jsoup.connect("https://9gag.com/gag/a2ZG6Yd")
                   .ignoreContentType(true)
                   .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")  
                   .referrer("https://www.facebook.com/")
                   .timeout(12000) 
                   .followRedirects(true)
                   .execute();

        Document doc = response.parse();
        System.out.println(doc.getAllElements());
    }

and I noticed that the HTML does not contain the tag I'm looking for, as if the page is loaded dynamically and the "video" tag has not been loaded yet.

What could I do? Thank you all 😊


Solution

  • Let's reverse the approach. You already know we're looking for a URL like https://img-9gag-fun.9cache.com/photo/a2ZG6Yd_460svvp9.webm (to obtain the URL of the video, you could also right-click it in Chrome and select "Copy video address").

    If you search the page source you will find a2ZG6Yd_460svvp9.webm, but it's stored in JSON inside a <script> element.

    That's not good news for Jsoup, because it parses the HTML but not the JSON inside the script. We can still use a simple regular expression to get this link. The URL is escaped, so we have to remove the backslashes. Then you can use Jsoup to download the file.

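        import java.io.File;
        import java.io.FileOutputStream;
        import java.io.IOException;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("https://9gag.com/gag/a2ZG6Yd").ignoreContentType(true)
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .referrer("https://www.facebook.com/").timeout(12000).followRedirects(true).get();

            String html = doc.toString();

            // The video URL is a JSON string value inside a <script>, stored under the key "vp9Url".
            Pattern p = Pattern.compile("\"vp9Url\":\"([^\"]+?)\"");
            Matcher m = p.matcher(html);
            if (m.find()) {
                String escapedUrl = m.group(1);
                // The slashes are escaped as "\/" in the JSON, so strip every backslash.
                String correctUrl = escapedUrl.replace("\\", "");
                System.out.println(correctUrl);
                downloadFile(correctUrl);
            }
        }

        private static void downloadFile(String url) throws IOException {
            // maxBodySize(0) lifts Jsoup's default cap on the response body, so a large video is not truncated.
            byte[] video = Jsoup.connect(url).ignoreContentType(true).maxBodySize(0).execute().bodyAsBytes();
            // try-with-resources closes the stream even if the write fails.
            try (FileOutputStream out = new FileOutputStream(new File("C:\\file.webm"))) {
                out.write(video);
            }
        }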
    

    Also note that vp9Url is not the only URL there, so another one, for example h265Url or webpUrl, may be more suitable.
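
If you want to try several of those keys in order of preference, the regular expression step can be pulled into a small helper. This is only a sketch using the standard library: the class name MediaUrlExtractor, its two methods, and the sample string in main are made up for illustration, and it assumes each URL appears in the page source as a plain "key":"value" JSON string with slashes escaped as \/, as in the vp9Url case above.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the 9gag page.
class MediaUrlExtractor {

    /** Finds "key":"value" in the raw page text and returns the value with "\/" turned back into "/". */
    static Optional<String> findUrl(String html, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\":\"([^\"]+)\"");
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    /** Returns the URL for the first key in preferredKeys that is present in the page text. */
    static Optional<String> findFirstUrl(String html, List<String> preferredKeys) {
        for (String key : preferredKeys) {
            Optional<String> url = findUrl(html, key);
            if (url.isPresent()) {
                return url;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up fragment that imitates the escaped JSON; the real page contains much more.
        String sample = "<script>{\"vp9Url\":\"https:\\/\\/example.com\\/a_460svvp9.webm\","
                + "\"h265Url\":\"https:\\/\\/example.com\\/a_460svh265.mp4\"}</script>";
        System.out.println(findFirstUrl(sample, List.of("h265Url", "vp9Url")).orElse("not found"));
    }
}
```

In the real program you would pass doc.toString() as the html argument and hand the result to downloadFile. Returning an Optional makes the "key not found" case explicit, which helps if the site changes its JSON.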