
Get all img src with Jsoup


I have HTML containing the following img elements:

<img src="https://lh3.googleusercontent.com/...rw" srcset="https://lh3.googleusercontent.com/...rw 2x" class="T75of DYfLw" width="551" height="310" alt="Screenshot Image">
<img data-src="https://lh3.googleusercontent.com/...w720-h310-rw" ... data-srcset="https://lh3.googleusercontent.com/... w1440-h620-rw 2x" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="551" height="310" alt="Screenshot Image">

I want to get all screenshots, i.e. every img whose alt attribute is "Screenshot Image". The URL I need is the value of srcset in the first case and of data-srcset in the second (2 different attribute names = 2 different cases).

I wrote this code:

List<String> src = htmlDocument.select("img[src]").stream()
                .filter(img -> img.attr("alt").equals("Screenshot Image"))
                .map(element -> element.absUrl("data-srcset").replace("2x", ""))
                //or for 1st case
                .map(element -> element.absUrl("srcset")..
                //
                .collect(Collectors.toList());

But this doesn't give me the value in the first case, where the attribute is srcset rather than data-srcset. Can I get the URLs for both cases in a single pass, without creating a second stream and merging the results into one collection? Maybe a regex or another Jsoup method could help (it seems .absUrl doesn't accept a regex)?

I also don't like the replace call, since a URL could contain 2x as part of itself:

.map(element -> element.absUrl("data-srcset").replace("2x", ""))

But without it I get an incorrect URL, with the 2x descriptor still attached:

https://lh3.googleusercontent.com/Z...=w1440-h620-rw 2x

Is there something better than replace for this?


Solution

  • You could map each element to a list of candidate URLs and then flatMap the lists:

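    List<String> src = htmlDocument.select("img[src]").stream()
                .filter(img -> img.attr("alt").equals("Screenshot Image"))
                .map(element -> {
                    // Collect both candidates; absUrl returns "" when the attribute is missing
                    List<String> url = new ArrayList<>();
                    url.add(element.absUrl("data-srcset").replace("2x", "").trim());
                    url.add(element.absUrl("srcset").replace("2x", "").trim());
                    return url;
                })
                .flatMap(List::stream)
                .filter(url -> !url.isEmpty()) // drop the attribute that was not present
                .collect(Collectors.toList());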
    

    For your last question, assuming your URLs don't contain whitespace, you could use (StringUtils here is presumably the Apache Commons Lang class):

    StringUtils.substringBefore(element.absUrl("data-srcset")," ")
    

    EDIT:

    I had assumed an image could have both srcset and data-srcset. Rereading the question, each image has only one of them, so a simpler approach is:

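        List<String> src = htmlDocument.select("img[src]").stream()
                    .filter(img -> img.attr("alt").equals("Screenshot Image"))
                    // Prefer srcset; fall back to data-srcset when srcset is absent
                    .map(element -> StringUtils.isNotEmpty(element.absUrl("srcset")) ?
                       element.absUrl("srcset") :
                       element.absUrl("data-srcset"))
                    // Keep only the URL, dropping the trailing " 2x" descriptor
                    .map(url -> StringUtils.substringBefore(url, " "))
                    .collect(Collectors.toList());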
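
The two ideas in this answer (prefer srcset, fall back to data-srcset, and cut the value at the first whitespace instead of replacing "2x") can be sketched without Apache Commons as a small helper. This is only a sketch under assumptions: the class name SrcsetUrls and its methods are hypothetical and not part of Jsoup, each attribute is assumed to hold a single candidate as in the question, and the URL itself is assumed to contain no whitespace.

```java
import java.util.Optional;

/** Hypothetical helper (not part of Jsoup) for picking one URL out of srcset-style values. */
final class SrcsetUrls {

    private SrcsetUrls() {}

    /**
     * Returns the text before the first whitespace of a srcset-style value,
     * so "https://host/img=w1440 2x" becomes "https://host/img=w1440".
     * A "2x" that is part of the URL itself is left untouched.
     */
    static String firstUrl(String srcsetValue) {
        String trimmed = srcsetValue.trim();
        for (int i = 0; i < trimmed.length(); i++) {
            if (Character.isWhitespace(trimmed.charAt(i))) {
                return trimmed.substring(0, i);
            }
        }
        return trimmed; // no descriptor present
    }

    /** Prefers srcset, falls back to data-srcset; empty when neither holds a value. */
    static Optional<String> pick(String srcset, String dataSrcset) {
        String value = srcset.trim().isEmpty() ? dataSrcset : srcset;
        String url = firstUrl(value);
        return url.isEmpty() ? Optional.empty() : Optional.of(url);
    }
}
```

In the stream from the question this could be plugged in as `.map(img -> SrcsetUrls.pick(img.attr("srcset"), img.attr("data-srcset")))` followed by `.flatMap(Optional::stream)` (Java 9+), relying on Jsoup's `attr` returning an empty string for a missing attribute. Whether you then need `absUrl`-style resolution depends on whether the pages ever use relative URLs; the ones shown in the question are already absolute.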