python, regex, beautifulsoup, findall

How to find all strings with bs4?


I want to parse a specific page with some images, but the images do not live in one fixed tag. Here are some examples:

<meta name="description" content="This is Text."><meta name="Keywords" content="Weather"><meta property="og:type" content="article"><meta property="og:title" content="Cloud"><meta property="og:description" content="Testing"><meta property="og:url" content="https://weathernews.jp/s/topics/201807/300285/"><meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配"><meta name="twitter:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><link rel="canonical" href="https://weathernews.jp/s/topics/201807/300285/"><link rel="amphtml" href="https://weathernews.jp/s/topics/201807/300285/amp.html"><script async="async" src="https://www.googletagservices.com/tag/js/gpt.js"></script>
<img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">
<img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">

I tried the code below to get all of the images, but it returns no results at all. What can I do?

soup.find_all(string=re.compile(r"(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+"))

Solution

  • I personally think this is one of the rare cases where applying a regular expression to the complete document, without using an HTML parser, is the easiest and a perfectly good way to go. And, since you are only looking for URLs and not matching any HTML tags with the regular expression, the usual objections to parsing HTML with regex do not apply here. (Your original attempt returns nothing because the string argument of find_all() only searches text nodes, never attribute values.)

    In [1]: data = """
       ...: <meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
       ...: <img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">
       ...: <img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">
       ...: """
    
    In [2]: import re
    
    In [3]: pattern = re.compile(r"https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+\?[0-9]+")
    
    In [4]: pattern.findall(data)
    Out[4]: 
    ['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
     'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
     'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
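
    One caveat about the pattern itself: the .+ part is greedy, and in the real page all of the meta tags sit on one long line (see the HTML snippet in the question), so a single match could run from the first URL all the way to the last "?digits" on that line. A slightly stricter pattern that escapes the dots and refuses to cross a closing quote avoids this; a sketch, reusing the data string defined above:

        import re

        # [^"?]+ keeps the match inside a single attribute value, so several URLs
        # on one line come back as separate matches instead of one long string.
        tighter = re.compile(
            r'https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/[^"?]+\?[0-9]+'
        )
        print(tighter.findall(data))  # same three URLs as above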
    

    If you are nevertheless interested in how you would apply a regular expression pattern to multiple attributes in BeautifulSoup, it could look something like this (not pretty, I know):

    In [6]: results = soup.find_all(lambda tag: any(pattern.search(attr) for attr in tag.attrs.values()))
    
    In [7]: [next(attr for attr in tag.attrs.values() if pattern.search(attr)) for tag in results]
    Out[7]: 
    [u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
     u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
     u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
    

    Here we are basically iterating over all attributes of all elements and checking each one for a pattern match. Then, once we have the matching tags, we iterate over them again and pull out the value of the matching attribute. I really don't like that the regex check is applied twice, once when looking for tags and once when extracting the desired attribute of a matched tag; a single-pass alternative is sketched below.
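
    If the double check bothers you, a single pass over the attributes can return the matching values directly. One thing to watch out for: BeautifulSoup stores multi-valued attributes such as class as lists, and pattern.search() can choke on a list, so it is safer to skip non-string values. A Python 3 sketch along those lines, reusing the data string and pattern from the transcripts above:

        from bs4 import BeautifulSoup

        soup = BeautifulSoup(data, "html.parser")

        matches = [
            value
            for tag in soup.find_all(True)        # every tag in the document
            for value in tag.attrs.values()
            if isinstance(value, str)             # skip list values such as class=["lazy"]
            and pattern.search(value)
        ]
        print(matches)  # the three image URLs, in document order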


    lxml.html and its XPath powers allow working with attributes directly, but lxml supports XPath 1.0, which has no regular expression support. You can do something like:

    In [10]: from lxml.html import fromstring
    
    In [11]: root = fromstring(data)
    
    In [12]: root.xpath('.//@*[contains(., "smtgvs.weathernews.jp") and contains(., "?")]') 
    Out[12]: 
    ['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
     'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
     'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518'] 
    

    which is not 100% equivalent to what you did and could produce false positives, but you can take it further and add more "substring in a string" checks if needed; one way to tighten it is sketched below.
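
    For example, anchoring the check on the image path prefix narrows things down while staying inside XPath 1.0; a sketch, reusing the root object from above:

        # starts-with() is available in XPath 1.0, unlike regular expressions.
        urls = root.xpath(
            './/@*[starts-with(., "https://smtgvs.weathernews.jp/s/topics/img/")'
            ' and contains(., "?")]'
        )
        print(urls)  # the "?"-less dummy.png placeholder is filtered out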

    Or, you can grab all the attributes of all elements and filter using the regex you already have:

    In [14]: [attr for attr in root.xpath("//@*") if pattern.search(attr)]
    Out[14]: 
    ['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
     'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
     'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
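
    Putting it together for the page from the question, here is a minimal end-to-end sketch. It assumes the requests library is available and that the page is still reachable; the URL is taken from the link rel="canonical" tag in the question's HTML:

        import re
        import requests

        URL = "https://weathernews.jp/s/topics/201807/300285/"
        pattern = re.compile(
            r'https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/[^"?]+\?[0-9]+'
        )

        html = requests.get(URL).text
        # dict.fromkeys() drops duplicates (og:image and twitter:image point at the
        # same file) while keeping the original order.
        image_urls = list(dict.fromkeys(pattern.findall(html)))
        print(image_urls)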