python, regex, list, text-parsing

How to iterate through a list and extract text between quotation marks using Python 2.7.10


I'm trying to iterate through a long list (let's call it url_list) where each item looks like:

<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>, <a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>, <a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>, <a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>,

and so on. I'd like to iterate through the list, keep only the text between the first two quotation marks, and throw away the rest, i.e.:

https://www.example.com/5th-february-2018/, https://www.example.com/4th-february-2018/, https://www.example.com/3rd-february-2018/, https://www.example.com/2nd-february-2018/,

So essentially I am trying to return a nice clean list of URLs. I'm not having much luck iterating through the list and splitting on the quotation marks. Is there a better way to do this? Is there a way to throw away everything from the itemprop= string onward?


Solution

  • Using Regex:

    import re
    
    url_list = ['<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>', '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>']
    for i in url_list:
        # Capture everything between the quotes of the href attribute,
        # so the trailing slash of the URL is kept.
        print(re.search(r'href="(?P<url>[^"]+)"', i).group("url"))
    

    Output:

    https://www.example.com/5th-february-2018/
    https://www.example.com/4th-february-2018/
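
  • Using str.split (a sketch):

    Since the question asks for the text between the first two quotation marks, splitting on the double quote may be enough here. This is only a minimal sketch: it assumes every item in url_list is a plain string, and the helper name first_quoted and the result name clean_urls are made up for illustration. Items without a complete pair of quotes yield None rather than raising an error.

```python
url_list = [
    '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>',
    '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>',
]


def first_quoted(item):
    """Return the text between the first two double quotes, or None."""
    # Split at most twice: text before the first quote, the quoted text,
    # and everything after the second quote (which is discarded).
    parts = item.split('"', 2)
    return parts[1] if len(parts) == 3 else None


# Build the clean list of URLs in one pass.
clean_urls = [first_quoted(item) for item in url_list]
print(clean_urls)
```

    This only works as long as the href attribute is the first quoted value in each item, as in the sample data. If the attribute order can vary, matching on href=" with a regex is the safer choice.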