python pandas text-extraction

Extract strings from HTML tags with pandas


How do I extract the link text from the tags below using str.extract, a regex, or any other efficient method in pandas?

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>

I am using:
.str.extract('(>[A-Za-z])<')

I want this output:
Twitter for iPhone
Twitter Web Client
Vine - Make a Scene
TweetDeck


Solution

  • This might help:

    import pandas as pd
    lst = [
        ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'],
        ['<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'],
        ['<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'],
        ['<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>']
    ]
    
    df = pd.DataFrame(lst, columns=['url'])
    # Lazily capture the text between the first '>' and the next '<'
    df['text'] = df['url'].str.extract(r'>(.*?)<')
    print(df)
    

    Output

                                                     url                 text
    0  <a href="http://twitter.com/download/iphone" r...   Twitter for iPhone
    1  <a href="http://twitter.com" rel="nofollow">Tw...   Twitter Web Client
    2  <a href="http://vine.co" rel="nofollow">Vine -...  Vine - Make a Scene
    3  <a href="https://about.twitter.com/products/tw...            TweetDeck
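
As for why the pattern in the question finds nothing: `(>[A-Za-z])<` asks for `>`, then exactly one letter, then `<` straight away, and it also puts the `>` inside the capture group. None of the sample strings has a single letter sitting between `>` and `<`, so every row comes back as NaN. Here is a minimal sketch comparing that attempt with a negated character class, `>([^<]+)<`, which behaves like the lazy pattern above for these rows. The names `s`, `attempt` and `text` are made up for illustration, and `expand=False` is only there so that a Series is returned instead of a one-column DataFrame.

```python
import pandas as pd

# Two of the sample rows from the question, held in a Series
s = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Pattern from the question: requires exactly one letter between '>' and '<',
# so it never matches these rows and yields NaN
attempt = s.str.extract(r'(>[A-Za-z])<', expand=False)

# Negated character class: capture one or more characters that are not '<',
# starting right after a '>' and stopping at the next '<'
text = s.str.extract(r'>([^<]+)<', expand=False)

print(attempt.isna().all())
print(text.tolist())
```

This only covers flat `<a>` tags like the ones in the question. If the cells can contain nested or malformed HTML, a regex is likely to break and an HTML parser would be the safer route.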