Tags: python, html-content-extraction

Using Beautiful Soup Python module to replace tags with plain text


I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before, and they were all pointed to Beautiful Soup; that's how I got started with it.

I was able to get most of the content successfully, but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy: if a node has more than x characters, it is content.) Let's take the HTML code below as an example:

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda(x): len(x) > 20)

When I use the above code to get at the long text, it breaks at the tags (the identified text starts from 'and hopefully...'). So I tried to replace the tag with plain text as follows:

anchors = soup.findAll('a')

for a in anchors:
  a.replaceWith('plain text')

The above does not work because Beautiful Soup inserts the string as a NavigableString, and that causes the same problem when I use findAll with len(x) > 20. I could use regular expressions to parse the HTML as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.
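
To make the problem concrete, here is a stripped-down version of what I am doing (Python 2 with the BeautifulSoup 3 module, using the toy markup from above); both print statements give me the same two long fragments:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

# The long text comes back as two separate NavigableStrings,
# split at the <a> tag:
print soup.findAll(text=lambda(x): len(x) > 20)

# Replacing the tag just inserts another NavigableString, so the
# surrounding pieces are still not merged and the split remains:
for a in soup.findAll('a'):
    a.replaceWith('plain text')
print soup.findAll(text=lambda(x): len(x) > 20)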

So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup. If not, what will be best way to do so?

Thanks for your suggestions!

Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website and ran into issues that puzzle me.

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the above code, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
    a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

When I look at the HTML code, 'Stay up to date...' does not have any previous sibling (I did not know how previousSibling worked until I saw Alex's code; based on my testing, it looks like it refers to the text before the tag). So, if there is no previous sibling, I am surprised that the code is not going through the if branch for a.previousSibling is None and a.nextSibling is None.

Could you please let me know what I am doing wrong?

-ecognium


Solution

  • An approach that works for your specific example is:

    from BeautifulSoup import BeautifulSoup
    
    ht = '''
    <div id="abc">
        some long text goes <a href="/"> here </a> and hopefully it 
        will get picked up by the parser as content
    </div>
    '''
    soup = BeautifulSoup(ht)
    
    anchors = soup.findAll('a')
    for a in anchors:
      a.previousSibling.replaceWith(a.previousSibling + a.string)
    
    results = soup.findAll(text=lambda(x): len(x) > 20)
    
    print results
    

    which emits

    $ python bs.py
    [u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']
    

    Of course, you'll probably need to take a bit more care, i.e., what if there's no a.string, or if a.previousSibling is None -- you'll need suitable if statements to take care of such corner cases (a sketch of such guards follows below). But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it's a string -- I'm not sure how that plays with your heuristic of len(x) > 20, but say, for example, that you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you'd want to pick up the lot as a "23-character string"? I can't tell, because I don't understand the motivation for your heuristic.)

    I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc...? I guess this, too, depends on what the actual idea behind your heuristics is!
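
    As for stripping other inline tags wholesale, a rough helper along the lines below (purely a sketch -- the tag list and the name flatten_inline_tags are made up for illustration) could flatten them all to their contained text first; the resulting pieces are still separate strings afterwards, so the sibling-merging above remains necessary:

    from BeautifulSoup import BeautifulSoup

    INLINE_TAGS = ['a', 'b', 'strong', 'em']

    def flatten_inline_tags(soup, tag_names=INLINE_TAGS):
        # Replace each listed tag with whatever plain text it contains;
        # findAll accepts a list of tag names, and findAll(text=True)
        # collects every text node inside the tag.
        for tag in soup.findAll(tag_names):
            tag.replaceWith(u''.join(tag.findAll(text=True)))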
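
    And coming back to the corner cases mentioned a bit earlier (no a.string, a missing sibling, or a sibling that is a Tag rather than text -- the latter is what the TypeError in the update above is complaining about), one minimal sketch of those guards could look like this (still BeautifulSoup 3; merge_anchor_text is just an illustrative name):

    from BeautifulSoup import BeautifulSoup, NavigableString

    def merge_anchor_text(soup):
        # Fold each <a> tag's text into the neighbouring text nodes,
        # skipping siblings that are Tags rather than text (adding a Tag
        # to a NavigableString is what raises the TypeError).
        for a in soup.findAll('a'):
            text = a.string if a.string is not None else u''
            prev, nxt = a.previousSibling, a.nextSibling
            if isinstance(prev, NavigableString):
                text = prev + text
                prev.extract()
            if isinstance(nxt, NavigableString):
                text = text + nxt
                nxt.extract()
            a.replaceWith(text)

    After this runs, the plain text around each anchor is folded into a single NavigableString, so a len(x) > 20 style filter should see it as one node.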