Tags: performance, beautifulsoup, findall

BS4: How to reduce find_all to a minimum (ignoring instead of extracting)


I need to skip Comment and Doctype nodes in later operations, because I will replace some characters, and after that replacement I can no longer tell comments and the doctype apart from ordinary text.

Minimal Example

#!/usr/bin/env python3
import re
from bs4 import BeautifulSoup, Comment, Doctype


def is_toremove(element):
    return isinstance(element, Comment) or isinstance(element, Doctype)


def test1():
    html = \
    '''
    <!DOCTYPE html>
    word1 word2 word3 word4
    <!-- A comment -->
    '''
    soup = BeautifulSoup(html, features="html.parser")
    to_remove = soup.find_all(text=is_toremove)
    for element in to_remove:
        element.extract()

    # some operations needing soup.findAll
    for txt in soup.findAll(text=True):
        # some replace computations
        pass
    return soup
print(test1())

The intended result would be "word1 word2 word3 word4" replaced by the replace computations. It works, but I don't think it is very efficient. I thought about doing something like

for txt in soup.findAll(text=not is_toremove()):

to only work with the non removed parts.

So my questions are:

  1. Is there some internal magic that lets you call findAll twice without it being inefficient, or
  2. how do I combine both steps into a single find_all call?

I also tried to go for the parent tag:

if not isinstance(txt, Doctype):

or

if txt.parent.name != "[document]":

for example. This didn't change anything in my main program.


Solution

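  • As stated in the comments, if you want only plain NavigableString nodes, you can filter on the exact type. Comment and Doctype are subclasses of NavigableString, so an isinstance check would still match them, while type(t) is NavigableString does not: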

    from bs4 import BeautifulSoup, NavigableString
    
    
    html = '''
    <!DOCTYPE html>
    word1 word2 word3 word4
    <!-- A comment -->
    '''
    
    def is_string_only(t):
        return type(t) is NavigableString
    
    soup = BeautifulSoup(html, 'lxml')
    
    for visible_string in soup.find_all(text=is_string_only):
        print(visible_string)
    

    Prints:

    word1 word2 word3 word4
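
To connect this back to the question, here is a minimal sketch of doing the replacement in one find_all pass, without extracting anything first. The names is_plain_string, replace_plain_text and transform are made up for illustration, and str.upper only stands in for the real "replace computations". It assumes Beautiful Soup 4.4.0 or newer, where the keyword is called string (older versions call it text).

```python
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString

HTML = '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''


def is_plain_string(node):
    # Exact type check: Comment and Doctype subclass NavigableString,
    # so isinstance(node, NavigableString) would match them as well.
    return type(node) is NavigableString


def replace_plain_text(html, transform):
    # Hypothetical helper: apply `transform` to every plain text node only.
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes inside the loop is safe.
    for node in soup.find_all(string=is_plain_string):
        node.replace_with(NavigableString(transform(str(node))))
    return soup


# str.upper is only a placeholder for the real replacement logic.
result = replace_plain_text(HTML, str.upper)
print(result)
```

The comment and the doctype stay in the tree untouched, so there is only one traversal and no extract() calls. Whether that is measurably faster on a large document is something you would have to time yourself.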