I need to skip Comment and Doctype nodes in later operations, because I will replace some characters and afterwards I can no longer tell comments and the doctype apart from ordinary text.
Minimal Example
#!/usr/bin/env python3
import re
from bs4 import BeautifulSoup, Comment, Doctype
def is_toremove(element):
    return isinstance(element, Comment) or isinstance(element, Doctype)

def test1():
    html = \
'''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''
    soup = BeautifulSoup(html, features="html.parser")
    to_remove = soup.find_all(text=is_toremove)
    for element in to_remove:
        element.extract()
    # some operations needing soup.findAll
    for txt in soup.findAll(text=True):
        # some replace computations
        pass
    return soup

print(test1())
The intended result is that only "word1 word2 word3 word4" goes through the replace computations. This works, but I don't think it is very efficient. I thought about doing something like
for txt in soup.findAll(text=not is_toremove()):
to work only with the parts that were not removed.
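A minimal sketch of that idea, assuming find_all accepts a callable for its string argument (the newer name for text) and calls it with each string node. Passing `not is_toremove()` cannot work, because it calls the function with no argument and hands find_all a plain boolean; wrapping the negation in a function does. The helper name is_tokeep is hypothetical.

```python
from bs4 import BeautifulSoup, Comment, Doctype

def is_toremove(element):
    return isinstance(element, (Comment, Doctype))

def is_tokeep(element):
    # Hypothetical helper: the negation of is_toremove, wrapped in a
    # function so that find_all can call it once per string node.
    return not is_toremove(element)

html = '''<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''

soup = BeautifulSoup(html, features="html.parser")

# Only strings that are neither Comment nor Doctype are returned,
# so nothing has to be extracted from the tree beforehand.
kept = [str(s) for s in soup.find_all(string=is_tokeep)]
print(kept)
```

With this filter the comment and doctype stay in the tree and are simply skipped by the loop.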
So my questions are:
I also tried to go for the parent tag:
if not isinstance(txt, Doctype):
or
if txt.parent.name != "[document]":
for example. This didn't change anything in my main program.
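As stated in the comments, if you want to get only plain NavigableString objects, you can do this (Comment and Doctype are subclasses of NavigableString, so the check compares the exact type instead of using isinstance):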
from bs4 import BeautifulSoup, NavigableString
html = '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''
def is_string_only(t):
    return type(t) is NavigableString
soup = BeautifulSoup(html, 'lxml')
for visible_string in soup.find_all(text=is_string_only):
    print(visible_string)
Prints:
word1 word2 word3 word4
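The same filter can also drive the replacement step itself. The following is only a sketch under assumptions: it uses html.parser instead of lxml to avoid the extra dependency, the string argument instead of text, and upper() as a stand-in for the real replace computations, which the question does not show.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''

def is_string_only(t):
    # Exact type check: Comment and Doctype subclass NavigableString,
    # so isinstance() would let them through.
    return type(t) is NavigableString

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so the tree can be modified while looping.
for node in soup.find_all(string=is_string_only):
    # upper() is a placeholder for the actual replace computations.
    node.replace_with(node.upper())

result = str(soup)
print(result)
```

The comment and doctype are never visited, so they keep their original text and nothing has to be removed from the soup first.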