I'm trying to identify DOM elements by class name, but I'm not able to use pattern.web as described in the docs. The code below is code I've used before, so it did work at some point.
from pattern.web import DOM
html = """<html><head><title>pattern.web | CLiPS</title></head>
<body>
<div class="class1 class2 class3">
<form action="/pages/pattern-web" accept-charset="UTF-8" method="post" id="search-block-form">
<div>
<label for="edit-search-block-form-1">Search this site: </label>
</div>
</form>
</div>
</body></html>"""
dom = DOM(html)
print "Search Results by Method:"
print 'tag[attr="value"] Notation Results:'
print dom('div[class="class1 class2 class3"]')
print
print 'tag.class Notation Results:'
print dom('div.class1')
print
print 'By class, no tag results:'
print dom.by_class('class1')
print
print 'Looping through all divs and printing matching results:'
for i in dom('div'):
    if 'class' in i.attrs and i.attrs['class'] == 'class1 class2 class3':
        print i.attrs
Note that the Element and DOM functions are interchangeable and give the same results. The result is this:
Search Results by Method:
tag[attr="value"] Notation Results:
[]
tag.class Notation Results:
[]
By class, no tag results:
[Element(tag='div')]
Looping through all divs and printing matching results:
{u'class': u'class1 class2 class3'}
As you can see, looking it up using the tag.class notation and the tag[attr="value"] notation both give empty results, but by_class returns one result. Clearly elements with those attributes exist. How do I search for all the divs that have all 3 classes?
In the past, I've been able to search using dom('div.class1.class2.class3') to identify a div with all 3 classes. Not only does this no longer work, it also raises a unicode error (it appears that the second period triggers it): TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'
Question: In the past, I've been able to search using dom('div.class1.class2.class3') to identify a div with all 3 classes.
Reading the source at github.com/clips/pattern/blob/master/pattern/web, I found that pattern.web is only a wrapper around Beautiful Soup:
# Beautiful Soup is wrapped in DOM, Element and Text classes, resembling the Javascript DOM.
# Beautiful Soup can also be used directly
It's a known issue, see SO: Beautiful soup find_all doesn't find CSS selector with multiple classes. The workaround is to use .select(...) instead of .find_all(...), but I didn't find .select(...) in pattern.web.
For example:
from bs4 import BeautifulSoup
html = """<html><head><title>pattern.web | CLiPS</title></head>
<body>
<div class="class1 class4">
<form action="/pages/pattern-web" accept-charset="UTF-8" method="post" id="search-block-form">
<div class="class1 class2 class3">
<label for="edit-search-block-form-1">Search this site: </label>
</div>
</form>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.select('div.class1.class2')
print("{}".format(div))
Output:
[<div class="class1 class2 class3"> <label for="edit-search-block-form-1">Search this site: </label> </div>]
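To make the difference between the two lookups concrete, here is a minimal sketch that uses Beautiful Soup directly, not pattern.web. It assumes bs4 with the built-in html.parser; the shortened HTML and the helper has_all_classes are made up for illustration and are not part of either library.

```python
from bs4 import BeautifulSoup

html = """
<div class="class1 class4">
  <div class="class3 class1 class2">reordered</div>
  <div class="class1 class2 class3">exact</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: matches every div that carries all three classes, in any order.
by_select = soup.select("div.class1.class2.class3")

# class_ with a multi-class string is compared against the exact attribute
# value, so the div with the reordered classes is not found.
by_exact_string = soup.find_all("div", class_="class1 class2 class3")

# Hypothetical helper (not part of bs4): subset test on the tag's class list.
wanted = {"class1", "class2", "class3"}

def has_all_classes(tag):
    return tag.name == "div" and wanted <= set(tag.get("class", []))

by_function = soup.find_all(has_all_classes)

print(len(by_select), len(by_exact_string), len(by_function))
```

Passing a function to find_all is one possible fallback when a selector API is not available; it gives the same order-independent match as the CSS selector.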
Question: it's also giving me unicode errors (it appears that the second period causes a unicode error): TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'
It's unknown whether this TypeError comes from pattern.web or from Beautiful Soup. According to SO: descriptor-join-requires-a-unicode-object-but-received-a-str, it's a standard Python message.
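This kind of message appears when a method is taken from the type itself (here str.lower) and called with an object of a different type. A minimal sketch of the Python 3 analogue follows, where bytes plays the role that unicode played on Python 2. The helper name call_str_lower is made up for illustration, the exact wording of the message differs between Python versions, and the sketch does not show which library made the failing call.

```python
def call_str_lower(value):
    """Return str.lower(value), or the TypeError message if the call fails."""
    try:
        # str.lower is looked up on the type, so it only accepts real str objects.
        return str.lower(value)
    except TypeError as exc:
        return "TypeError: {}".format(exc)

print(call_str_lower("ABC"))   # a str instance is accepted
print(call_str_lower(b"ABC"))  # bytes is not str, so the descriptor rejects it
```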
Using pattern.web from GitHub, the results are as expected:
from pattern.web import Element
elements = Element(html)
print("Search Results by Method:")
print('tag[attr="value"] Notation\tResults:{}'
      .format(elements('div[class="class1 class2 class3"]')))
print('tag.class Notation \t\t\tResults:{}'
      .format(elements('div.class1.class2.class3')))
print('By class, no tag \t\t\tResults:{}'
      .format(elements.by_class('class1 class2 class3')))
print('Looping through all divs and printing matching results:')
for i in elements('div'):
    if 'class' in i.attrs:
        if " ".join(i.attrs['class']) == 'class1 class2 class3':
            print("\tmatch:{}".format(i.attrs))
Output:
Search Results by Method:
tag[attr="value"] Notation Results:{'class': ['class1', 'class2', 'class3']}
tag.class Notation Results:{'class': ['class1', 'class2', 'class3']}
By class, no tag Results:{'class': ['class1', 'class2', 'class3']}
Looping through all divs and printing matching results:
match:{'class': ['class1', 'class2', 'class3']}
Tested with Python:3.5.3 - pattern.web:3.6 - bs4:4.5.3