
Pattern web unable to locate elements by class names


I'm trying to identify DOM elements by class name, but pattern.web no longer behaves as described in the docs. The code below is code I've used before, so it did work at some point.

from pattern.web import DOM

html = """<html><head><title>pattern.web | CLiPS</title></head>
<body>
  <div class="class1 class2 class3">
    <form action="/pages/pattern-web"  accept-charset="UTF-8" method="post" id="search-block-form">
      <div>
        <label for="edit-search-block-form-1">Search this site: </label>
      </div>
    </form>
  </div>
</body></html>"""

dom = DOM(html)
print "Search Results by Method:"
print 'tag[attr="value"] Notation Results:'
print dom('div[class="class1 class2 class3"]')
print 
print 'tag.class Notation Results:'
print dom('div.class1')
print
print 'By class, no tag results:'
print dom.by_class('class1')
print 
print 'Looping through all divs and printing matching results:'
for i in dom('div'):
    if 'class' in i.attrs and i.attrs['class'] == 'class1 class2 class3':
        print i.attrs

Note that the Element and DOM functions are interchangeable and give the same results. The output is:

Search Results by Method:
tag[attr="value"] Notation Results:
[]

tag.class Notation Results:
[]

By class, no tag results:
[Element(tag='div')]

Looping through all divs and printing matching results:
{u'class': u'class1 class2 class3'}

As you can see, looking it up using the tag.class notation and the tag[attr="value"] notation both give empty results, but by_class returns one result. Clearly elements with those attributes exist. How do I search for all the divs that have all 3 classes?

In the past, I've been able to search using dom('div.class1.class2.class3') to identify a div with all 3 classes. Now that not only returns nothing, it also raises an error (apparently triggered by the second period): TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'


Solution

  • Question: In the past, I've been able to search using dom('div.class1.class2.class3') to identify a div with all 3 classes.


    Reading the source at github.com/clips/pattern/blob/master/pattern/web
    shows that pattern.web is only a wrapper around Beautiful Soup:

    # Beautiful Soup is wrapped in DOM, Element and Text classes, resembling the Javascript DOM.
    # Beautiful Soup can also be used directly


    This is a known issue, see SO: Beautiful soup find_all doesn't find CSS selector with multiple classes

    The workaround is to use .select(...) instead of .find_all(...).
    I didn't find .select(...) in pattern.web, so the example uses Beautiful Soup directly.

    For example:

    from bs4 import BeautifulSoup
    
    html = """<html><head><title>pattern.web | CLiPS</title></head>
      <body>
        <div class="class1 class4">
          <form action="/pages/pattern-web"  accept-charset="UTF-8" method="post" id="search-block-form">
            <div class="class1 class2 class3">
              <label for="edit-search-block-form-1">Search this site: </label>
            </div>
          </form>
        </div>
    </body></html>
    """
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.select('div.class1.class2')
    print("{}".format(div))
    

    Output:

    [<div class="class1 class2 class3">
    <label for="edit-search-block-form-1">Search this site: </label>
    </div>]
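
To make the difference concrete, here is a minimal sketch, assuming Beautiful Soup 4 is installed. The variable names are illustrative and the markup is a trimmed copy of the one above. A multi-class string passed to find_all(class_=...) is compared against the whole attribute value, while select() and a set-based filter function match on class membership.

```python
from bs4 import BeautifulSoup

html = """<div class="class1 class4">
  <div class="class1 class2 class3"><label>Search this site: </label></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the whole attribute value,
# so "class1 class2" does not match "class1 class2 class3".
exact = soup.find_all("div", class_="class1 class2")

# The CSS selector requires every listed class, in any order.
selected = soup.select("div.class1.class2")

# Comparable filter for find_all: the wanted classes must be a subset
# of the classes present on the tag.
wanted = {"class1", "class2"}
filtered = soup.find_all(
    lambda tag: tag.name == "div" and wanted <= set(tag.get("class", []))
)

print(len(exact), len(selected), len(filtered))
```

The filter-function form may be useful where only find_all is reachable through a wrapper, though whether pattern.web exposes it is not something this sketch checks.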
    

    Question: it's also giving me unicode errors (it appears that the second period causes a unicode error) :

    TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'
    

    It's unclear whether this TypeError comes from pattern.web or from Beautiful Soup.
    According to SO: descriptor-join-requires-a-unicode-object-but-received-a-str, it's a standard Python message.
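
The message can be reproduced without either library. This is a minimal sketch for Python 3, where bytes plays the role that the str/unicode mismatch plays in Python 2: calling the unbound method str.lower with an object of another type raises the same kind of TypeError. The exact wording differs between Python versions, so the example only prints whatever the interpreter reports.

```python
# Calling an unbound method descriptor with the wrong receiver type.
try:
    str.lower(b"CLASS1")  # str.lower expects a str, not bytes
    message = None
except TypeError as exc:
    message = str(exc)

print(message)

# Converting to the expected type first avoids the error.
lowered = str.lower(b"CLASS1".decode("ascii"))
print(lowered)
```

This suggests the error in the question arises wherever a library calls str.lower(...) directly on a value that turned out to be unicode, but it does not show which library does so.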


    Using pattern.web from GitHub, the results are as expected:

    from pattern.web import Element
    
    elements = Element(html)
    print("Search Results by Method:")
    print('tag[attr="value"] Notation\tResults:{}'
        .format(elements('div[class="class1 class2 class3"]')))
    
    print('tag.class Notation \t\t\tResults:{}'
        .format(elements('div.class1.class2.class3')))
    
    print('By class, no tag \t\t\tResults:{}'
        .format(elements.by_class('class1 class2 class3')))
    
    print('Looping through all divs and printing matching results:')
    for i in elements('div'):
        if 'class' in i.attrs:
            if " ".join(i.attrs['class']) == 'class1 class2 class3':
                print("\tmatch:{}".format(i.attrs))
    

    Output:

    Search Results by Method:
    tag[attr="value"] Notation  Results:{'class': ['class1', 'class2', 'class3']}
    tag.class Notation          Results:{'class': ['class1', 'class2', 'class3']}
    By class, no tag            Results:{'class': ['class1', 'class2', 'class3']}
    Looping through all divs and printing matching results:
        match:{'class': ['class1', 'class2', 'class3']}
    

    Tested with Python 3.5.3, pattern.web 3.6, bs4 4.5.3
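
The two outputs in this thread show attrs['class'] as a single string in the question and as a list in the answer. Below is a hedged sketch of a helper that accepts either form and ignores class order. has_all_classes is a hypothetical name, not part of pattern.web or Beautiful Soup, and the code is written for Python 3 and has not been run against pattern itself.

```python
def has_all_classes(class_attr, required):
    """Return True if every class in `required` appears in `class_attr`.

    `class_attr` may be a space-separated string or a list of names.
    """
    if class_attr is None:
        return False
    if isinstance(class_attr, str):
        present = set(class_attr.split())
    else:
        present = set(class_attr)
    return set(required) <= present


print(has_all_classes("class1 class2 class3", ["class3", "class1"]))
print(has_all_classes(["class1", "class4"], ["class1", "class2"]))
```

In the loop over elements('div'), the string comparison could then be replaced with a call such as has_all_classes(i.attrs.get('class'), ['class1', 'class2', 'class3']), assuming attrs behaves like a dict as the printed output suggests.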