
Usage of python-readability


(https://github.com/buriy/python-readability)

I am struggling to use this library and I can't find any documentation for it. (Is there any?)

Calling help(Document) gives some usable information, but something still goes wrong.

My code so far:

from readability.readability import Document
import requests

url = 'http://www.somepage.com'

html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()

with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(readable_article)

According to the help(Document) output, it should also be possible to pass a list as negative_keywords:

readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2']).summary()

This gives me a traceback I don't understand:

Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
    candidates = self.score_paragraphs()
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
    candidates[parent_node] = self.score_node(parent_node)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
    content_score = self.class_weight(elem)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
    if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'

Could someone please give me a hint about this error or how to deal with it?


Solution

  • There's an error in the library code. If you look at compile_pattern:

    def compile_pattern(elements):
        if not elements:
            return None
        elif isinstance(elements, (list, tuple)):
            return list(elements)
        elif isinstance(elements, regexp_type):
            return elements
        else:
            # assume string or string like object
            elements = elements.split(',')
            return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)
    

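    You can see that it returns a regular expression only when elements is a string (which it compiles) or already a compiled regular expression (which it returns unchanged). A list or tuple is returned as a plain list, not a regex.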

    Later on, though, class_weight assumes that self.negative_keywords is a regular expression and calls .search() on it, which is what raises the AttributeError for a list. So I suggest passing your keywords as a single comma-separated string, such as "test_keyword1,test_keyword2". compile_pattern will then return a regular expression, which should fix the error.
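
If the keywords already live in a list, the workaround can be sketched roughly as follows. The helper names keywords_to_string and keywords_to_pattern are hypothetical (they are not part of python-readability); the second one simply mirrors the string branch of compile_pattern quoted above, and it assumes your installed version contains that same function.

```python
import re


def keywords_to_string(keywords):
    """Hypothetical helper: join a list into the comma-separated
    string form that the string branch of compile_pattern splits on."""
    return ",".join(keywords)


def keywords_to_pattern(keywords):
    """Hypothetical helper: build one compiled regex from a list,
    mirroring what compile_pattern does for a string input."""
    cleaned = [k.strip().lower() for k in keywords if k and k.strip()]
    if not cleaned:
        return None
    # Escape each keyword so characters such as '-' match literally.
    return re.compile("|".join(re.escape(k) for k in cleaned), re.U)


keywords = ["test_keyword1", "test-keyword2"]
as_string = keywords_to_string(keywords)
as_pattern = keywords_to_pattern(keywords)

# Unlike a list, a compiled pattern has the .search() method
# that the traceback shows class_weight calling.
print(as_string)
print(as_pattern.search("sidebar test-keyword2") is not None)
```

Either value should then be usable in place of the list, for example Document(html, negative_keywords=as_string).summary(). Passing as_pattern relies on the isinstance(elements, regexp_type) branch shown above returning the regex unchanged, so treat that variant as an assumption about your installed version.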