
"TypeError: unhashable type" while trying to retrieve information using BeautifulSoup, Python


I was trying to use the Tumblr API (PyTumblr, to be specific) to crawl images from posts with certain tags.

This is the code I use; it is quite simple:

import pytumblr
from bs4 import BeautifulSoup

# Authenticate via API Key
client = pytumblr.TumblrRestClient('#Here is my API Key#')
print client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)

The result looks something like this:

{
  "meta": {
    "status": 200,
    "msg": "OK"
  },
  "response": {
    "blog": {
      "title": "W é r G i d A",
      "name": "wergida",
      "total_posts": 1181,
      "posts": 1181,
      "url": "http://wergida.tumblr.com/",
      "updated": 1466319493,
      "description": "Ha bárkit érdekelne",
      "is_nsfw": false,
      "ask": false,
      "ask_page_title": "Ask me anything",
      "ask_anon": false,
      "share_likes": true,
      "likes": 1131
    },
    "posts": [
      {
        "blog_name": "wergida",
        "id": 136740690571,
        "post_url": "http://wergida.tumblr.com/post/136740690571/bernhard-bernd-becher-1931-2007-and-hilla",
        "slug": "bernhard-bernd-becher-1931-2007-and-hilla",
        "type": "photo",
        "date": "2016-01-06 11:30:23 GMT",
        "timestamp": 1452079823,
        "state": "published",
        "format": "html",
        "reblog_key": "TiOl8nWT",
        "tags": [
          "industrial facades",
          "bernd and hilla becher",
          "photography",
          "eisenhüttenstadt",
          "brandenburg"
        ],
        "short_url": "https://tmblr.co/ZaE70t1-MOLgB",
        "summary": "Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT...",
        "recommended_source": null,
        "recommended_color": null,
        "highlighted": [],
        "note_count": 2,
        "caption": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br/></p>",
        "reblog": {
          "tree_html": "",
          "comment": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>"
        },
        "trail": [
          {
            "blog": {
              "name": "wergida",
              "active": true,
              "theme": {
                "avatar_shape": "square",
                "background_color": "#FAFAFA",
                "body_font": "Helvetica Neue",
                "header_bounds": "",
                "header_image": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
                "header_image_focused": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
                "header_image_scaled": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
                "header_stretch": true,
                "link_color": "#529ECC",
                "show_avatar": true,
                "show_description": true,
                "show_header_image": true,
                "show_title": true,
                "title_color": "#444444",
                "title_font": "Gibson",
                "title_font_weight": "bold"
              },
              "share_likes": true,
              "share_following": false
            },
            "post": {
              "id": "136740690571"
            },
            "content_raw": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>",
            "content": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br /></p>",
            "is_current_item": true,
            "is_root_item": true
          }
        ],
        "image_permalink": "http://wergida.tumblr.com/image/136740690571",
        "photos": [
          {
            "caption": "",
            "alt_sizes": [
              {
                "url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
                "width": 1280,
                "height": 973
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_500.jpg",
                "width": 500,
                "height": 380
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_400.jpg",
                "width": 400,
                "height": 304
              },
              {
                "url": "https://65.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_250.jpg",
                "width": 250,
                "height": 190
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_100.jpg",
                "width": 100,
                "height": 76
              },
              {
                "url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_75sq.jpg",
                "width": 75,
                "height": 75
              }
            ],
            "original_size": {
              "url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
              "width": 1280,
              "height": 973
            }
          }
        ]
      }
    ],
    "total_posts": 223
  }
}

But when I then pass that result to BeautifulSoup to parse it:

soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")

I get this:

Traceback (most recent call last):
  File "tumblr_test.py", line 29, in <module>
    soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")
  File "/Users/CB/Public/scrapy/env/lib/python2.7/site-packages/bs4/__init__.py", line 199, in __init__
    if markup[:5] == "http:" or markup[:6] == "https:":
TypeError: unhashable type

I've also tried other parsers, such as "html.parser" and "html5lib", but I still get the same error.

Thanks for any clues!


Solution

  • The client.posts() call returns a Python dictionary, not a string containing HTML; PyTumblr has already parsed the JSON response for you. BeautifulSoup expects a string, so it evaluates markup[:5]. On a dictionary, [:5] is passed as a slice object to be used as a key, and a slice object is not hashable, hence the error:

    >>> {}[:5]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type
    

    A dictionary is not HTML. There is no need to try and parse it with BeautifulSoup. Just access individual data elements in the nested structure instead; if such an element is itself a string and that string contains HTML markup, then it may make sense to parse that specific piece of data:

    response = client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)
    post = response['response']['posts'][0]
    # name the parser explicitly; "lxml" is the one used in the question
    soup = BeautifulSoup(post['caption'], 'lxml')
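
Since the goal in the question is to collect images, the following is a minimal sketch of walking the nested dictionary directly, with no HTML parsing involved. It assumes the response has the shape printed in the question. The names `photo_urls`, `best_size` and `sample_response` are made up for illustration and are not part of PyTumblr, and the sample data is a trimmed, hand-written stand-in for a real API result. Depending on the PyTumblr version, the outer "response" wrapper may already have been stripped, so the sketch accepts both shapes.

```python
# Trimmed stand-in for the dictionary returned by client.posts();
# only the keys used below are kept (values copied from the question).
BASE = "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/"
sample_response = {
    "response": {
        "posts": [
            {
                "photos": [
                    {
                        "alt_sizes": [
                            {"url": BASE + "tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
                             "width": 1280, "height": 973},
                            {"url": BASE + "tumblr_nzk87tVlqk1s5ljg4o1_500.jpg",
                             "width": 500, "height": 380},
                        ],
                        "original_size": {
                            "url": BASE + "tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
                            "width": 1280, "height": 973,
                        },
                    }
                ]
            }
        ]
    }
}


def best_size(photo, max_width):
    """Return the widest alt size not exceeding max_width, else the original."""
    candidates = [s for s in photo.get("alt_sizes", []) if s["width"] <= max_width]
    if not candidates:
        return photo["original_size"]
    return max(candidates, key=lambda s: s["width"])


def photo_urls(result, max_width=None):
    """Collect one image URL per photo from a posts() result dictionary."""
    # Accept both the wrapped ({"response": {...}}) and the unwrapped shape.
    payload = result.get("response", result)
    urls = []
    for post in payload.get("posts", []):
        # Non-photo posts have no "photos" key, so default to an empty list.
        for photo in post.get("photos", []):
            if max_width is None:
                urls.append(photo["original_size"]["url"])
            else:
                urls.append(best_size(photo, max_width)["url"])
    return urls


for url in photo_urls(sample_response, max_width=500):
    print(url)
```

With a real client, the same function would be called as `photo_urls(client.posts(...))`. BeautifulSoup is only needed for fields such as `caption` whose values are themselves HTML strings.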