I'm trying to use the Tumblr API (PyTumblr, to be specific) to crawl images from posts with certain tags.
This is the code I use, which is quite simple:
import pytumblr
from bs4 import BeautifulSoup
# Authenticate via API Key
client = pytumblr.TumblrRestClient('#Here is my API Key#')
print client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)
The result looks something like this:
{
"meta": {
"status": 200,
"msg": "OK"
},
"response": {
"blog": {
"title": "W é r G i d A",
"name": "wergida",
"total_posts": 1181,
"posts": 1181,
"url": "http://wergida.tumblr.com/",
"updated": 1466319493,
"description": "Ha bárkit érdekelne",
"is_nsfw": false,
"ask": false,
"ask_page_title": "Ask me anything",
"ask_anon": false,
"share_likes": true,
"likes": 1131
},
"posts": [
{
"blog_name": "wergida",
"id": 136740690571,
"post_url": "http://wergida.tumblr.com/post/136740690571/bernhard-bernd-becher-1931-2007-and-hilla",
"slug": "bernhard-bernd-becher-1931-2007-and-hilla",
"type": "photo",
"date": "2016-01-06 11:30:23 GMT",
"timestamp": 1452079823,
"state": "published",
"format": "html",
"reblog_key": "TiOl8nWT",
"tags": [
"industrial facades",
"bernd and hilla becher",
"photography",
"eisenhüttenstadt",
"brandenburg"
],
"short_url": "https://tmblr.co/ZaE70t1-MOLgB",
"summary": "Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT...",
"recommended_source": null,
"recommended_color": null,
"highlighted": [],
"note_count": 2,
"caption": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br/></p>",
"reblog": {
"tree_html": "",
"comment": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>"
},
"trail": [
{
"blog": {
"name": "wergida",
"active": true,
"theme": {
"avatar_shape": "square",
"background_color": "#FAFAFA",
"body_font": "Helvetica Neue",
"header_bounds": "",
"header_image": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
"header_image_focused": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
"header_image_scaled": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
"header_stretch": true,
"link_color": "#529ECC",
"show_avatar": true,
"show_description": true,
"show_header_image": true,
"show_title": true,
"title_color": "#444444",
"title_font": "Gibson",
"title_font_weight": "bold"
},
"share_likes": true,
"share_following": false
},
"post": {
"id": "136740690571"
},
"content_raw": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>",
"content": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br /></p>",
"is_current_item": true,
"is_root_item": true
}
],
"image_permalink": "http://wergida.tumblr.com/image/136740690571",
"photos": [
{
"caption": "",
"alt_sizes": [
{
"url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
"width": 1280,
"height": 973
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_500.jpg",
"width": 500,
"height": 380
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_400.jpg",
"width": 400,
"height": 304
},
{
"url": "https://65.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_250.jpg",
"width": 250,
"height": 190
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_100.jpg",
"width": 100,
"height": 76
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_75sq.jpg",
"width": 75,
"height": 75
}
],
"original_size": {
"url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
"width": 1280,
"height": 973
}
}
]
}
],
"total_posts": 223
}
}
But when I pass that result to BeautifulSoup to parse it:
soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")
I get this error:
Traceback (most recent call last):
File "tumblr_test.py", line 29, in <module>
soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")
File "/Users/CB/Public/scrapy/env/lib/python2.7/site-packages/bs4/__init__.py", line 199, in __init__
if markup[:5] == "http:" or markup[:6] == "https:":
TypeError: unhashable type
I've also tried other parsers such as "html.parser" and "html5lib", but I still get the same error.
Thanks for any clues!
The client.posts() call returns a Python dictionary, not a string containing HTML; it has already parsed the JSON response for you. BeautifulSoup tries to treat that dictionary as a string, and you get your error because markup[:5] passes a slice object to the dictionary as a key, and a slice object is not hashable:
>>> {}[:5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type
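If you are unsure what a call handed back, a quick type check before reaching for a parser makes the situation obvious. The following is only a minimal sketch: `describe` is a hypothetical helper, and `result` is a hand-written stand-in for the API result, not real output from the client.

```python
# Hypothetical stand-in for what client.posts(...) hands back
result = {"meta": {"status": 200, "msg": "OK"}, "response": {"posts": []}}


def describe(value):
    """Return a short label saying whether value is parsed data or text."""
    if isinstance(value, dict):
        # Already-parsed JSON: list the top-level keys
        return "dict with keys: " + ", ".join(sorted(value))
    if isinstance(value, str):
        # Text that a parser such as BeautifulSoup could accept
        return "string of length %d" % len(value)
    return type(value).__name__


print(describe(result))
```

Only values that come back as strings are candidates for BeautifulSoup.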
A dictionary is not HTML. There is no need to try and parse it with BeautifulSoup. Just access individual data elements in the nested structure instead; if such an element is itself a string and that string contains HTML markup, then it may make sense to parse that specific piece of data:
response = client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)
post = response['response']['posts'][0]
soup = BeautifulSoup(post['caption'], "html.parser")
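Since the goal was to collect images, the URLs can be read straight out of the nested structure with no HTML parsing at all. Here is a rough sketch, assuming the result has the shape shown in the question; `photo_urls` is a hypothetical helper name, and `sample` is a trimmed-down copy of that output rather than a live API call. As an assumption on my part, the helper also accepts a result where the outer "response" wrapper is already gone, in case your client version strips it.

```python
def photo_urls(result):
    """Return the original-size image URL of every photo in every post."""
    # Accept both the full structure shown in the question and a result
    # where the outer "response" wrapper has already been removed.
    payload = result.get("response", result)
    urls = []
    for post in payload.get("posts", []):
        # Non-photo posts have no "photos" list, so default to empty
        for photo in post.get("photos", []):
            urls.append(photo["original_size"]["url"])
    return urls


# Trimmed-down sample in the shape of the output shown in the question
sample = {
    "meta": {"status": 200, "msg": "OK"},
    "response": {
        "posts": [
            {
                "id": 136740690571,
                "type": "photo",
                "photos": [
                    {
                        "caption": "",
                        "original_size": {
                            "url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
                            "width": 1280,
                            "height": 973,
                        },
                    }
                ],
            }
        ],
        "total_posts": 223,
    },
}

print(photo_urls(sample))
```

With the real client you would pass the return value of client.posts(...) to the helper instead of `sample`. The smaller renditions live under each photo's "alt_sizes" list if you do not need the original size.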