I have what I believe to be a fairly simple issue.
I've retrieved a file from gdata, this file: https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments
I'm attempting to single out the text between the "<author>HERE</author>" tags so I'll be left with an output containing only usernames. Is Python even the best way to go about this, or should I use another language? I've been googling since 8:00am (4hrs) and I've yet to find anything for such a seemingly easy task.
Best regards, - Mitch Powell
You have an Atom feed there, so I'd use feedparser to handle that:
import feedparser
result = feedparser.parse('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
for entry in result.entries:
    print(entry.author)
This prints:
FreebieFM
micromicros
FreebieFM
Sarah Grimstone
FreebieFM
# etc.
Feedparser is an external library, but easily installed. If you have to use only the standard library, you could use the ElementTree API, but to parse the Atom feed you need to include HTML entities in the parser, and you'll have to deal with namespaces (not ElementTree's strong point):
from urllib2 import urlopen
from xml.etree import ElementTree
response = urlopen('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
tree = ElementTree.parse(response)
nsmap = {'a': 'http://www.w3.org/2005/Atom'}
for author in tree.findall('.//a:author/a:name', namespaces=nsmap):
    print(author.text)
The nsmap dictionary lets ElementTree translate the a: prefix to the correct namespace for those elements.
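If you want to try the namespace mapping without fetching anything, here is a minimal sketch that runs the same findall() query against a small inline document. The SAMPLE_FEED string, the usernames in it, and the author_names() helper are made up for illustration; they are not taken from the real feed, which has many more elements per entry.

```python
from xml.etree import ElementTree

# Prefix-to-namespace map; the 'a' prefix is arbitrary, the URI is the Atom namespace.
ATOM_NS = {'a': 'http://www.w3.org/2005/Atom'}

# Hypothetical, hand-written sample standing in for the downloaded feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <author><name>alice</name></author>
    <content>First comment</content>
  </entry>
  <entry>
    <author><name>bob</name></author>
    <content>Second comment</content>
  </entry>
</feed>"""


def author_names(xml_text):
    """Return the text of every <author><name> element in an Atom document."""
    root = ElementTree.fromstring(xml_text)
    matches = root.findall('.//a:author/a:name', namespaces=ATOM_NS)
    return [name.text for name in matches]


print(author_names(SAMPLE_FEED))
```

Without the namespaces argument, a plain './/author/name' query would find nothing here, because every element in the document lives in the Atom namespace declared on the root.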