I'm trying to reformat an RSS file that frequently has long, complex entries added to its beginning. I'm quite a noob and don't know where to begin, so I looked for a solution on this site but haven't found one yet. Some of the commands are quite unfamiliar to me, but I have worked with the file quite a bit and have it downloading on a schedule.
I'm trying to find the fourth item in the RSS feed (denoted by the closing "</item>" tag); however, this is where I've hit a snag and can't find the answer.
(Python 3)
import time
import sched
import urllib.request
import shutil

scheduler = sched.scheduler(time.time, time.sleep)

def rss():
    # Download the file from `url` and save it locally under `file_name`:
    with urllib.request.urlopen('http://any.website.here/rss') as response, open('test.xml', 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    print('Updating RSS')

def trunc():
    a = ()
    a = open('test.xml', 'r+', encoding = 'utf-8')
    c = (0)
    for line in a:
        if a.readline() == '</item>':
            c = c+1
            print(c, 'items found!' at )
            if c == 4:
                return a.tell()
    a.seek(0), print(a.read())
    a.close

def scheduler_rss():
    scheduler.enter(0, 1, rss, ()) # calls rss
    scheduler.run()
    trunc()
    #time.sleep(43200) #time in seconds, this is 12 hours
    time.sleep(30) #Variable for testing

for i in range(100):
    scheduler_rss()
This is just the most recent iteration of many failed attempts at finding a solution.
Anyway, this is the RSS I've been wrestling with: http://nightvale.libsyn.com/rss. It does copy the file onto my hard drive as I tell it to, and that file can then be read by an RSS feed reader (in my case, a ticker). Basically, I guess I'm asking: how can I find the position at which to truncate the file, that position being the fourth occurrence of the </item> tag in the .xml file, keeping in mind that the feed is updated regularly and this tag won't be in the same position in each version?
If you're interested in a different approach, here's how you can do this using Python's xml.dom module. You can just as well do this with xml.etree.
from xml.dom.minidom import parse

...  # download and save your xml
dom = parse('test.xml')

items = dom.getElementsByTagName('item')
for item in items:
    # Look at the first four child nodes of each <item>
    for child in item.childNodes[:4]:
        # Skip nodes that have no children (e.g. whitespace text nodes)
        if len(child.childNodes) > 0:
            print(child.tagName + ':', child.firstChild.nodeValue)
For every <item>, this prints something like the following, up to its fourth child node:
title: 110 - Matryoshka
pubDate: Thu, 15 Jun 2017 04:00:00 +0000
guid: ef49bfbd9603243db217053194cc2dc0
link: http://nightvale.libsyn.com/110-matryoshka
...
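And now, to remove every item after the fourth:

parentNode = items[0].parentNode
for i in range(4, len(items)):
    parentNode.removeChild(items[i])

# Write the truncated feed; the with-block makes sure the file is closed
with open('test2.xml', 'w', encoding='utf-8') as out_file:
    dom.writexml(out_file)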
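Since xml.etree was mentioned as an alternative, here is a minimal sketch of the same truncation with xml.etree.ElementTree. The helper names (`drop_extra_items`, `truncate_feed_text`, `truncate_feed_file`) are made up for illustration, and the code assumes a plain RSS 2.0 layout in which the `<item>` elements are direct children of `<channel>`.

```python
import xml.etree.ElementTree as ET

def drop_extra_items(root, limit=4):
    """Remove every <item> after the first `limit` ones (hypothetical helper)."""
    # Assumption: RSS 2.0 layout, <rss><channel><item>...</item></channel></rss>
    channel = root.find('channel')
    if channel is None:
        raise ValueError('no <channel> element found')
    # findall() returns a plain list, so removing while looping over it is safe
    for item in channel.findall('item')[limit:]:
        channel.remove(item)
    return root

def truncate_feed_text(xml_text, limit=4):
    """Return the feed as a string containing only the first `limit` items."""
    root = drop_extra_items(ET.fromstring(xml_text), limit)
    return ET.tostring(root, encoding='unicode')

def truncate_feed_file(src, dst, limit=4):
    """Read the feed from `src` and write the truncated copy to `dst`."""
    tree = ET.parse(src)
    drop_extra_items(tree.getroot(), limit)
    tree.write(dst, encoding='utf-8', xml_declaration=True)
```

With the file names used above, this would be called as truncate_feed_file('test.xml', 'test2.xml'). One caveat: ElementTree may rewrite namespace prefixes (for example to ns0) on output unless they are registered first with ET.register_namespace, which can matter for podcast feeds that use extra namespaces.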