I have a list of revisions from a Wikipedia article that I queried like this:
import re
import urllib.request

def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + wikititle
    revisions = []  # list of all accumulated revisions
    next = ''  # continuation parameter for the next request
    while True:
        response = urllib.request.urlopen(url + next).read()  # web request
        response = response.decode("utf-8")  # bytes to text (str(response) would keep the b'...' wrapper)
        revisions += re.findall('<rev [^>]*>', response)  # adds all revisions from the current request to the list
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:  # break the loop if the 'continue' element is missing
            break
        next = "&rvcontinue=" + cont.group(1)  # gets the revision id from which to start the next request
    return revisions
This results in a list in which each element is a rev tag as a string:
['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]
How can I generate a DataFrame from this list?
An "easy" way without using regex would be splitting the string and then parsing:
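import pandas as pd

for rev_string in revisions:
    rev_dict = {}
    # Skip the first and last items, which are the tag name and the closing "/>".
    # Note: splitting on spaces breaks if an attribute value (e.g. a comment) contains a space.
    attributes = rev_string.split(' ')[1:-1]
    # Split on the first "=" into key and value, and strip the surrounding quotes from the value.
    for attribute in attributes:
        key, value = attribute.split("=", 1)
        rev_dict[key] = value.strip('"')
    # Wrap the dict in a list so that it becomes a single row.
    df = pd.DataFrame([rev_dict])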
This sample creates one DataFrame per revision. If you would like to gather multiple revisions in one DataFrame, you need to handle attributes that do not appear on every revision (I don't know whether these change depending on the wiki document), and then, after gathering all revisions, convert the result to a DataFrame.
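To gather all revisions at once, one option is to let an XML parser read each tag instead of splitting on spaces, since comments and user names can contain spaces. The following is only a sketch: `revisions_to_records` is a hypothetical helper name, and it assumes each list element is a single `<rev ...>` tag as produced by the regex in the question.

```python
import xml.etree.ElementTree as ET


def revisions_to_records(revisions):
    """Turn a list of '<rev ... />' strings into a list of attribute dicts.

    Hypothetical helper: assumes every element is one rev tag as a string.
    """
    records = []
    for tag in revisions:
        # The regex in the question captures only the opening tag, so a
        # non-self-closing tag is closed here to make it parseable on its own.
        if not tag.endswith("/>"):
            tag = tag[:-1] + "/>"
        element = ET.fromstring(tag)
        # element.attrib maps attribute names to unquoted, unescaped values.
        records.append(dict(element.attrib))
    return records
```

With pandas installed, `df = pd.DataFrame(revisions_to_records(revisions))` then gives one row per revision. Attributes that are missing on some revisions (for example `minor`) show up as NaN in those rows, so no extra handling of unique attributes is needed.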