I want to iterate through the Users file of the Stack Overflow data dump. The problem is that it is huge and it is XML, which is a new topic for me. I have read several pieces of documentation and Stack Overflow posts, but for some reason it doesn't work.
XML Format:
<users>
<row Id="-1" Reputation="1"
CreationDate="2008-07-31T00:00:00.000"
DisplayName="Community"
LastAccessDate="2008-08-26T00:16:53.810"
WebsiteUrl="http://meta.stackexchange.com/"
Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person." Views="649" UpVotes="245983" DownVotes="924377" AccountId="-1"
/>
</users>
The Code:
from xml.etree.ElementTree import iterparse
for evt, elem in iterparse('data/Users.xml', events=('start','end')):
    print(evt, elem)
What I get:
The for loop prints a bunch of element objects with hex addresses, and at the end I get a MemoryError. Maybe that is normal, because when I tried it a second time it iterated through the XML very fast, in 0.13 seconds.
start <Element 'row' at 0x04CC16F0>
end <Element 'row' at 0x04CC16F0>
start <Element 'row' at 0x04CC1810>
I hope you can help with this question. How do I get the values out of this output? I want to save them into an SQL database.
The whole dump is 199 GB (Badges, Comments, PostLinks, PostHistory, Users, Posts, Tags and Votes). The Users.xml file this question is about is 2.49 GB, but I want to put all of the Stack Overflow data into the database.
Yours faithfully
HanahDevelope
It looks like you just need to loop through the end event for all row elements and do something with the attributes:
from xml.etree.ElementTree import iterparse
for evt, elem in iterparse('data/Users.xml', events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib
        print(user_fields)
This will output:
{'DisplayName': 'Community', 'Views': '649', 'DownVotes': '924377', 'LastAccessDate': '2008-08-26T00:16:53.810', 'Id': '-1', 'WebsiteUrl': 'http://meta.stackexchange.com/', 'Reputation': '1', 'Location': 'on the server farm', 'UpVotes': '245983', 'CreationDate': '2008-07-31T00:00:00.000', 'AboutMe': "<p>Hi, I'm not really a person.", 'AccountId': '-1'}
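On the MemoryError and the SQL part: iterparse still builds the whole tree in memory unless you discard elements as you go, which is a likely cause of the error on a 2.49 GB file. Below is a minimal sketch of one way to handle both, assuming SQLite as the target database. The helper names iter_rows and load_users, the users table, and the four-column subset are my own choices for illustration and not something from your setup, so adapt them to your schema.

```python
import sqlite3
from itertools import islice
from xml.etree.ElementTree import iterparse


def iter_rows(source):
    """Yield the attributes of each <row> as a dict, freeing parsed elements."""
    context = iterparse(source, events=('start', 'end'))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'row':
            yield dict(elem.attrib)  # copy before the element is discarded
            root.clear()  # drop finished rows so the tree does not keep growing


def load_users(source, conn, batch_size=10_000):
    """Insert rows into a hypothetical 'users' table in batches; return the count."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS users ('
        'Id INTEGER PRIMARY KEY, Reputation INTEGER, '
        'DisplayName TEXT, CreationDate TEXT)'
    )
    # Assumption: only a subset of the attributes is stored here.
    rows = (
        (int(r['Id']), int(r.get('Reputation', 0)),
         r.get('DisplayName'), r.get('CreationDate'))
        for r in iter_rows(source)
    )
    total = 0
    while True:
        batch = list(islice(rows, batch_size))  # at most batch_size rows in memory
        if not batch:
            break
        conn.executemany('INSERT INTO users VALUES (?, ?, ?, ?)', batch)
        conn.commit()
        total += len(batch)
    return total
```

You would call it with something like load_users('data/Users.xml', sqlite3.connect('so.db')), where so.db is just a placeholder file name. Inserting in batches with executemany keeps memory bounded and is usually much faster than committing one row at a time. The other dump files use the same row-with-attributes layout as far as I know, so the same generator should work for them with a different table definition.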