Search code examples
mysqlxmldata-dump

To read SO's data dump effectively


I use currently Vim to read SO's data dump. However, my Macbook slows down when I roll down just a few rows. This suggests me that there must be more efficient ways to read the data.

I know little MySQL. The files are in .xml -format. It is rather hard to read the data at the moment in .xml. It may be more efficient to convert the xml -files to MySQL and then read the files. I know only MS db -tool for such actions. However, I would like to know another tool too.

Problems

  1. to parse .xml to SQL -queries such that MySQL understand it. We need to know data structures of the data.
  2. to run the data in MySQL
  3. to find some tool similar to MS db -tool by which we can read the data effectively

How do you read SO's data dump effectively?

--

[edit]

  1. How can you run the 523 SQL queries to create the database in your terminal? I have the commands at the moment in a text -file.
  2. How can you "switch to [the recovery mode] to a simple recovery mode in the database?

Solution

  • I made my first ever python program to read them and output SQL insert statements for use with mysql (It's ugly but worked). You'll need to create the tables first though by hand.

    import xml.sax.handler
    import xml.sax
    import sys
    class SOHandler(xml.sax.handler.ContentHandler):
            def __init__(self):
                    self.errParse = 0
    
            def startElement(self, name, attributes):
                    if name != "row":
                            self.table = name;
                            self.outFile = open(name+".sql","w")
                            self.errfile = open(name+".err","w")
                    else:
                            skip = 0
                            currentRow = u"insert into "+self.table+"("
                            for attr in attributes.keys():
                                    currentRow += str(attr) + ","
                            currentRow = currentRow[:-1]
                            currentRow += u") values ("
                            for attr in attributes.keys():
                                    try:
                                            currentRow += u'"{0}",'.format(attributes[attr].replace('\\','\\\\').replace('"', '\\"').replace("'", "\\'"))
                                    except UnicodeEncodeError:
                                            self.errParse += 1;
                                            skip = 1;
                                            self.errfile.write(currentRow)
                            if skip != 1:
                                    currentRow = currentRow[:-1]
                                    currentRow += u");"
                                    #print len(attributes.keys())
                                    self.outFile.write(currentRow.encode("utf-8"))
                                    self.outFile.write("\n")
                                    self.outFile.flush()
                                    print currentRow.encode("utf-8");
    
            def characters(self, data):
                    pass
    
            def endElement(self, name):
                    pass
    
    if len(sys.argv) < 2:
            print "Give me an xml file argument!"
            sys.exit(1)
    
    parser = xml.sax.make_parser()
    handler = SOHandler()
    parser.setContentHandler(handler)
    parser.parse(sys.argv[1])
    print handler.errParse