
Is there a GEDCOM parser written in Python?


GEDCOM is a standard for exchanging genealogical data.

I've found parsers written in several other languages, but none so far written in Python. The closest I've come is the file libgedcom.py from the GRAMPS project, but it is so full of references to GRAMPS modules that it isn't usable for me.

I just want a simple standalone GEDCOM parser library written in Python. Does this exist?


Solution

  • A few years ago I wrote a simplistic GEDCOM to XML translator in Python as part of a larger project. I found that dealing with the GEDCOM data in an XML format was much easier (especially when the next step involved XSLT).

    I don't have the code online at the moment, so I've pasted the module into this message. This works for me; no guarantees. Hope this helps though.

    import re
    import sys
    from xml.sax.saxutils import escape

    fn = sys.argv[1]

    # The input is read as cp437; change the encoding to match your file.
    ged = open(fn, encoding="cp437")
    xml = open(fn + ".xml", "w", encoding="utf8")
    xml.write("""<?xml version="1.0"?>\n""")
    xml.write("<gedcom>")
    sub = []  # stack of tags that are currently open
    for s in ged:
        s = s.strip()
        # A GEDCOM line is: level, optional @id@, tag, optional data.
        m = re.match(r"(\d+) (@(\w+)@ )?(\w+)( (.*))?", s)
        if m is None:
            print("Error: unmatched line:", s, file=sys.stderr)
            continue  # skip the line instead of crashing on m.group()
        level = int(m.group(1))
        xref_id = m.group(3)
        tag = m.group(4)
        data = m.group(6)
        # Close every element that is at this level or deeper.
        while len(sub) > level:
            xml.write("</%s>\n" % (sub[-1]))
            sub.pop()
        if level != len(sub):
            print("Error: unexpected level:", s, file=sys.stderr)
        sub += [tag]
        if xref_id is not None:
            xml.write("<%s id=\"%s\">" % (tag, xref_id))
        else:
            xml.write("<%s>" % (tag))
        if data is not None:
            # Data of the form @X@ is a pointer to another record.
            m = re.match(r"@(\w+)@", data)
            if m:
                xml.write(m.group(1))
            elif tag == "NAME":
                # Names look like "Forename /Surname/".
                m = re.match(r"(.*?)/(.*?)/$", data)
                if m:
                    xml.write("<forename>%s</forename><surname>%s</surname>" % (escape(m.group(1).strip()), escape(m.group(2))))
                else:
                    xml.write(escape(data))
            elif tag == "DATE":
                # Dates are "[[day] month] year"; anything else is copied as text.
                m = re.match(r"(((\d+)?\s+)?(\w+)?\s+)?(\d{3,})", data)
                if m:
                    if m.group(3) is not None:
                        xml.write("<day>%s</day><month>%s</month><year>%s</year>" % (m.group(3), m.group(4), m.group(5)))
                    elif m.group(4) is not None:
                        xml.write("<month>%s</month><year>%s</year>" % (m.group(4), m.group(5)))
                    else:
                        xml.write("<year>%s</year>" % m.group(5))
                else:
                    xml.write(escape(data))
            else:
                xml.write(escape(data))
    # Close whatever is still open at end of file.
    while len(sub) > 0:
        xml.write("</%s>" % sub[-1])
        sub.pop()
    xml.write("</gedcom>\n")
    ged.close()
    xml.close()
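
Once the data is in XML, reading it back needs only the standard library. Here is a rough sketch of pulling people out with xml.etree.ElementTree. The sample document is hand-written in the shape the translator above should emit for a one-person file (it is not captured from a real run), and the helper name list_people is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample mimicking the translator's output (an assumption,
# not the result of an actual conversion).
SAMPLE = """<?xml version="1.0"?>
<gedcom><INDI id="I1"><NAME><forename>John</forename><surname>Smith</surname></NAME>
<BIRT><DATE><day>12</day><month>JAN</month><year>1900</year></DATE>
</BIRT>
</INDI>
<TRLR></TRLR></gedcom>
"""


def list_people(xml_text):
    """Return (id, forename, surname, birth year) for each INDI element.

    Hypothetical helper; missing fields come back as None.
    """
    root = ET.fromstring(xml_text)
    people = []
    for indi in root.iter("INDI"):
        people.append((
            indi.get("id"),
            indi.findtext("NAME/forename"),
            indi.findtext("NAME/surname"),
            indi.findtext("BIRT/DATE/year"),
        ))
    return people


if __name__ == "__main__":
    for person in list_people(SAMPLE):
        print(person)
```

The same path-style lookups work for other records (FAM, SOUR and so on), since the translator keeps the GEDCOM tag names as element names.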