
Populating an ontology by converting a huge XML file into RDF


I am trying to populate an ontology with data extracted from the Marvel Database wikia (a wiki can be exported as a single XML file that contains all of its content). My problem is that this XML file is too large to work with (more than 500 MB). I have tried to transform it into a much simpler RDF file with XSLT, but the size of the XML file makes that nearly impossible.

The XML document is made up of pages like this one:

<page>
<title>Aeroika (Earth-616)</title>
<ns>0</ns>
<id>1035</id>
  <sha1>11t0be5viqp0vsj8zwglfu3wea8fou4</sha1>
<revision>
  <id>1786343</id>
  <timestamp>2011-10-04T17:49:37Z</timestamp>
  <contributor>
    <username>HamsterMan</username>
    <id>2082346</id>
  </contributor>
  <minor/>
  <text xml:space="preserve" bytes="1652">{{Marvel Database:Character Template
| Image                   = Aeroika (Earth-616).jpg
| RealName                = Aeroika
| CurrentAlias            = Aeroika
| Aliases                 = 
| Identity                = 
| Affiliation             = [[Defenders (Earth-616)|Defenders]]
| Relatives               = 
| Universe                = Earth-616
| BaseOfOperations        = [[Tunnelworld]]

| Gender                  = Male
| Height                  = 
| Weight                  = 
| Eyes                    = 
| Hair                    = Gold
| UnusualSkinColour       = Gold
| UnusualFeatures         = Wings growing out of his head.
}}
[[Category:Flight]]</text>
</revision>
</page>

For example, in this case I wrote an XSLT stylesheet that extracts the important data into RDF:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:si="http://www.w3schools.com/rdf/">

<xsl:for-each select="page">
    <xsl:choose>
        <xsl:when test="contains(revision/text, 'Character Template')">
            <rdf:Description rdf:about="{title}">
                <Image><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Image'),'|'),'=')" /></Image>
                <RealName><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'RealName'),'|'),'=')" /></RealName>
                <CurrentAlias><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'CurrentAlias'),'|'),'=')" /></CurrentAlias>
                <Aliases><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Aliases'),'|'),'=')" /></Aliases>
                <Identity><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Identity'),'|'),'=')" /></Identity>
                <Affiliation><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Affiliation'),'|'),'=')" /></Affiliation>
                <Relatives><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Relatives'),'|'),'=')" /></Relatives>
                <Universe><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Universe'),'|'),'=')" /></Universe>
                <BaseOfOperations><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'BaseOfOperations'),'|'),'=')" /></BaseOfOperations>
                <Gender><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Gender'),'|'),'=')" /></Gender>
                <Height><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Height'),'|'),'=')" /></Height>
                <Weight><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Weight'),'|'),'=')" /></Weight>
                <Eyes><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Eyes'),'|'),'=')" /></Eyes>
                <Hair><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Hair'),'|'),'=')" /></Hair>
                <UnusualSkinColour><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'UnusualSkinColour'),'|'),'=')" /></UnusualSkinColour>
                <UnusualFeatures><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'UnusualFeatures'),'|'),'=')" /></UnusualFeatures>
                <Citizenship><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Citizenship'),'|'),'=')" /></Citizenship>
                <MaritalStatus><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'MaritalStatus'),'|'),'=')" /></MaritalStatus>
                <Occupation><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Occupation'),'|'),'=')" /></Occupation>
                <Education><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Education'),'|'),'=')" /></Education>
                <Origin><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Origin'),'|'),'=')" /></Origin>
                <PlaceOfBirth><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'PlaceOfBirth'),'|'),'=')" /></PlaceOfBirth>
                <Creators><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Creators'),'|'),'=')" /></Creators>
                <First><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'First'),'|'),'=')" /></First>
                <HistoryText><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'HistoryText'),'|'),'=')" /></HistoryText>
                <Powers><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Powers'),'|'),'=')" /></Powers>
                <Abilities><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Abilities'),'|'),'=')" /></Abilities>
                <Strength><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Strength'),'|'),'=')" /></Strength>
                <Weaknesses><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Weaknesses'),'|'),'=')" /></Weaknesses>
                <Equipement><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Equipement'),'|'),'=')" /></Equipement>
                <Transportation><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Transportation'),'|'),'=')" /></Transportation>      
                <Weapons><xsl:value-of select="substring-after(substring-before(substring-after(revision/text, 'Weapons'),'|'),'=')" /></Weapons>
            </rdf:Description>
        </xsl:when>
        <xsl:otherwise>
        </xsl:otherwise>
    </xsl:choose>
</xsl:for-each>
</rdf:RDF>
</xsl:template>

</xsl:stylesheet> 

Do you have any idea how I can do that? Thanks.


Solution

  • Your XSLT stylesheet transforms "normal" XML to RDF/XML syntax - which will be equally large or even larger, and almost as difficult to process. Moreover, RDF/XML is complex to write by hand, and easy to get wrong. Debugging your XSLT is going to be a nightmare.

    If your goal is to make your dataset more compact and easier to process, I suggest that instead, you transform your XML to RDF Turtle or RDF N-Triples syntax. These are extremely simple, compact text-based formats that lend themselves well to streaming processing, and any RDF-enabled software will be able to read and write these formats.

    You can use XSLT or, if that gives you scalability issues, any programming or scripting language that has basic XML support: get a streaming XML parser and hook in a simple script that processes the parser output and creates RDF data on the fly. Or, given that your input XML is fairly regularly structured, you could even skip the XML parser altogether and just hack together a couple of regular expressions to read the data. Use whichever technology you're most comfortable with.

    Of course, you can also try some of the end-user tools out there with built-in support for this kind of thing. For example, TopBraid Composer has some fancy features for this kind of conversion out of the box.