
Transform a text file without delimiters to XML with XSLT


I'm searching for the right tool to transform text files into xml.

The text file looks like this:

2017-01-03-10.11.1201000B  H4_01DE33411121...
2017-01-01-09.12.1301000BHAX4_01DE34256137...
2017-01-01-10.12.1301000BMLH4_01DE63789221...

Each line holds the content of one entity, and I have the following information about its layout (character positions are zero-based):

Letter 0-18: Attribute1
Letter 19-21: Attribute2
Letter 22-23: Attribute3
Letter 24: Attribute4
Letter 25-31: Attribute5
and so on....


Now I'm looking for a tool that transforms this text file according to these rules into the following XML:

<entities>
    <entity>
        <attribute1>2017-01-03-10.11.12</attribute1>
        <attribute2>010</attribute2>
        <attribute3>00</attribute3>
        <attribute4>B</attribute4>
        <attribute5>H4_01</attribute5>
        ... and so on
    </entity>
    <entity>
        <attribute1>2017-01-01-09.12.13</attribute1>
        <attribute2>010</attribute2>
        <attribute3>00</attribute3>
        <attribute4>B</attribute4>
        <attribute5>HAX4_01</attribute5>
        ... and so on
    </entity>
    <entity>
        <attribute1>2017-01-01-10.12.13</attribute1>
        <attribute2>010</attribute2>
        <attribute3>00</attribute3>
        <attribute4>B</attribute4>
        <attribute5>MLH4_01</attribute5>
        ... and so on
    </entity>
</entities>

The tool also needs to implement some simple logic, for example trimming strings, if/else, and date format conversion.

At first I thought of using XSLT, so that the owner of this weird text file could produce the corresponding configuration file on his own (that would be best!). But I have often read that XSLT is only for converting XML to other formats, not for converting plain text files to XML.

It should also be maintainable, so a shell script using awk and sed would be too confusing.

Do you know a tool which is more suitable than XSLT?


Solution

  • A smart way to do this is to generate an XSLT stylesheet from a data description file that describes the input.

    If the data description file has (using one-based start positions, as XPath's substring() does, so the question's zero-based range 0-18 becomes start 1, length 19)

    <fields>
      <field name="attribute1" start="1" length="19"/>
      <field name="attribute2" start="20" length="3"/>
    </fields>
    

    then it's pretty easy to generate an XSLT 3.0 transformation which does

    <xsl:template name="main" expand-text="yes">
      <entities>
        <xsl:for-each select="unparsed-text-lines('input.txt')">
          <entity>
            <attribute1>{substring(., 1, 19)}</attribute1>
            <attribute2>{substring(., 20, 3)}</attribute2>
          </entity>
        </xsl:for-each>
      </entities>
    </xsl:template>
    

    (and generating XSLT 2.0 is only very slightly more complex, but doing XSLT 1.0 is harder because you can't read a plain text file directly).
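
    The generation step itself can also be written in XSLT. The following is only a minimal sketch, assuming the description file has a <fields> root element as shown above. It relies on xsl:namespace-alias so that elements written with the out: prefix are emitted as xsl: elements. The out prefix, its namespace URI, and the input-uri parameter are names invented for this illustration.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Generator sketch: reads the <fields> description, writes a stylesheet. -->
    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:out="urn:x-example:alias">

      <!-- Elements written as out:... appear in the result as xsl:... -->
      <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>

      <xsl:output method="xml" indent="yes"/>

      <!-- Hypothetical parameter: URI of the fixed-width text file. -->
      <xsl:param name="input-uri" select="'input.txt'"/>

      <xsl:template match="/fields">
        <!-- expand-text="yes" switches on {...} in the generated stylesheet. -->
        <out:stylesheet version="3.0" expand-text="yes">
          <out:output method="xml" indent="yes"/>
          <out:template name="main">
            <entities>
              <out:for-each select="unparsed-text-lines('{$input-uri}')">
                <entity>
                  <xsl:apply-templates select="field"/>
                </entity>
              </out:for-each>
            </entities>
          </out:template>
        </out:stylesheet>
      </xsl:template>

      <!-- One element per field, containing a text value template such as
           {substring(., START, LENGTH)} built from the field's attributes. -->
      <xsl:template match="field">
        <xsl:element name="{@name}">
          <xsl:value-of
              select="concat('{substring(., ', @start, ', ', @length, ')}')"/>
        </xsl:element>
      </xsl:template>

    </xsl:stylesheet>
    ```

    Running this against the description file should yield a stylesheet of the shape shown above, with one element per <field>. The generator does not set expand-text itself, so the curly braces it writes are passed through literally and are only evaluated when the generated stylesheet runs.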

    Implementing your "simple logic" is a bit trickier, but it wouldn't be hard to add an extra attribute to the fields in the data description:

    <field name="attribute1" start="1" length="19" action="checkDate"/>
    

    which causes the generated XSLT to be

    <attribute1>{f:checkDate(substring(., 1, 19))}</attribute1>
    

    invoking a function in the stylesheet such as

    <xsl:function name="f:checkDate" as="xs:string">
      <xsl:param name="in" as="xs:string"/>
      <xsl:sequence select="if ($in castable as xs:date) then $in else error(...)"/>
    </xsl:function>
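
    To run fragments like these, the stylesheet also has to declare the f and xs prefixes and switch on text value templates. The following is a minimal complete sketch under stated assumptions: the namespace URI bound to f, the file name input.txt, and the helper f:toDateTime are all hypothetical. The helper illustrates the date format conversion mentioned in the question: the sample timestamps (2017-01-03-10.11.12) are not in the lexical form of xs:date or xs:dateTime, so they are rewritten before the cast. The attribute5 line shows trimming with normalize-space(); its positions come from the question's zero-based range 25-31.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:f="urn:x-example:functions"
        exclude-result-prefixes="#all"
        expand-text="yes">

      <xsl:output method="xml" indent="yes"/>

      <!-- Hypothetical helper: '2017-01-03-10.11.12' becomes 2017-01-03T10:11:12.
           If the input does not match the pattern, replace() returns it
           unchanged and the xs:dateTime cast raises a dynamic error. -->
      <xsl:function name="f:toDateTime" as="xs:dateTime">
        <xsl:param name="in" as="xs:string"/>
        <xsl:sequence select="xs:dateTime(replace($in,
            '^(\d{4}-\d{2}-\d{2})-(\d{2})\.(\d{2})\.(\d{2})$',
            '$1T$2:$3:$4'))"/>
      </xsl:function>

      <xsl:template name="main">
        <entities>
          <!-- The predicate skips empty or whitespace-only lines. -->
          <xsl:for-each select="unparsed-text-lines('input.txt')[normalize-space()]">
            <entity>
              <attribute1>{f:toDateTime(substring(., 1, 19))}</attribute1>
              <!-- Zero-based 25-31 is start 26, length 7; padding is trimmed. -->
              <attribute5>{normalize-space(substring(., 26, 7))}</attribute5>
            </entity>
          </xsl:for-each>
        </entities>
      </xsl:template>

    </xsl:stylesheet>
    ```

    With an XSLT 3.0 processor such as Saxon, the named template main can be selected as the entry point (Saxon's command line uses -it:main for this), and input.txt is resolved relative to the stylesheet. Returning xs:dateTime instead of a string means a malformed timestamp stops the transformation instead of silently ending up in the output.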