c# xml translation xliff

How do you read, process, and write the content of a non-standard formatted XML file?


I'm trying to process the content of a language XML file in C# for machine translation.

The content of the <seg-source> segments should be translated and written back to the <target> segments. The formatting of tags inside the source and target segments should stay the same.

My first problem is that the XML file is not read correctly because the start and end tags are not <xml> and </xml>. Replacing the first two lines of text with an <xml> tag does not work because the original XML file is written entirely on one line (the following example is formatted for readability).

Is there an easy way to copy all source information that should be translated to an array and write it back after I've processed it?

This is what the XML-Files (.sdlxliff) look like:

<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" sdl:version="1.0">
    <file original="" datatype="x-sdlfilterframework2" source-language="de-DE" target-language="en-US">
        <header>
            <file-info xmlns="http://sdl.com/FileTypes/SdlXliff/1.0">
                <value key="SDL:FileId">77260240-fccf-4e75-81e3-7a1ab00fe948</value>
                <value key="SDL:CreationDate">03/18/2022 16:00:07</value>
                <value key="SDL:OriginalFilePath"></value>
                <value key="SDL:FileTypeDllVersion">1.8.2.0</value>
                <value key="SDL:OriginalEncoding">utf-8</value>
                <value key="SDL:AutoClonedFlagSupported">True</value>
                <value key="HasUtf8Bom">False</value>
                <value key="LineBreakType">
</value>
                <value key="ParagraphTextDirections"/>
                <sniff-info>
                    <detected-encoding detection-level="Likely" encoding="utf-8"/>
                    <detected-source-lang detection-level="Guess" lang="de-DE"/>
                    <props>
                        <value key="HasUtf8Bom">False</value>
                        <value key="LineBreakType">
</value>
                    </props>
                </sniff-info>
            </file-info>
            <sdl:filetype-info>
                <sdl:filetype-id>Plain Text v 1.0.0.0</sdl:filetype-id>
            </sdl:filetype-info>
            <tag-defs xmlns="http://sdl.com/FileTypes/SdlXliff/1.0">
                <tag id="0">
                    <st name="^">^</st>
                </tag>
                <tag id="1">
                    <st name="$">$</st>
                </tag>
                <tag id="2">
                    <st name="^">^</st>
                </tag>
                <tag id="3">
                    <st name="$">$</st>
                </tag>
                <tag id="4">
                    <st name="^">^</st>
                </tag>
                <tag id="5">
                    <st name="$">$</st>
                </tag>
            </tag-defs>
        </header>
        <body>
            <trans-unit translate="no" id="08c58142-03fe-4aad-8bc6-64e45600e91c">
                <source>
                    <x id="0"/>
                </source>
            </trans-unit>
            <trans-unit id="038509df-7f97-4faa-867f-ec00a1290f62">
                <source>Ein Satz zu übersetzen</source>
                <seg-source>
                    <mrk mtype="seg" mid="1">Ein Satz zu übersetzen</mrk>
                </seg-source>
                <target>
                    <mrk mtype="seg" mid="1"/>
                </target>
                <sdl:seg-defs>
                    <sdl:seg id="1"/>
                </sdl:seg-defs>
            </trans-unit>
            <trans-unit translate="no" id="b3f5e43b-6bba-41e4-a9fd-b7e4077694cc">
                <source>
                    <x id="1"/>
                    <x id="2"/>
                </source>
            </trans-unit>
            <trans-unit id="4c7dcbe2-1ebe-4e56-bb9a-2fe647b12f1f">
                <source>Ein zweiter Satz zu übersetzen</source>
                <seg-source>
                    <mrk mtype="seg" mid="2">Ein zweiter Satz zu übersetzen</mrk>
                </seg-source>
                <target>
                    <mrk mtype="seg" mid="2"/>
                </target>
                <sdl:seg-defs>
                    <sdl:seg id="2"/>
                </sdl:seg-defs>
            </trans-unit>
            <trans-unit translate="no" id="0ca0c301-f5a2-44e8-8754-7618c98e14c6">
                <source>
                    <x id="3"/>
                    <x id="4"/>
                </source>
            </trans-unit>
            <trans-unit id="5b3973af-b0cf-4dcf-b66c-aea309389c2d">
                <source>Ein letzter weiterer Satz zu übersetzen</source>
                <seg-source>
                    <mrk mtype="seg" mid="3">Ein letzter weiterer Satz zu übersetzen</mrk>
                </seg-source>
                <target>
                    <mrk mtype="seg" mid="3"/>
                </target>
                <sdl:seg-defs>
                    <sdl:seg id="3"/>
                </sdl:seg-defs>
            </trans-unit>
            <trans-unit translate="no" id="1cced868-b401-45c5-be2b-ea1fede236c0">
                <source>
                    <x id="5"/>
                </source>
            </trans-unit>
        </body>
    </file>
</xliff>

This is my code for reading the file, but I have no clue how to deal with tags inside the source segments, and I guess there must be a better way to replace the start tag:

    string fileContents = File.ReadAllText(ofd_ToTranslate.FileName);

    fileContents = fileContents.Replace("<?xml version=\"1.0\" encoding=\"utf - 8\"?><xliff xmlns:sdl=\"http://sdl.com/FileTypes/SdlXliff/1.0\" xmlns=\"urn:oasis:names:tc:xliff:document:1.2\" version=\"1.2\" sdl:version=\"1.0\">", "<xml>");
    fileContents = fileContents.Replace("</xliff>", "</xml>");

    XmlReaderSettings settings = new XmlReaderSettings { NameTable = new NameTable() };
    XmlNamespaceManager xmlns = new XmlNamespaceManager(settings.NameTable);
    xmlns.AddNamespace("sdl", "");
    XmlParserContext context = new XmlParserContext(null, xmlns, "", XmlSpace.Default);
    XmlReader reader = XmlReader.Create(new StringReader(fileContents), settings, context);
    XmlDocument xmlDoc = new XmlDocument();

    xmlDoc.Load(reader);

    XmlNodeList sourceElements = xmlDoc.GetElementsByTagName("source");
    XmlNodeList targetElements = xmlDoc.GetElementsByTagName("target");

Solution

  • Your XML is perfectly fine, but it has a default namespace:

    xmlns="urn:oasis:names:tc:xliff:document:1.2"
    

    To access the nodes you need to qualify the element names with that namespace. There is no need to rename the root element to <xml>.

    Here's an example:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    var xd = XDocument.Load(@"file.xml");
    var xn = XNamespace.Get("urn:oasis:names:tc:xliff:document:1.2");
    var tus = xd.Descendants(xn + "trans-unit");
    Console.WriteLine(tus.Count());
    

    That outputs 7 for me.
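
The same namespace-qualified lookup can be extended to the actual task of copying each <seg-source> segment into the matching <target> segment. What follows is only a minimal sketch under a few assumptions: `SdlXliffSketch` and `TranslateSegments` are hypothetical names, the `translate` delegate stands in for whatever machine translation call you use, and source and target <mrk> elements are paired by their `mid` attribute as in the sample file. It translates each text node separately so that inline tags keep their position, which may be too naive for sentences that are split by inline tags.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SdlXliffSketch
{
    // Default XLIFF 1.2 namespace declared on the <xliff> root element.
    static readonly XNamespace Xliff = "urn:oasis:names:tc:xliff:document:1.2";

    // Copies each <seg-source> segment into the <target> segment with the same
    // mid, passing every text node through the supplied translate function.
    // (Hypothetical helper, not part of any library.)
    public static void TranslateSegments(XDocument doc, Func<string, string> translate)
    {
        foreach (var unit in doc.Descendants(Xliff + "trans-unit"))
        {
            var segSource = unit.Element(Xliff + "seg-source");
            var target = unit.Element(Xliff + "target");
            // Units without both elements (e.g. translate="no") are skipped.
            if (segSource == null || target == null) continue;

            foreach (var src in segSource.Elements(Xliff + "mrk"))
            {
                var mid = (string)src.Attribute("mid");
                var dst = target.Elements(Xliff + "mrk")
                                .FirstOrDefault(m => (string)m.Attribute("mid") == mid);
                if (dst == null) continue;

                // Deep-copy the source segment so inline tags stay in place,
                // then translate only the text nodes of the copy.
                var copy = new XElement(src);
                foreach (var text in copy.DescendantNodes().OfType<XText>().ToList())
                    text.Value = translate(text.Value);

                // Replace whatever the target segment contained with the translated copy.
                dst.ReplaceNodes(copy.Nodes());
            }
        }
    }
}
```

To use it, load the file with `XDocument.Load`, call `SdlXliffSketch.TranslateSegments(doc, yourTranslator)`, and write the result back with `doc.Save(path, SaveOptions.DisableFormatting)`, which should keep the output unindented like the original single-line file.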