Tags: python, xml, xpath, xliff

Extracting attributes from an XLIFF file using Python


I am using Python to read an XML-based file, specifically the SDLXLIFF variant of XLIFF, generated by computer-aided translation software. Such files typically contain a copy of the source file followed by a body of translation units, each of which usually contains "source" and "target" text. Pairs of source and target text are generally referred to as "segments". (A sample SDLXLIFF document is given below. It has only 3 segments, but there could be many thousands.)

The expected output is a dict of segments like {1: ["人口は江戸末期まで概ね3000万人台で安定していたが。","At the end of the Edo period the population was stable at roughly 30 million people.","true"]}.

For each member of the dict, the key is the segment's id attribute from the seg-defs part of the file.

The value is a three-element list containing the source text from the <mrk> in <seg-source> whose mid matches the segment id, the target text from the <mrk> in <target> whose mid matches the segment id, and the locked attribute from the seg-defs part of the file.

It seems to me that it should be possible to:

  1. Iterate through the segments in seg-defs
  2. Get the id attribute and locked attribute
  3. Search for the source in <seg-source> with a mid that matches id and get the source text
  4. Search for the target in <target> with a mid that matches id and get the target text
  5. Store in the dict a list containing source text, target text and locked status as a value, using id as the key
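
The five steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses the standard-library `xml.etree.ElementTree` rather than lxml, the inline `SAMPLE` document and the helper names `segment_text` and `extract_segments` are made up for the example, and dict keys are left as the id strings found in the file rather than converted to integers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down SDLXLIFF document used only for this sketch.
SAMPLE = """<xliff xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0"
       xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file><body>
    <trans-unit id="u1">
      <seg-source>
        <mrk mtype="seg" mid="1">Source one.</mrk>
        <mrk mtype="seg" mid="2">Source two.</mrk>
      </seg-source>
      <target>
        <mrk mtype="seg" mid="1">Target one.</mrk>
        <mrk mtype="seg" mid="2">Target two.</mrk>
      </target>
      <sdl:seg-defs>
        <sdl:seg id="1" locked="true" conf="Translated"/>
        <sdl:seg id="2" conf="Translated"/>
      </sdl:seg-defs>
    </trans-unit>
  </body></file>
</xliff>"""

NS = {
    "x": "urn:oasis:names:tc:xliff:document:1.2",
    "sdl": "http://sdl.com/FileTypes/SdlXliff/1.0",
}


def segment_text(mrk):
    """Return all text inside a <mrk>, or None if the element was not found."""
    # itertext() also picks up text around or inside inline child elements.
    return "".join(mrk.itertext()) if mrk is not None else None


def extract_segments(root):
    """Build {id: [source text, target text, locked]} from a parsed document."""
    segments = {}
    # Work one trans-unit at a time so ids are matched within their own unit.
    for unit in root.iterfind(".//x:trans-unit", NS):
        # Index the <mrk> elements by mid once instead of searching per id.
        sources = {m.get("mid"): m for m in unit.iterfind("x:seg-source//x:mrk[@mid]", NS)}
        targets = {m.get("mid"): m for m in unit.iterfind("x:target//x:mrk[@mid]", NS)}
        # Steps 1 and 2: walk seg-defs and read id and locked.
        for seg in unit.iterfind("sdl:seg-defs/sdl:seg", NS):
            seg_id = seg.get("id")
            # Steps 3 to 5: look up source and target by id and store the list.
            segments[seg_id] = [
                segment_text(sources.get(seg_id)),
                segment_text(targets.get(seg_id)),
                seg.get("locked", "false"),  # attribute is absent when not locked
            ]
    return segments


segments = extract_segments(ET.fromstring(SAMPLE))
print(segments)
```

Indexing the `<mrk>` elements by `mid` first means each lookup in steps 3 and 4 is a dict access, which may matter for files with many thousands of segments.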


My problems are: (a) I have not succeeded in iterating through each element in seg-defs and extracting the id and locked attributes; (b) once I have the id, I do not know how to search or filter for the <mrk> element with the matching mid (for a segment id of 1, that would be <mrk mtype="seg" mid="1">).
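
For problem (b), one possible approach is an attribute predicate in the path expression. The sketch below assumes the standard-library `xml.etree.ElementTree`, a made-up inline `SAMPLE`, and a hypothetical helper `find_mrk`. Building the path with an f-string is only reasonable when the id contains no quote characters; lxml's `xpath()` accepts XPath variables (for example `[@mid=$mid]` with `mid="1"` passed as a keyword argument), which avoids that concern.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical, trimmed-down document used only for this sketch.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file><body><trans-unit id="u1">
    <seg-source>
      <mrk mtype="seg" mid="1">Source one.</mrk>
      <mrk mtype="seg" mid="2">Source two.</mrk>
    </seg-source>
    <target>
      <mrk mtype="seg" mid="1">Target one.</mrk>
      <mrk mtype="seg" mid="2">Target two.</mrk>
    </target>
  </trans-unit></body></file>
</xliff>"""


def find_mrk(root, parent_tag, seg_id):
    """Return the first <mrk> under <parent_tag> whose mid equals seg_id, or None."""
    # [@mid='...'] keeps only elements whose mid attribute has that exact value.
    path = f".//x:{parent_tag}//x:mrk[@mid='{seg_id}']"
    return root.find(path, NS)


root = ET.fromstring(SAMPLE)
source = find_mrk(root, "seg-source", "2")
target = find_mrk(root, "target", "2")
missing = find_mrk(root, "target", "99")  # no such segment, so None

print(source.text, "---", target.text)
```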

So far all my code does is extract the source and target text as follows:

from lxml import etree

my_file = "example.sdlxliff"
# Parse straight from the file so lxml handles the encoding declaration itself;
# etree.fromstring() can reject a str that carries an encoding declaration.
tree = etree.parse(my_file).getroot()

# The default (unprefixed) namespace is XLIFF 1.2; bind it to the prefix "x" for XPath.
ns_map = {'x': tree.nsmap[None]}

sources = tree.xpath('//x:seg-source//x:mrk', namespaces=ns_map)
targets = tree.xpath('//x:target//x:mrk', namespaces=ns_map)
for source, target in zip(sources, targets):
    print(source.text + " --- " + target.text + "\n")

The seg id and locked status are stored in a separate part of the file that looks like this:

<sdl:seg-defs>
    <sdl:seg id="1" locked="true" conf="Translated" origin="interactive">
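
Reading these attributes mostly comes down to addressing the `sdl` namespace. A minimal sketch, assuming the standard-library `xml.etree.ElementTree` and a made-up two-segment fragment; in the sample file the `locked` attribute is simply absent on unlocked segments, so the sketch supplies `"false"` as a default:

```python
import xml.etree.ElementTree as ET

SDL = "http://sdl.com/FileTypes/SdlXliff/1.0"

# Hypothetical fragment: one locked segment, one without the locked attribute.
SAMPLE = f"""<sdl:seg-defs xmlns:sdl="{SDL}">
  <sdl:seg id="1" locked="true" conf="Translated" origin="interactive"/>
  <sdl:seg id="2" conf="Translated" origin="interactive"/>
</sdl:seg-defs>"""

root = ET.fromstring(SAMPLE)

locked_by_id = {}
# The prefix used in the path is mapped to the namespace URI by the dict.
for seg in root.iterfind("sdl:seg", {"sdl": SDL}):
    # get() returns the default when the attribute is missing.
    locked_by_id[seg.get("id")] = seg.get("locked", "false")

print(locked_by_id)  # {'1': 'true', '2': 'false'}
```

The prefix in the path does not have to match the one used in the file; only the namespace URI has to match.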

What are effective and preferably pythonic ways of extracting the segment id and locked attributes from this document so that I can build the dict described above, with the id as the key for each segment and locked stored in a list with the corresponding source and target text as the value?

Sample SDLXLIFF file:

<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0"
    xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" sdl:version="1.0">
    <file original="C:\Users\abc\Documents\Studio 2019\Projects\DropFiles\japan.txt" datatype="x-sdlfilterframework2" source-language="ja-JP" target-language="en-US">
        <header>
            <file-info xmlns="http://sdl.com/FileTypes/SdlXliff/1.0">
                <value key="SDL:FileId">02621408-34d4-4154-9dd7-7b6998ebe368</value>
                <value key="SDL:CreationDate">09/16/2023 20:30:44</value>
                <value key="SDL:OriginalFilePath">C:\Users\abc\Documents\Studio 2019\Projects\DropFiles\japan.txt</value>
                <value key="SDL:OriginalEncoding">utf-8</value>
                <value key="SDL:AutoClonedFlagSupported">True</value>
                <value key="HasUtf8Bom">False</value>
                <value key="LineBreakType">
                </value>
                <value key="ParagraphTextDirections"></value>
                <sniff-info>
                    <detected-encoding detection-level="Likely" encoding="utf-8"/>
                    <detected-source-lang detection-level="Guess" lang="ja-JP"/>
                    <props>
                        <value key="HasUtf8Bom">False</value>
                        <value key="LineBreakType">
                        </value>
                    </props>
                </sniff-info>
            </file-info>
            <sdl:filetype-info>
                <sdl:filetype-id>Plain Text v 1.0.0.0</sdl:filetype-id>
            </sdl:filetype-info>
            <tag-defs xmlns="http://sdl.com/FileTypes/SdlXliff/1.0">
                <tag id="0">
                    <st name="^">^</st>
                </tag>
                <tag id="1">
                    <st name="$">$</st>
                </tag>
            </tag-defs>
        </header>
        <body>
            <trans-unit translate="no" id="a8a4c497-6cd0-4b42-b87d-9f5bc8cd545e">
                <source>
                    <x id="0"/>
                </source>
            </trans-unit>
            <trans-unit id="ab72d223-8a2a-43b0-b503-af65b7d27de2">
                <source>人口は江戸末期まで概ね3000万人台で安定していたが。明治以降は人口急増期に入り、1967年に初めて1億人を突破した。その後出生率の低下に伴い2008年にピークを迎え、人口減少期が始まった。</source>
                <seg-source>
                    <mrk mtype="seg" mid="1">人口は江戸末期まで概ね3000万人台で安定していたが。</mrk>
                    <mrk mtype="seg" mid="2">明治以降は人口急増期に入り、1967年に初めて1億人を突破した。</mrk>
                    <mrk mtype="seg" mid="3">その後出生率の低下に伴い2008年にピークを迎え、人口減少期が始まった。</mrk>
                </seg-source>
                <target>
                    <mrk mtype="seg" mid="1">At the end of the Edo period the population was stable at roughly 30 million people.</mrk>
                    <mrk mtype="seg" mid="2">The population began growing rapidly in the Meiji Era and thereafter, exceeding 100 million people for the first time in 1967.</mrk>
                    <mrk mtype="seg" mid="3">Subsequently the birthrate began to fall, and after peaking in 2008 the population began an era decline.</mrk>
                </target>
                <sdl:seg-defs>
                    <sdl:seg id="1" locked="true" conf="Translated" origin="interactive">
                        <sdl:prev-origin origin="interactive">
                            <sdl:value key="SegmentIdentityHash">zb5f5d0tJBp6ZfAxFmVvh26SM4E=</sdl:value>
                            <sdl:value key="created_by">STONEPC\abc</sdl:value>
                            <sdl:value key="created_on">09/16/2023 19:31:48</sdl:value>
                            <sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
                            <sdl:value key="modified_on">09/16/2023 19:31:48</sdl:value>
                            <sdl:value key="SDL:OriginalTranslationHash">1069896568</sdl:value>
                        </sdl:prev-origin>
                        <sdl:value key="SegmentIdentityHash">zb5f5d0tJBp6ZfAxFmVvh26SM4E=</sdl:value>
                        <sdl:value key="created_by">STONEPC\abc</sdl:value>
                        <sdl:value key="created_on">09/16/2023 19:31:48</sdl:value>
                        <sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
                        <sdl:value key="modified_on">09/16/2023 19:31:48</sdl:value>
                        <sdl:value key="SDL:OriginalTranslationHash">1069896568</sdl:value>
                    </sdl:seg>
                    <sdl:seg id="2" conf="Translated" origin="interactive">
                        <sdl:value key="SegmentIdentityHash">j8MTFYhJndu21g6nUiW8N28QU/k=</sdl:value>
                        <sdl:value key="created_by">STONEPC\abc</sdl:value>
                        <sdl:value key="created_on">09/16/2023 19:31:56</sdl:value>
                        <sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
                        <sdl:value key="modified_on">09/16/2023 19:31:56</sdl:value>
                        <sdl:value key="SDL:OriginalTranslationHash">1432236465</sdl:value>
                    </sdl:seg>
                    <sdl:seg id="3" conf="Draft" origin="interactive">
                        <sdl:value key="SegmentIdentityHash">US1BN1eE/zdK+R9JVk9NSg+LmyU=</sdl:value>
                        <sdl:value key="created_by">STONEPC\abc</sdl:value>
                        <sdl:value key="created_on">09/16/2023 19:32:02</sdl:value>
                        <sdl:value key="last_modified_by">STONEPC\abc</sdl:value>
                        <sdl:value key="modified_on">09/16/2023 19:32:02</sdl:value>
                    </sdl:seg>
                </sdl:seg-defs>
            </trans-unit>
            <trans-unit translate="no" id="acaff8f7-6e91-4012-b909-2dbe76238709">
                <source>
                    <x id="1"/>
                </source>
            </trans-unit>
        </body>
    </file>
</xliff>

Solution

  • According to your additional explanation:

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    tree = ET.parse("example.sdlxliff")
    root = tree.getroot()

    # 'n' is the default XLIFF 1.2 namespace, 'm' the SDL extension namespace
    ns = {'n': 'urn:oasis:names:tc:xliff:document:1.2', 'm': 'http://sdl.com/FileTypes/SdlXliff/1.0'}

    # Source text keyed by mid
    src = {}
    for mrk in root.findall(".//n:seg-source/n:mrk[@mid]", namespaces=ns):
        src[mrk.get('mid')] = mrk.text

    # Target text keyed by mid
    targ = {}
    for mrk in root.findall(".//n:target/n:mrk[@mid]", namespaces=ns):
        targ[mrk.get('mid')] = mrk.text

    # Locked status keyed by segment id; a missing attribute counts as 'false'
    defs = {}
    for seg in root.findall(".//m:seg-defs/m:seg[@id]", namespaces=ns):
        defs[seg.get('id')] = seg.get('locked', 'false')

    # Merge the three dicts into one list per id: [source, target, locked]
    dd = defaultdict(list)
    for b in (src, targ, defs):
        for key, value in b.items():
            dd[key].append(value)

    for k, v in dd.items():
        print(f'{{{k}:{v}}}')
    

    Output:

    {1:['人口は江戸末期まで概ね3000万人台で安定していたが。', 'At the end of the Edo period the population was stable at roughly 30 million people.', 'true']}
    {2:['明治以降は人口急増期に入り、1967年に初めて1億人を突破した。', 'The population began growing rapidly in the Meiji Era and thereafter, exceeding 100 million people for the first time in 1967.', 'false']}
    {3:['その後出生率の低下に伴い2008年にピークを迎え、人口減少期が始まった。', 'Subsequently the birthrate began to fall, and after peaking in 2008 the population began an era decline.', 'false']}
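
One thing to watch with the append-based merge above: if an id is missing from one of the three dicts, the remaining values shift position in the list. The sketch below contrasts it with a merge that always yields `[source, target, locked]`. The data and the function names `merge_by_append` and `merge_fixed` are hypothetical, and whether a real SDLXLIFF file can lack a target `<mrk>` for a defined segment is an assumption made here for illustration.

```python
from collections import defaultdict


def merge_by_append(*dicts):
    """Merge as in the answer above: append whatever each dict holds for a key."""
    merged = defaultdict(list)
    for d in dicts:
        for key, value in d.items():
            merged[key].append(value)
    return dict(merged)


def merge_fixed(src, targ, defs):
    """Always produce [source, target, locked]; missing entries become None."""
    return {
        seg_id: [src.get(seg_id), targ.get(seg_id), locked]
        for seg_id, locked in defs.items()
    }


# Hypothetical data: segment "2" has no target entry.
src = {"1": "Source one.", "2": "Source two."}
targ = {"1": "Target one."}
defs = {"1": "true", "2": "false"}

appended = merge_by_append(src, targ, defs)
fixed = merge_fixed(src, targ, defs)

print(appended["2"])  # ['Source two.', 'false']
print(fixed["2"])     # ['Source two.', None, 'false']
```

With complete data, as in the sample file, both merges give the same lists.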