Tags: python, string, file, slice, file-pointer

Python f.seek and string slice yield different results


I'm working on a small personal project in Python that interprets an XML file as a script for a text-based console game. All the separate source files are merged into one large XML file. To avoid using too much memory by loading the entire file, I decided to use a separate JSON file as a table of contents pointing to the various <scene> elements (including the tags themselves).

This is an example of said table of contents: {"loremipsum1": [95, 366], "loremipsum3": [462, 283], "loremipsum_insamefile": [746, 62], "loremipsum2": [809, 603]}. The first value in each [,] pair is the position of the scene's starting character (well, supposedly), "<", and the second value is the length of the scene itself. The XML itself does not matter; what matters is how to extract text blocks according to these two parameters.

95 is the length of the header, that is, everything in the file before the first <scene> tag.
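
To make the two numbers concrete, here is a small sketch (not part of the original project) that rebuilds the header from the XML attached at the bottom of the post, using \n line endings, and checks its length. It then shows the intended lookup on an in-memory string. The names header, document, and extract, as well as the short demo scene, are made up for illustration.

```python
# Header of the compiled file: everything before the first <scene> tag,
# written with "\n" line endings.
header = (
    "<game>\n"
    "<metadata>\n"
    "<author>['Anonymous']</author>\n"
    "<version>0.1.0</version>\n"
    "</metadata>\n"
    "<scenes>\n"
)


def extract(text, coords):
    """Return the block described by a [start, length] pair (hypothetical helper)."""
    start, length = coords
    return text[start:start + length]


# A tiny stand-in document: the real header followed by one short demo scene.
scene = '<scene name="demo">\n<text>Hi.</text>\n</scene>'
document = header + scene + "\n"

print(len(header))                                   # 95
print(extract(document, [len(header), len(scene)]))  # prints the demo scene
```

On a plain string, the [start, length] pair behaves exactly as described above; the open question is why the same pair does not behave that way on a file.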


The method I was using involves something along the lines of:

# fcoords contains the [?,?] value from the table of contents
def parse_scene(readfile, fcoords):
    readfile.seek(fcoords[0], 0)
    scene = readfile.read(fcoords[1])
    # interpreter implementation would go here, but here's a print statement for now since it keeps on throwing errors
    print(scene)

Unfortunately it didn't work as planned, instead returning something like:

========FSEEK========
nes>
<scene name="loremipsum1">
<text>Lorem ipsum.</text>
<continue></continue>
<text>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam cursus tempus ipsum vitae euismod. Suspendisse sit amet nulla in sem sagittis cursus.
</text>
<options>
<link ref="loremipsum2">Nullam sollicitudin</link>
<link ref="loremipsum1">Ut tortor felis</link>
</options>
</s

After that I experimented a bit with the file and used the following script to compare seek-read and read-slice. The output differs between the two, and not just by a character or two at the front or back. Pasting the full differences would drag the post out, so here is the script if you want to test it.

Test Script

import json
# biblio.json contains the table of contents; feel free to replace this with just biblio = {}
with open("biblio.json",'r',encoding='utf-8') as f2:
    biblio = json.load(f2)
# compiled.xml contains the .xml file attached at the bottom of the post
with open("compiled.xml",'r',encoding='utf-8') as f:
    compiled_str = f.read()
    for pair in biblio.items():
        print("\033[92m========FSEEK========\033[0m")
        f.seek(pair[1][0])
        print(f.read(pair[1][1]))
        print("\033[92m=======SLICE=========\033[0m")
        print(compiled_str[pair[1][0]:pair[1][0]+pair[1][1]])
        print("==========================================")

Difference with Slice

=======SLICE=========
<scene name="loremipsum1">
<text>Lorem ipsum.</text>
<continue></continue>
<text>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam cursus tempus ipsum vitae euismod. Suspendisse sit amet nulla in sem sagittis cursus.
</text>
<options>
<link ref="loremipsum2">Nullam sollicitudin</link>
<link ref="loremipsum1">Ut tortor felis</link>
</options>
</scene>

Things I've tried:

  1. Removed tabs from the source during the merging process.
  2. Used Microsoft Word to count the characters (with spaces) and added the number of lines to that, since Word doesn't count '\n'. Results vary: some match the second value of [,] perfectly, others need an increment or decrement.
  3. Played around with how the merger counts the characters; it doesn't seem to be wrong (I can provide the merger script if necessary).
  4. Removed encoding="utf-8" from the open() arguments; it doesn't change anything.
  5. Excluded the newline after the closing tag of each scene and made sure the scenes are separated by one character (no overlap). seek-read still doesn't work.
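
One further check that is not in the list above, sketched here on the assumption that the mismatch might come from what is stored on disk rather than from the counting: read the file in binary mode and compare the raw byte count with the length of the string that text mode returns. The helper name describe_line_endings and the throwaway sample file are made up for illustration.

```python
import os
import tempfile


def describe_line_endings(path, encoding="utf-8"):
    """Compare what is stored on disk with what text mode hands back."""
    with open(path, "rb") as raw:
        data = raw.read()              # bytes exactly as stored
    with open(path, "r", encoding=encoding) as text_file:
        text = text_file.read()        # default text mode turns "\r\n" into "\n"
    return {
        "bytes_on_disk": len(data),
        "chars_in_text_mode": len(text),
        "crlf_count": data.count(b"\r\n"),
    }


# Demo on a throwaway file that deliberately uses "\r\n" endings.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.xml")
    with open(sample, "wb") as f:
        f.write(b"<game>\r\n<scenes>\r\n")
    report = describe_line_endings(sample)

print(report)
```

Running describe_line_endings("compiled.xml") on the real file would show whether the byte count and the character count disagree, and by how many "\r\n" pairs.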

Attached here is the XML file that was used.

<game>
<metadata>
<author>['Anonymous']</author>
<version>0.1.0</version>
</metadata>
<scenes>
<scene name="loremipsum1">
<text>Lorem ipsum.</text>
<continue></continue>
<text>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam cursus tempus ipsum vitae euismod. Suspendisse sit amet nulla in sem sagittis cursus.
</text>
<options>
<link ref="loremipsum2">Nullam sollicitudin</link>
<link ref="loremipsum1">Ut tortor felis</link>
</options>
</scene>
<scene name="loremipsum3">
<text>
Fusce a rutrum ligula, vel fringilla ex.
Sed lobortis eu mauris non dictum. Fusce nec diam nec metus gravida consectetur vitae et nunc.
Aenean sed ullamcorper ipsum. Vivamus pharetra eros a erat cursus, eget euismod sapien lobortis.
</text>
</scene>
<scene name="loremipsum_insamefile">
<text>Hi.</text>
</scene>
<scene name="loremipsum2">
<text>Ut tortor felis, sodales a ipsum ac, semper molestie lacus.
Nunc faucibus ultrices nibh id porttitor. Phasellus sed tempus neque.</text>
<continue>Vestibulum pulvinar</continue>
<text>
Vestibulum pulvinar, odio egestas ullamcorper porta, massa tellus sodales ipsum, a porttitor elit lectus pharetra risus.
Quisque et congue justo. Integer in quam diam. Nunc id orci justo. Phasellus sed hendrerit dolor.
</text>
<import ref="loremipsum3"></import>

<options>
<link ref="loremipsum1">Lorem ipsum.</link>
<link ref="loremipsum2">Ut tortor felis.</link>
</options>
</scene>
</scenes>
</game>

Notes:

  1. When the file is assembled, I use \n endings
  2. I'm using Windows
  3. All files are read with the UTF-8 encoding
  4. I'm using Python 3.9.7

Question: Why is the output different for the two methods (seek-read and read-slice), and how can I find a scene properly without loading the entire file (it's small for now) into memory?

(also is it possible to have those shrinkable spoilers rather than the ones that just hides text so I can format this better, since the examples are taking up way too much space)


Solution

  • It turns out that the line endings are indeed the problem in my case.

    open(filename,mode,newline="") fixed it.

    Here is a quote from the documentation for future reference.

    newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

    When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

    When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

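    The table of contents was made using len() on the in-memory string version of the source files, which uses \n as its line ending. When that string is later written to the compiled XML file, each \n is converted to \r\n. The extra \r's, which were not present when the table of contents was built, seem to cause the offset: the slice works on the translated text without them, while seek() positions itself in the underlying file, where they still exist. Six lines precede the first scene, which would account for seek-read starting six bytes early, at "nes>".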

    (now I have to wait two days to accept my own answer huh)
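
The effect described in this answer can be reproduced on any platform with a short sketch. It assumes that writing with newline="\r\n" is a fair stand-in for what the default newline=None does on Windows; the sample text, file names, and the helper seek_read are made up for illustration.

```python
import os
import tempfile

# In-memory text with "\n" endings, as the table of contents sees it.
text = "<scenes>\n<scene>\n<text>Hi.</text>\n</scene>\n</scenes>\n"
start = text.index("<scene>")
length = len("<scene>\n<text>Hi.</text>\n</scene>")
expected = text[start:start + length]


def seek_read(path, start, length, newline):
    """Seek to `start`, then read `length` characters in text mode."""
    with open(path, "r", encoding="utf-8", newline=newline) as f:
        f.seek(start)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    # Every "\n" becomes "\r\n" on disk (stand-in for the Windows default).
    crlf_path = os.path.join(tmp, "crlf.xml")
    with open(crlf_path, "w", encoding="utf-8", newline="\r\n") as f:
        f.write(text)

    # newline="" writes the "\n" characters untranslated.
    lf_path = os.path.join(tmp, "lf.xml")
    with open(lf_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    broken = seek_read(crlf_path, start, length, None)
    fixed = seek_read(lf_path, start, length, "")

print(broken == expected)  # False: the extra "\r" bytes shift the seek position
print(fixed == expected)   # True: offsets on disk match offsets in the string
```

Note that len() counts characters, not bytes, so offsets built this way only line up with file positions while the file stays ASCII. If non-ASCII text is expected, one option (not from the original post) would be to record byte offsets of the encoded text and read the file in binary mode.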