Learning Python - len() returns 2n+2


I'm sorry if this is a duplicate post but search seemed to yield no useful results...or maybe I'm such a noob that I'm not understanding what is being said in the answers.

I wrote this small script for practice (following "Learn Python the Hard Way"). I tried to write a shorter version of code that was already given to me.

from sys import argv

script, from_file, to_file = argv

# here is the part where I tried to simplify the commands and see if I still get the same result,
# Turns out it's the same 2n+2
trial = open(from_file)
trial_data = trial.read()
print(len(trial_data))
trial.close()

# actual code after defining the argument variables
in_file = open(from_file).read()

input(f"Transferring {len(in_file)} characters from {from_file} to {to_file}, hit RETURN to continue, CTRL-C to abort.")
# in_data = in_file.read()

out_file = open(to_file, 'w').write(in_file)

When using len(), it always seems to return 2n+2 instead of n, where n is the actual number of characters in the text file. I also made sure there are no extra lines in the text file.

Can someone kindly explain?

TIA

I was expecting the exact number of characters found in the txt file to be returned. Turns out it's too much to ask.

Edit: since so many are asking for a practical example....here it goes:

The poem 
dedicated to Puxijn
The Chonk one

What I get is:

ÿþT h e   p o e m

 d e d i c a t e d   t o   P u x i j n

 T h e   C h o n k   o n e

I think it is an encoding problem. I'm using the latest Python, if that is of any help.


Solution

  • Based on your updated question, you're definitely reading UTF-16 encoded text files using the locale default encoding (probably latin-1 or cp1252, both of which decode the UTF-16 BOM to ÿþ; Windows often uses cp1252 as the default, and latin-1, while largely eclipsed by UTF-8 today, was a popular locale on older UNIX-likes for a long time). Those encodings will read almost any bytes without error, even if the encoding is wrong (latin-1 maps all 256 byte values one to one onto 256 characters, and cp1252 does so for all but a handful), producing gibberish for bytes outside the ASCII range and weird gaps for the null byte that accompanies each ASCII character in UTF-16. That also explains the 2n+2: UTF-16 stores each of your n characters as two bytes, plus a two-byte BOM at the start, and a single-byte encoding turns every one of those bytes into its own character.

    Change all your open calls to add an extra argument, encoding='utf-16', e.g.:

    trial = open(from_file, encoding='utf-16')
    

    and Python will use the correct text encoding to decode the raw bytes to a str, and all your lengths will match up.

    Alternatively, when saving the files in your editor, choose an encoding that Python will use by default. UTF-8 is the usual choice: modern Python can be forced into UTF-8 mode regardless of locale settings (for example with python -X utf8, or by setting the PYTHONUTF8=1 environment variable), and UTF-8 is probably the most popular portable encoding, in part because pure ASCII text encoded as UTF-8 is byte-for-byte identical to ASCII, wasting no disk space.
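
To see the effect described in the answer without relying on any particular editor, here is a minimal sketch that writes a UTF-16 file with a BOM and reads it back two ways. The helper name demo_lengths and the sample string are made up for illustration, latin-1 stands in for whatever the locale default happens to be, and the file is assumed to be little-endian UTF-16 (which is what the ÿþ in the question suggests).

```python
import codecs
import os
import tempfile


def demo_lengths(text):
    """Hypothetical helper: store text as UTF-16 (little-endian, with BOM),
    then read it back with the wrong and the right encoding."""
    # Assumption: the editor saved a 2-byte BOM followed by 2 bytes per character.
    raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)

        # newline="" turns off newline translation so the counts stay exact.
        # latin-1 stands in for a single-byte locale default.
        with open(path, encoding="latin-1", newline="") as f:
            wrong = f.read()

        # With the correct encoding the BOM is consumed and not returned.
        with open(path, encoding="utf-16", newline="") as f:
            right = f.read()
    finally:
        os.remove(path)
    return wrong, right


sample = "The poem\ndedicated to Puxijn\nThe Chonk one"
wrong, right = demo_lengths(sample)

n = len(sample)
print("characters in the text:  ", n)
print("len() with latin-1:      ", len(wrong), "(2n+2 =", 2 * n + 2, ")")
print("len() with utf-16:       ", len(right))
```

The first two characters of the wrongly decoded string are the BOM bytes, and every second character after that is a null, which matches the ÿþ prefix and the gaps shown in the question.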