I'm sorry if this is a duplicate post but search seemed to yield no useful results...or maybe I'm such a noob that I'm not understanding what is being said in the answers.
I wrote this small script for practice (following "Learn Python the Hard Way"). I tried to make a shorter version of some code that was already given to me.
from sys import argv
script, from_file, to_file = argv
# here is the part where I tried to simplify the commands and see if I still get the same result,
# Turns out it's the same 2n+2
trial = open(from_file)
trial_data = trial.read()
print(len(trial_data))
trial.close()
# actual code after defining the argument variables
in_file = open(from_file).read()
input(f"Transferring {len(in_file)} characters from {from_file} to {to_file}, hit RETURN to continue, CTRL-C to abort.")
# in_data = in_file.read()
out_file = open(to_file, 'w').write(in_file)
When using len(), it always seems to return 2n+2 instead of n, where n is the actual number of characters in the text file. I also made sure there are no extra lines in the text file.
Can someone kindly explain?
TIA
I was expecting the exact number of characters found in the txt file to be returned. Turns out it's too much to ask.
Edit: since so many are asking for a practical example....here it goes:
The poem
dedicated to Puxijn
The Chonk one
What I get is:
ÿþT h e p o e m
d e d i c a t e d t o P u x i j n
T h e C h o n k o n e
I think it is an encoding problem. I'm using the latest Python, if that is of any help.
Based on your updated question, you're definitely reading UTF-16 encoded text files using the locale default encoding (probably latin-1 or cp1252, both of which would decode the UTF-16 BOM to ÿþ; Windows often uses cp1252 as the default, and latin-1, while largely eclipsed by UTF-8 in the present day, was a popular locale on older UNIX-likes for a long time). Those encodings will read nearly any bytes without error, even if the encoding is wrong (latin-1 maps all 256 byte values one to one onto 256 characters, and cp1252 does the same for all but a handful of unassigned bytes), producing gibberish (for bytes outside the ASCII range) and weird gaps (for the null byte that follows each ASCII character in little-endian UTF-16). That also explains the 2n+2: UTF-16 stores each of these characters as two bytes and adds a two-byte BOM at the start, and a single-byte encoding counts every byte as one character.
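To make the byte counting concrete, here is a minimal sketch that reproduces the effect in memory, without any files. It assumes the file was saved as little-endian UTF-16 with a BOM (which is what the ÿþ prefix suggests); the variable names are only illustrative.

```python
import codecs

text = "The poem"  # n = 8 characters, all ASCII

# Build the bytes a little-endian UTF-16 file with a BOM would contain:
# a 2-byte BOM (FF FE) followed by 2 bytes per character.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A single-byte codec turns every byte into one character.
wrong = raw.decode("latin-1")

# The "utf-16" codec consumes the BOM and pairs the bytes up again.
right = raw.decode("utf-16")

print(len(text), len(raw), len(wrong), len(right))  # prints: 8 18 18 8
print(repr(wrong[:4]))  # the BOM shows up as 'ÿþ', then 'T' and a null character
```

The wrongly decoded string has 2n+2 characters because it contains one character per byte, including the two BOM bytes.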
Change all your open calls to add an extra argument, encoding='utf-16', e.g.:
trial = open(from_file, encoding='utf-16')
and Python will use the correct text encoding to decode the raw bytes to a str, and all your lengths will match up.
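Applied to the whole copy script from the question, the change might look roughly like the sketch below. The function name copy_text and its encoding parameter are hypothetical, introduced only for illustration, and it assumes both the input and the output should be UTF-16. Using with blocks also makes sure both files get closed.

```python
def copy_text(from_file, to_file, encoding="utf-16"):
    """Copy a text file and return the number of characters copied.

    Assumes the input really is encoded as `encoding`
    (UTF-16 with a BOM by default).
    """
    # Decode the raw bytes with the encoding the file was saved in.
    with open(from_file, encoding=encoding) as src:
        data = src.read()

    # Write the output using the same encoding.
    with open(to_file, "w", encoding=encoding) as dst:
        dst.write(data)

    # len(data) now counts characters, not bytes.
    return len(data)
```

Calling copy_text(from_file, to_file) with the two names unpacked from argv should then report n characters rather than 2n+2.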
Alternatively, when saving the files in a reasonable editor, choose an encoding Python will use by default. In modern Python you can force UTF-8 mode regardless of locale settings (with the -X utf8 command-line option or the PYTHONUTF8=1 environment variable, both available since Python 3.7), and UTF-8 is probably the most popular portable encoding, in part because for pure ASCII text it's identical to ASCII, wasting no disk space.
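If re-saving from an editor is inconvenient, a one-off conversion can also be done from Python. This is only a sketch: the function name reencode and its parameters are hypothetical, and it assumes the source file is UTF-16 with a BOM.

```python
def reencode(path_in, path_out, src_encoding="utf-16", dst_encoding="utf-8"):
    """Rewrite a text file in a different encoding (UTF-16 to UTF-8 by default)."""
    # newline="" disables newline translation, so the original line
    # endings are carried over unchanged.
    with open(path_in, encoding=src_encoding, newline="") as src:
        text = src.read()

    with open(path_out, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

For pure ASCII content, the resulting UTF-8 file is exactly n bytes long, so the byte count and the character count agree.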