I'm wondering why a simple line count using bash gives me a different number of lines than one computed using Python (version 3.6) for the files given here (train_en.txt) and here (train_de.txt). In bash, I'm using the commands:
wc -l train_en.txt
wc -l train_de.txt
The outputs are 4520620 and 4520620, respectively.
In Python, I'm using the commands:
print(sum(1 for line in open('train_en.txt')))
print(sum(1 for line in open('train_de.txt')))
The outputs are 4521327 and 4521186, respectively.
When I instead use the Python commands
len(open('train_en.txt').read().splitlines())
len(open('train_de.txt').read().splitlines())
I get 4521334 and 4521186, respectively (for which the train_en.txt results don't match those of the previous Python command).
For reference, these are parallel corpora of text produced by concatenating the Common Crawl, Europarl, and News Commentary datasets (in that order) from the WMT '14 English to German translation task and should have the same number of lines.
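The issue is that wc -l counts only actual \n bytes, whereas Python in text mode can treat other characters as line breaks as well (for example a lone \r, or Unicode line separators such as U+2028 in the case of str.splitlines()). One can avoid this by opening the files in binary mode, so that the lines are read as bytestrings. The commands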
print(sum(1 for line in open('train_en.txt', mode='rb')))
print(sum(1 for line in open('train_de.txt', mode='rb')))
len(open('train_en.txt', mode='rb').read().splitlines())
len(open('train_de.txt', mode='rb').read().splitlines())
all result in 4520620 (matching the output of wc -l), which means that the English and German corpora are parallel as desired.
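To make the difference concrete, here is a minimal sketch on a small made-up sample rather than the actual corpora. The sample string and the temporary file are assumptions for illustration only: the sample has three real \n newlines, plus a lone \r and a U+2028 line separator embedded inside lines.

```python
import os
import tempfile

# Hypothetical sample (not taken from the corpora): 3 real "\n" newlines,
# plus a lone "\r" and a Unicode line separator (U+2028) inside lines.
sample = "one\ntwo\rstill two\nthree\u2028still three\n"

with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Text mode uses universal newlines, so the lone "\r" also ends a line.
    with open(path, encoding="utf-8") as f:
        text_iter = sum(1 for _ in f)

    # str.splitlines() additionally splits on U+2028 and similar characters.
    with open(path, encoding="utf-8") as f:
        text_split = len(f.read().splitlines())

    # Binary mode: iteration splits only on b"\n".
    with open(path, "rb") as f:
        bin_iter = sum(1 for _ in f)

    # Counting b"\n" bytes directly, which is what wc -l reports.
    with open(path, "rb") as f:
        newline_bytes = f.read().count(b"\n")
finally:
    os.remove(path)

print(text_iter, text_split, bin_iter, newline_bytes)  # 4 5 3 3
```

Note that bytes.splitlines() still treats a lone \r as a line boundary, so iterating over the file in binary mode (or counting b"\n" bytes) is the closer equivalent of wc -l if the data may contain stray carriage returns.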
Thanks to @CharlesDuffy for the help.