Why does Python not see all the rows in a file?


I count the number of rows (lines) in a file using Python as follows:

n = 0
for line in file('input.txt'):
   n += 1
print n

I run this script under Windows.

Then I count the number of rows in the same file using the Unix command:

wc -l input.txt

Counting with the Unix command gives a significantly larger number of rows.

So, my question is: why does Python not see all the rows in the file? Or is it a question of definition?


Solution

  • You most likely have a file with one or more DOS EOF (CTRL-Z) characters in it, ASCII codepoint 0x1A. When Windows opens a file in text mode, it still honours the old DOS semantics and treats that character as the end of the file, so Python stops reading at the first 0x1A and never sees the lines after it. See Line reading chokes on 0x1A.

    Only by opening a file in binary mode can you bypass this behaviour. To do so and still count lines, you have two options:

    • Read in chunks, then count the number of line separators in each chunk:

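      import os
      from functools import partial

      def bufcount(filename, linesep=os.linesep.encode('ascii'), buf_size=2 ** 15):
          # linesep must be a byte string because the file is read in binary mode
          lines = 0
          with open(filename, 'rb') as f:
              last = b''
              # read buf_size bytes at a time until read() returns an empty chunk
              for buf in iter(partial(f.read, buf_size), b''):
                  lines += buf.count(linesep)
                  if last and last + buf[:1] == linesep:
                      # count a two-byte line separator straddling a chunk boundary
                      lines += 1
                  if len(linesep) > 1:
                      # slicing keeps this a byte string on both Python 2 and 3
                      last = buf[-1:]
          return lines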
      

      Keep in mind that on Windows os.linesep is \r\n; adjust the linesep argument to match your file, because in binary mode line separators are not translated to \n.

    • Use io.open(); the io file objects always open the underlying file in binary mode and then do the newline translation themselves:

      import io
      
      with io.open(filename) as f:
          lines = sum(1 for line in f)
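
To check whether a stray 0x1A byte really is the cause, something like the following minimal sketch could help. It reads the file in binary mode, reports the byte offset of the first 0x1A, and counts newline bytes the way wc -l does. The function names find_ctrl_z and count_newlines are hypothetical, chosen only for this illustration.

```python
def find_ctrl_z(filename, buf_size=2 ** 15):
    """Return the byte offset of the first 0x1A byte, or -1 if there is none."""
    offset = 0
    with open(filename, 'rb') as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                # reached the real end of the file without finding 0x1A
                return -1
            pos = buf.find(b'\x1a')
            if pos != -1:
                return offset + pos
            offset += len(buf)


def count_newlines(filename, buf_size=2 ** 15):
    """Count newline bytes (0x0A), which is the number wc -l reports."""
    total = 0
    with open(filename, 'rb') as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                return total
            total += buf.count(b'\n')
```

If find_ctrl_z('input.txt') returns a non-negative offset and count_newlines('input.txt') agrees with wc -l, the text-mode loop was most likely cut short at that offset.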