Lowercasing script in Python vs Perl

In Perl, to lowercase a text file, I could use the following lowercase.perl:

#!/usr/bin/env perl

use warnings;
use strict;

binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");

while(<STDIN>) {
  print lc($_);
}

And on the command line: perl lowercase.perl < infile.txt > lowered.txt

In Python, I could do the same with lowercase.py:

#!/usr/bin/env python
import io
import sys

with io.open(sys.argv[1], 'r', encoding='utf8') as fin:
    with io.open(sys.argv[2], 'w', encoding='utf8') as fout:
        fout.write(fin.read().lower())

And on the command line: python lowercase.py infile.txt lowered.txt

Is the Perl lowercase.perl different from the Python lowercase.py?

Does it stream the input and lowercase it as it outputs? Or does it read the whole file like Python's lowercase.py?

Instead of reading in a whole file, is there a way to stream the input into Python and output the lowered case byte by byte or char by char?

Is there a way to control the command-line syntax such that it follows the Perl STDIN and STDOUT? E.g. python lowercase.py < infile.txt > lowered.txt?

Solution

There seem to be two interleaved issues here, and I address those first. For how to make both Perl and Python accept either invocation with very similar behavior, see the second part of the post.

Short: They differ in how they do I/O, but both can work line-by-line, and the Python code is easily changed to allow the same command-line invocation as the Perl code. Also, both can be written so as to accept input either from a file or from the standard input stream.

(1) The Perl script is "streaming," in the sense that it processes its input from STDIN line-by-line. The Python script as posted is not: fin.read() loads the whole file into memory before lowercasing it. Iterating over the file object instead yields a line at a time, and then the two are comparable in efficiency for large files.

A standard way to both read and write files line-by-line in Python is

with open('infile', 'r') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write(line.lower())

See, for example, these SO posts on processing a very large file and on reading and writing files. Iterating over the file object is the idiomatic way to process a file line-by-line; see, for example, the SO posts on reading a large file line-by-line, on idiomatic line-by-line reading, and another one on line-by-line reading.

Change the first open here to your io.open so that the file name is taken directly from the first command-line argument, and add modes and an encoding as needed.
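As a minimal sketch of that change, assuming UTF-8 files and using io.open as in the question, the line-by-line version could be wrapped in a small function. The name lowercase_file and its parameters are only illustrative, not part of any library.

```python
import io


def lowercase_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, lowercasing one line at a time."""
    with io.open(src_path, "r", encoding=encoding) as fin, \
         io.open(dst_path, "w", encoding=encoding) as fout:
        for line in fin:  # the file object yields one line per iteration
            fout.write(line.lower())
```

In a script, calling lowercase_file(sys.argv[1], sys.argv[2]) (after import sys) would keep the original invocation, python lowercase.py infile.txt lowered.txt, while never holding more than one line in memory.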

(2) The command line with both input and output redirection that you show is a shell feature

./program < input > output

The program is fed lines through the standard input stream (file descriptor 0). They are provided from the file input by the shell via its < redirection. From the GNU Bash manual (see 3.6.1), where "word" stands for our "input":

Redirection of input causes the file whose name results from the expansion of word to be opened for reading on file descriptor n, or the standard input (file descriptor 0) if n is not specified.

Any program can be written to do that, i.e. to act as a filter. For Python you can use

import sys
for line in sys.stdin:
    # write() adds no extra newline; each line already ends with one
    sys.stdout.write(line.lower())

See for example a post on writing filters. Now you can invoke it as script.py < input in a shell.

The code prints to standard output, which can then be redirected by the shell using >. Then you get the same invocation as for the Perl script.

I take it that the standard output redirection > is clear in both cases.
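The Perl script also forces UTF-8 on both streams with binmode. A rough Python counterpart, assuming Python 3 (where sys.stdin.buffer and sys.stdout.buffer expose the underlying byte streams), might look like the sketch below. The function names lowercase_stream and main are made up for illustration.

```python
import io
import sys


def lowercase_stream(fin, fout):
    """Write a lowercased copy of each line read from fin to fout."""
    for line in fin:
        fout.write(line.lower())


def main():
    # Assumption: Python 3. Wrap the raw byte streams so that decoding and
    # encoding are UTF-8 regardless of the locale settings.
    fin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
    fout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    lowercase_stream(fin, fout)
    fout.flush()
```

Calling main() under the usual if __name__ == "__main__": guard would then allow python lowercase.py < infile.txt > lowered.txt. Keeping the loop in its own function also makes it easy to try with in-memory streams such as io.StringIO.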

Finally, you can bring both to nearly identical behavior, allowing either invocation, in this way.

In Perl, there is the following idiom

while (my $line = <>) {
    # process $line
}

The diamond operator <> either reads line by line from all files given on the command line (which are found in @ARGV), or, if no files are given, it gets its lines from STDIN (for instance when data is piped or redirected into the script). From I/O Operators in perlop:

The null filehandle <> is special: it can be used to emulate the behavior of sed and awk, and any other Unix filter program that takes a list of filenames, doing the same to each line of input from all of them. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-" , which when opened gives you standard input. The @ARGV array is then processed as a list of filenames.

In Python you get practically the same behavior by

import fileinput
import sys
for line in fileinput.input():
    # process line, here by lowercasing it
    sys.stdout.write(line.lower())

This also goes through the lines of the files named in sys.argv[1:], defaulting to sys.stdin if the list is empty. From the fileinput documentation:

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty. If a filename is '-', it is also replaced by sys.stdin. To specify an alternative list of filenames, pass it as the first argument to input(). A single file name is also allowed.

In both cases, if there are command-line arguments other than file names, more needs to be done.
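On the Python side, one possible way to handle such extra arguments is to parse them with argparse and hand only the remaining file names to fileinput. This is a sketch, not the only approach; the --upper flag and the function name run are hypothetical and exist only to show an option next to the file names.

```python
import argparse
import fileinput
import sys


def run(argv, out=sys.stdout):
    parser = argparse.ArgumentParser(description="Change the case of text.")
    # Hypothetical option, present only to illustrate a non-file argument.
    parser.add_argument("--upper", action="store_true",
                        help="uppercase instead of lowercasing")
    parser.add_argument("files", nargs="*",
                        help="input files; standard input if none are given")
    args = parser.parse_args(argv)

    convert = str.upper if args.upper else str.lower
    # Passing the list explicitly keeps fileinput from reading sys.argv[1:]
    # itself, where it would treat "--upper" as a file name.
    with fileinput.input(files=args.files) as lines:
        for line in lines:
            out.write(convert(line))
```

A script would call run(sys.argv[1:]). An empty list of files still falls back to standard input, so both invocation styles keep working.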

With this you can use both Perl and Python scripts in either way

lowercase < input > output
lowercase input   > output

Or, for that matter, as cat input | lowercase > output.

All methods here read input and write output line-by-line. This may be further optimized (buffered) by the interpreter, the system, and the shell's redirections. It is possible to change that so as to read and/or write in smaller chunks, but that would be very inefficient and would noticeably slow down programs.
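For completeness, here is a sketch of what chunked (or char-by-char) processing could look like in Python, reading a fixed number of characters at a time from a text stream. The name lowercase_in_chunks and the default chunk size are arbitrary choices for illustration.

```python
def lowercase_in_chunks(fin, fout, chunk_size=64 * 1024):
    """Copy fin to fout in lowercase, reading chunk_size characters at a time."""
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:  # an empty string signals end of input
            break
        fout.write(chunk.lower())
```

With chunk_size=1 this goes character by character, which is the slow case described above. A larger chunk size is mainly useful for input with very long lines or no newlines at all. One caveat: a few Unicode case mappings depend on neighboring characters (the Greek final sigma is the usual example), so lowercasing in arbitrary chunks may in rare cases differ from lowercasing whole lines.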