When opening file descriptors in a bash script and reading files line by line, the script terminates with a memory allocation error after processing 70K lines:
xmalloc: cannot allocate 11541 bytes (0 bytes allocated)
Environment: MINGW32 Bash: 3.1.20(4)-release (i686-pc-msys) OS: Windows 7
The size of input files: 1.2 GB each
The script follows:
#!/bin/bash
echo Left: $1
echo Right: $2
echo >"$1.diff"
echo >"$2.diff"
exec 4<"$1"
exec 5<"$2"
LINECOUNT=0
while [ $? == 0 ]
do
exec 0<&4
read LEFTLINE
exec 0<&5
read RIGHTLINE
if [ $? != 0 ]
then
exit -1
fi
LINECOUNT=$(($LINECOUNT + 1))
LINEMOD=$(($LINECOUNT % 1000))
if [[ $LINEMOD == 0 ]]
then
echo Line: $LINECOUNT
fi
if [ $LEFTLINE != $RIGHTLINE ]
then
echo $LEFTLINE >> "$1.diff"
echo $RIGHTLINE >> "$2.diff"
echo Mismatch found
fi
done
As noted above, the script runs for a long time, processes about 70K lines, and then terminates. I assume it terminates because it exhausts the memory available to a 32-bit process.
The purpose of the script is to open two files of the same format and length and compare them line by line. It creates two output files and writes the mismatching lines to them. I had to write the script because every comparison tool I had at my disposal either crashed with "out of memory" errors or hung. I was surprised when my script crashed as well, and I had to rewrite it in C++ to make it work. Now I am trying to understand why the bash script failed. In theory it should not accumulate the file content in memory; it should read one line at a time and advance the file pointer. Is there another approach to this problem that I could have implemented as a bash script?
Update: Tested the following modification. It also crashed.
while IFS= read -u4 -r LEFTLINE && IFS= read -u5 -r RIGHTLINE
do
LINECOUNT=$(($LINECOUNT + 1))
LINEMOD=$(($LINECOUNT % 1000))
With valuable input from the people commenting on the question, the solution was found. Petesh correctly pointed out that there was a bug (or several bugs) in earlier versions of bash that caused memory leaks. Here is the link to the ticket provided by Petesh. Fortunately, the leak was fixed in more recent versions of bash, so the solution is to update bash. I installed Cygwin with bash version 4.1.17(9)-release (i686-pc-cygwin), and my script completed successfully with only 1.5 MB of memory consumed and no growth in memory usage. John Zwinck also tested Bash 4.1.5, x86_64 and confirmed that the bug was fixed in that version too.
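Since the fix amounts to running a newer bash, a script can check the interpreter version before starting a long read loop. The following is only a minimal sketch: the helper name `bash_at_least` is hypothetical, and the 4.1 threshold is an assumption based on the oldest version reported working here, not the exact release in which the leak was fixed.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the running bash is at least MAJOR.MINOR.
# BASH_VERSINFO is a built-in array: [0] is the major version, [1] the minor.
bash_at_least() {
    local major=$1 minor=$2
    if (( BASH_VERSINFO[0] > major )); then
        return 0
    fi
    (( BASH_VERSINFO[0] == major && BASH_VERSINFO[1] >= minor ))
}

# Assumed threshold: 4.1 is the oldest version confirmed leak-free above.
if ! bash_at_least 4 1; then
    echo "Warning: bash $BASH_VERSION may leak memory in long read loops" >&2
fi
```

The warning goes to stderr so it does not mix with the script's progress output.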
While the issue was being resolved, Mark Setchell and John Zwinck suggested a few improvements to the script. The modifications didn't fix the problem, but they made the script simpler and more reliable with different file formats. The final version of the script follows:
#!/bin/bash
echo "Left: $1"
echo "Right: $2"
>"$1.diff"
>"$2.diff"
LINECOUNT=0
while IFS= read -u4 -r LEFTLINE && IFS= read -u5 -r RIGHTLINE
do
LINECOUNT=$(($LINECOUNT + 1))
LINEMOD=$(($LINECOUNT % 1000))
if [[ $LINEMOD == 0 ]]
then
echo Line: $LINECOUNT
fi
if [ "$LEFTLINE" != "$RIGHTLINE" ]
then
printf '%s\n' "$LEFTLINE" >> "$1.diff"
printf '%s\n' "$RIGHTLINE" >> "$2.diff"
echo Mismatch found
fi
done 4<"$1" 5<"$2"
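For reuse, the final loop can also be wrapped in a function. This is a hedged sketch rather than part of the original solution: the function name `compare_lines` and the `MISMATCHES` variable are illustrative names, and `printf` is used instead of `echo` so that lines containing whitespace runs or glob characters are written out verbatim.

```shell
#!/bin/bash
# Hypothetical wrapper around the loop above: compare two files line by line,
# write differing lines to "<file>.diff", and store the count in MISMATCHES.
compare_lines() {
    local left=$1 right=$2 leftline rightline
    MISMATCHES=0
    : >"$left.diff"     # truncate or create the output files
    : >"$right.diff"
    # fd 4 reads the left file, fd 5 the right one; only the current line
    # of each file is held in a shell variable at any time.
    while IFS= read -u4 -r leftline && IFS= read -u5 -r rightline
    do
        if [ "$leftline" != "$rightline" ]
        then
            printf '%s\n' "$leftline" >>"$left.diff"
            printf '%s\n' "$rightline" >>"$right.diff"
            MISMATCHES=$((MISMATCHES + 1))
        fi
    done 4<"$left" 5<"$right"
}
```

A call such as `compare_lines "$1" "$2"; echo "$MISMATCHES mismatching lines"` would then replace the body of the script. Like the original loop, this sketch stops when either file runs out of lines.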