
Efficient way to add/append huge files


Below is a shell script written to process a huge file. It reads a fixed-length file line by line, extracts substrings, and appends them to another file in delimited form. It works perfectly, but it is too slow.

array=() # Create array
while IFS='' read -r line || [[ -n "$line" ]] # Read a line
do
    coOrdinates="$(echo -e "${line}" | grep POSITION | cut -d'(' -f2 | cut -d')' -f1 | cut -d':' -f1,2)"
    if   [[ -z "${coOrdinates// }" ]];
    then
        echo "Not adding"
    else
        array+=("$coOrdinates")
    fi
done < "$1_CTRL.txt"

while read -r line;
do
    result='"'
    for e in "${array[@]}"
    do
        SUBSTRING1=`echo "$e" | sed 's/.*://'`
        SUBSTRING=`echo "$e" | sed 's/:.*//'`
        result1=`perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)"`
        result1="$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
        result=$result$result1'"'',''"'
    done
    echo $result >> $1_1.txt
done < "$1.txt"

Earlier I used the cut command and then changed to the approach above, but the processing time did not improve. Can you please suggest what changes would reduce the processing time? Thanks in advance.

Update:

Sample content of the input file:

XLS01G702012        000034444132412342134

Control file:

OPTIONS (DIRECT=TRUE, ERRORS=1000, rows=500000) UNRECOVERABLE
  load data
   CHARACTERSET 'UTF8'
   TRUNCATE
   into table icm_rls_clientrel2_hg
   trailing nullcols
   (
   APP_ID POSITION(1:3) "TRIM(:APP_ID)",
   RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"
  )

Output file:

"LS0","1G702012 0000"

Solution

  • I suggest using pure bash, which avoids subshells:

    if [[ $line =~ POSITION ]] ; then      # grep POSITION
        coOrdinates="${line#*(}"           # cut -d'(' -f2
        coOrdinates="${coOrdinates%%)*}"   # cut -d')' -f1 (%% cuts at the first ')')
        coOrdinates="${coOrdinates/:/ }"   # "offset:length" becomes "offset length"
        if   [[ -z "${coOrdinates// }" ]]; then
            echo "Not adding"
        else
            array+=("$coOrdinates")
        fi
    fi
    

    More efficient, as suggested by gniourf_gniourf:

    if [[ $line =~ POSITION\(([[:digit:]]+):([[:digit:]]+)\) ]]; then 
        array+=( "${BASH_REMATCH[*]:1:2}" )
    fi
    

    Similarly, for the substring extraction:

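    # [: ] accepts both the original "offset:length" entries and space-separated ones
    SUBSTRING1=${e#*[: ]} # $( echo "$e" | sed 's/.*://' )
    SUBSTRING=${e%[: ]*}  # $( echo "$e" | sed 's/:.*//' )
    
    # Perl's substr(EXPR, OFFSET, LENGTH) takes a 0-based offset and a length, as does ${var:offset:length}
    result1=${line:$SUBSTRING:$SUBSTRING1} # $( perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)" )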
    
    
    #result1= # "$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
    # trim, if necessary
    result1="${result1%"${result1##*[^[:space:]]}"}"    # right
    result1="${result1#"${result1%%[^[:space:]]*}"}"    # left
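
As a rough sketch of the substring-and-trim step, the helper below extracts one field and trims it using only builtins. The name `extract_field` and its argument order are hypothetical (they are not part of the original answer), and the sample line assumes the first value is padded to 20 columns. The result is stored with `printf -v` instead of being returned through `$(...)`, because command substitution would bring the subshell back.

```shell
#!/usr/bin/env bash
# extract_field LINE OFFSET LENGTH VARNAME  (hypothetical helper)
# Stores the trimmed substring of LINE in the variable named VARNAME.
# OFFSET is 0-based, as in Perl's substr and bash's ${var:offset:length}.
extract_field() {
    local _line=$1 _offset=$2 _length=$3 _field
    _field=${_line:$_offset:$_length}                  # substring without a subshell
    _field="${_field%"${_field##*[^[:space:]]}"}"      # strip trailing whitespace
    _field="${_field#"${_field%%[^[:space:]]*}"}"      # strip leading whitespace
    printf -v "$4" '%s' "$_field"                      # assign to the caller's variable
}

# Sample record modelled on the question; the column-20 padding is an assumption.
printf -v line '%-20s%s' 'XLS01G702012' '000034444132412342134'

extract_field "$line" 1 3 app_id       # offset 1, length 3
extract_field "$line" 12 10 padded     # eight blanks followed by "00"
printf '"%s","%s"\n' "$app_id" "$padded"    # prints "LS0","00"
```

Inside the inner `for e in "${array[@]}"` loop, a call such as `extract_field "$line" "$SUBSTRING" "$SUBSTRING1" result1` would stand in for the perl and sed pipeline.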
    

    gniourf_gniourf also suggests moving the grep out of the loop:

    while read ...; do
     ...
    done < <(grep POSITION ...) 
    

    This gives extra efficiency: while/read loops are very slow in Bash, so prefiltering as much as possible speeds up the process considerably.
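
The prefiltering idea can be sketched end to end as follows. This is a minimal illustration under assumptions, not the original author's script: the temporary file stands in for `$1_CTRL.txt`, and keeping the regex in a variable named `re` is a choice made here so that the pattern needs no special quoting inside `[[ ... ]]`.

```shell
#!/usr/bin/env bash
# Sketch: collect "offset length" pairs from a control file, with grep
# prefiltering the lines so the slow while/read loop only sees matches.

ctrl_file=$(mktemp)     # assumption: stand-in for "$1_CTRL.txt" from the question
printf '%s\n' \
    'load data' \
    '   (' \
    '   APP_ID POSITION(1:3) "TRIM(:APP_ID)",' \
    '   RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"' \
    '  )' > "$ctrl_file"

re='POSITION\(([[:digit:]]+):([[:digit:]]+)\)'   # both numbers may have several digits
array=()
while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        # BASH_REMATCH[1] and [2] hold the two captured numbers
        array+=( "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}" )
    fi
done < <(grep POSITION "$ctrl_file")    # grep runs once, outside the loop

rm -f "$ctrl_file"
printf '%s\n' "${array[@]}"
# prints:
# 1 3
# 4 21
```

Because the loop is fed through process substitution instead of a pipe, it runs in the current shell and `array` is still populated after `done`.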