
Creating Arrays of Arrays in Bash and Sorting By Derived Values


I'm having issues with creating and sorting an array in Bash which takes its contents as lines from a command, takes certain parts of each line and operates on them before appending them to each line in the array.

To clarify, the command "bogoutil -d wordlist.db" gives output in this form:

hello 428 3654 20151116

Except that there are a few million of these lines.

I want to load each line of the command's output into an array, take the absolute value of the first number minus the second, append that value onto the line in a new array, and then sort the new array by that new value.

The issue I'm having is that I suspect the IFS needs to change to "\n" to put each line of bogoutil output into an array, but then it needs to change again to tokenise the two integers in each line. It's hard to work out what my error is so far, because there are well over 10 million lines in the array, but I can tell from the output that it is not what I should be getting; I think it is merely listing each line and not tokenising properly. Generally it runs for a while, prints a ton of output into the shell that is definitely not what I am expecting (I think it's just a few of the tokens, but definitely not all of them) and then prints

sort: cannot read: resultsarray: No such file or directory

Here is what I've written thus far

#!/bin/bash

IFS=$"\n" #set the IFS so it tokenises each line in the command
for i in $( bogoutil -d wordlist.db )
    do 
            echo $i
            OUTPUT=( ${i// \n} ) #swap out space for a newline so i can
                                 #tokenise by spaces
            BAD=${OUTPUT[1]}
            echo $BAD
            GOOD=${OUTPUT[2]}
            echo $GOOD
            DIFF=$GOOD-$BAD
            echo $DIFF
            if [ "$DIFF" -lt "0" ]
            then
                    DIFF=$DIFF \* -1
            fi
            NEWOUT="$OUTPUT $DIFF" #append the abs of the difference to
                                   #the line so i can sort by it
            resultsarray[i]=$NEWOUT
    done

sort -t " " -k 5 -g resultsarray

echo "${resultsarray[@]:0:10}"

Any assistance would be greatly appreciated. I'm really stumped here and not sure why it's not working. I suspect it has something to do with the way I'm trying to tokenise each line of output, but I'm not sure. The other possibility (given that it lists tokens for a while and then just stops) is that there are simply too many elements in the array and it runs out of allocated space. Is that a possibility?

Thanks in advance, any help you can provide is much appreciated.

EDIT: To clarify expected input and output.

A sample input would be

hello 4 1 20151116
goodbye 0 256 20151116
grant 428 3654 20151116

The expected output for that would be

grant 428 3654 20151116 3226
goodbye 0 256 20151116 256
hello 4 1 20151116 3

As you can see, it's sorted by the absolute value of the difference between the first and second numbers, largest first. There are no negatives in the dataset; the lowest value is 0.

EDIT: the awk solution below works great! I'm not sure how one would do this with Bash, but I suspect Bash isn't the right way to go about it and it's probably better to use awk anyway. Thanks for all the help, it was very much appreciated!


Solution

  • If I understand your question correctly (this is why it is so important to include sample output for your sample input), then given a test file like this:

     cat tst.file
     hello 428 3654 20151116
     goodby -428 3655 20151116
    

    This assumes that the input is NOT tab-separated data. Also, if you care to update your question with a slightly larger data set, I'll be happy to try to confirm this is a good solution. You might also want to include the required output for your input ;-) (hint, hint).

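     # append |$2 - $3| after a tab, then sort on that tab-delimited column, largest first
     tabChar=$'\t'
     awk '
        function abs(num) { return (num < 0) ? -num : num }
        { res = abs($2 - $3); print $0 "\t" res }' tst.file \
     | sort -t"${tabChar}" -k2,2nr
    

    produces output like

    goodby -428 3655 20151116  4083
    hello 428 3654 20151116    3226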
    

    Some sort programs accept -t"\t" to set a tab as the field delimiter. Mine doesn't, so I define it separately, like tabChar=" " where that is a real tab character inside the double quotes; in bash, tabChar=$'\t' produces the same value.
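As a small illustration of the tab-delimiter point above, here is a sketch that assumes bash (for the $'\t' quoting) and uses two made-up records whose last column is already the derived value. With a tab as the delimiter, each line has exactly two fields, so -k2,2 selects the appended number.

```shell
# ANSI-C quoting gives a literal tab in bash (not available in plain POSIX sh).
tabChar=$'\t'

# Two sample records: the original line, a tab, then the derived value.
sorted=$(printf 'hello 4 1 20151116\t3\ngrant 428 3654 20151116\t3226\n' \
  | sort -t"${tabChar}" -k2,2nr)

printf '%s\n' "$sorted"
```

The n flag compares numerically and r reverses the order, so the largest difference comes first.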


    As I mentioned in the comments, you can simplify the above (assuming standard line endings from your program) by piping the command straight into awk:

    bogoutil -d wordlist.db \
    | awk '....' \
    | sort -t"${tabChar}" -k2,2nr
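To make the pipeline concrete, here is a sketch run against the three sample lines from the question. The fake_bogoutil function is a hypothetical stand-in for `bogoutil -d wordlist.db` so the example runs without a wordlist, and the final tr step is my own addition to turn the tab back into a space so the result matches the format the question asks for.

```shell
# Hypothetical stand-in for `bogoutil -d wordlist.db`, using the question's sample input.
fake_bogoutil() {
  printf '%s\n' \
    'hello 4 1 20151116' \
    'goodbye 0 256 20151116' \
    'grant 428 3654 20151116'
}

tabChar=$'\t'

# Append |$2 - $3| after a tab, sort on that column (numeric, descending),
# then swap the tab for a space.
result=$(fake_bogoutil \
  | awk '
      function abs(num) { return (num < 0) ? -num : num }
      { print $0 "\t" abs($2 - $3) }' \
  | sort -t"${tabChar}" -k2,2nr \
  | tr '\t' ' ')

printf '%s\n' "$result"
```

On that input this prints the grant line (3226) first, then goodbye (256), then hello (3). Because everything is streamed through pipes, nothing has to be held in a shell array, which should matter with millions of lines.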
    

    IHTH
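For completeness, since the question also asks how this might look in Bash itself, here is a rough sketch rather than a recommendation. It assumes every line has exactly four whitespace-separated fields and that the counts are plain non-negative integers without leading zeros; fake_bogoutil is again a hypothetical stand-in for the real command. The read builtin splits each line on the default IFS, so IFS never has to be changed, and the lines are piped into sort because sort reads files or standard input, not shell arrays (which is why `sort ... resultsarray` reports "No such file or directory").

```shell
# Hypothetical stand-in for `bogoutil -d wordlist.db`, using the question's sample input.
fake_bogoutil() {
  printf '%s\n' \
    'hello 4 1 20151116' \
    'goodbye 0 256 20151116' \
    'grant 428 3654 20151116'
}

# read assigns the four fields to named variables, one line at a time.
with_diff=$(
  fake_bogoutil | while read -r word bad good date; do
    delta=$(( good - bad ))
    if (( delta < 0 )); then
      delta=$(( -delta ))   # absolute value
    fi
    printf '%s %s %s %s %s\n' "$word" "$bad" "$good" "$date" "$delta"
  done | sort -k5,5nr       # fifth whitespace-separated field, numeric, descending
)

printf '%s\n' "$with_diff"
```

A shell loop like this is likely to be far slower than awk on millions of lines, so I would still reach for the awk version for the real data.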