Tags: linux, bash, perl, awk

Randomize txt file in Linux but guarantee no repetition of lines


I have a file called test.txt which looks like this:

Line 1
Line 2
Line 3
Line 3
Line 3
Line 4
Line 8

I need some code that will randomize these lines but GUARANTEE that the same text cannot appear on consecutive lines, i.e. "Line 3" must be split up and must not appear two or even three times in a row.

I've seen many variations of this problem answered on here but, as yet, none that deal with the repetition of lines.

So far I have tested the following:

shuf test.txt

awk 'BEGIN{srand()}{print rand(), $0}' test.txt | sort -n -k 1 | awk 'sub(/\S* /,"")'

awk 'BEGIN {srand()} {print rand(), $0}' test.txt | sort -n | cut -d ' ' -f2-

cat test.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-

perl -e 'print rand()," $_" for <>;' test.txt | sort -n | cut -d ' ' -f2-

perl -MList::Util -e 'print List::Util::shuffle <>' test.txt

All of these randomize the lines within the file, but they often still end up with identical lines appearing consecutively.

Is there any way I can do this?

This is the data before editing; you can see the number 82576483 appears on consecutive lines:

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

NOTE: asterisks added to highlight lines of interest; asterisks do not exist in the data file

This is what I need to happen, with the number 82576483 spread out across the file rather than sitting on consecutive lines:

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

Solution

  • General approach:

    • use an associative array (linecnt[]) to count the number of times each line is seen
    • break linecnt[] into two ordinary arrays: single[1]=<lineX>; single[2]=<lineY> and multi[1]=<lineA_copy1>; multi[2]=<lineA_copy2>; multi[3]=<lineB_copy1>
    • while both arrays (single[] / multi[]) still have entries, intersperse the printing (i.e., print random(single[]), print random(multi[]), print random(single[]), print random(multi[])); NOTE: obviously not truly random, but this maximizes the chance of separating dupes while limiting cpu overhead (i.e., no need to repeatedly shuffle hoping for a 'random' ordering that splits the dupes); see the illustration after this list
    • if any single[] entries are left, print random(single[]) until they're gone
    • if any multi[] entries are left, print random(multi[]); NOTE: assumes OP's comment ("tough!!") means dupes can be printed consecutively if that's all that's left
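
    For example (using the awk script shown below), the 7-line test.txt above splits into single[] = {Line 1, Line 2, Line 4, Line 8} and multi[] = {Line 3, Line 3, Line 3}, so the output follows a single/multi/single/... pattern; one possible ordering (the exact order varies from run to run):

    Line 2
    Line 3
    Line 8
    Line 3
    Line 1
    Line 3
    Line 4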

    One awk idea:

    $ cat dupes.awk
    
    # print_random(): pick a random index from a[1..acnt], print that entry,
    # move the last entry into the freed slot, and return the reduced count
    function print_random(a, acnt,     ndx) {
        ndx=int(1 + rand() * acnt)
        print a[ndx]
        if (acnt>1) { a[ndx]=a[acnt]; delete a[acnt] }
        return --acnt
    }
    
    BEGIN { srand() }
    
          { linecnt[$0]++ }          # count occurrences of each input line
    
    # split lines into single[] (seen once) and multi[] (seen 2+ times),
    # then interleave random picks from the two arrays
    END   { for (line in linecnt) {
                if (linecnt[line] == 1)
                   single[++scnt]=line
                else
                   for (i=1; i<=linecnt[line]; i++)
                       multi[++mcnt]=line
                delete linecnt[line]
            }
    
            # alternate single/multi picks while both arrays have entries
            while (scnt>0 && mcnt>0) {
                  scnt=print_random(single,scnt)
                  mcnt=print_random(multi,mcnt)
            }
    
            # flush whatever remains in either array
            while (scnt>0)
                  scnt=print_random(single,scnt)
    
            while (mcnt>0)
                  mcnt=print_random(multi,mcnt)
          }
    

    NOTES:

    • srand() with no argument seeds from the current time, so two quick, successive runs can generate exactly the same output
    • additional steps could be added to ensure quick, successive runs don't generate the same output (eg, providing an OS-level seed to srand(); see the sketch below)
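
    A minimal sketch of that last point, assuming the seed is read from /dev/urandom and passed in from the shell via -v (the variable name seed and the file name dupes_seeded.awk are placeholders, not part of the original script):

    $ cat dupes_seeded.awk
    # identical to dupes.awk except BEGIN accepts an externally supplied seed
    BEGIN { if (seed != "") srand(seed+0); else srand() }
    # ... rest of dupes.awk unchanged ...

    $ awk -v seed="$(od -An -N4 -tu4 /dev/urandom)" -f dupes_seeded.awk test.txt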

    Running against OP's sample set of data:

    $ awk -f dupes.awk test.txt
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N>
    

    NOTES:

    • data lines cut for brevity
    • blank line added to highlight a) 1st block of interleaved single[] / multi[] entries and b) 2nd block finishing off the rest of the single[] entries
    • repeated runs will generate different results

    An example of processing duplicates ...

    $ cat test.txt
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
    

    Result of running our awk script:

    $ awk -f dupes.awk test.txt
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
    REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
    

    NOTES:

    • blank line added to highlight a) 1st block of interleaved single[] / multi[] entries and b) 2nd block finishing off the rest of the multi[] entries
    • repeated runs will generate different results
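
    As a quick sanity check (not part of the original answer), the output can be piped through uniq -d, which prints only lines that repeat consecutively; empty output means no adjacent duplicates (barring the leftover-dupes case noted above):

    $ awk -f dupes.awk test.txt | uniq -d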