I have a file called test.txt which looks like this:
Line 1
Line 2
Line 3
Line 3
Line 3
Line 4
Line 8
I need some code which will randomize these lines BUT GUARANTEE that the same text cannot appear on consecutive lines ie "Line 3" must be split up and not appear twice or even three times in a row.
I've seen many variations of this problem answered on here but as yet, none that deal with the repetition of lines.
So far I have tested the following:
shuf test.txt
awk 'BEGIN{srand()}{print rand(), $0}' test.txt | sort -n -k 1 | awk 'sub(/\S /,"")'*
awk 'BEGIN {srand()} {print rand(), $0}' test.txt | sort -n | cut -d ' ' -f2-
cat test.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-
perl -e 'print rand()," $_" for <>;' test.txt | sort -n | cut -d ' ' -f2-
perl -MList::Util -e 'print List::Util::shuffle <>' test.txt
All of which randomize the lines within the file but often end up with the same lines appearing consecutively within the file.
Is there any way I can do this?
This is the data before edit. You can see number 82576483
appears in consecutive lines
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
NOTE: asterisks added to highlight lines of interest; asterisks do not exist in the data file
This is what I need to happen where number 82576483
is spread out across the file rather than being on consecutive lines
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
General approach:
linecnt[]
) to keep count of number of times a line is seenlinecnt[]
into two separate normal arrays: single[1]=<lineX>; single[2]=<lineY>
and multi[1]=<lineA_copy1>; multi[2]=<lineA_copy2>; multi[3]=<lineB_copy1>
single[] / multi[]
) intersperse our printing (ie, print random(single[])
, print randome(multi[])
, print random(single[])
, print randome(multi[])
); NOTE: obviously not truly random but this allows us to maximize chances of separating dupes while limiting cpu overhead (ie, no need to repetitively shuffle hoping for a 'random' ordering that splits dupes)single[]
entries left then print random(single[])
multi[]
entries left then print random(multi[]
); NOTE: assumes OP's comment re: tough!!
means dupes can be printed consecutively if this is all that's leftOne awk
idea:
$ cat dupes.awk
function print_random(a, acnt, ndx) {
ndx=int(1 + rand() * acnt)
print a[ndx]
if (acnt>1) { a[ndx]=a[acnt]; delete a[acnt] }
return --acnt
}
BEGIN { srand() }
{ linecnt[$0]++ }
END { for (line in linecnt) {
if (linecnt[line] == 1)
single[++scnt]=line
else
for (i=1; i<=linecnt[line]; i++)
multi[++mcnt]=line
delete linecnt[line]
}
while (scnt>0 && mcnt>0) {
scnt=print_random(single,scnt)
mcnt=print_random(multi,mcnt)
}
while (scnt>0)
scnt=print_random(single,scnt)
while (mcnt>0)
mcnt=print_random(multi,mcnt)
}
NOTES:
srand()
isn't truly random (eg, two quick, successive runs can generate the same exact output)srand()
)Running against OP's sample set of data:
$ awk -f dupes.awk test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N>
NOTES:
single[] / multi[]
entries and b) 2nd block finishing off the rest of the single[]
entriesAn example of processing duplicates ...
$ cat test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
Result of running our awk
script:
$ awk -f dupes.awk test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
NOTES:
single[] / multi[]
entries and b) 2nd block finishing off the rest of the multi[]
entries