
Remove Duplicate string(s) from data set


I have an input file like the one below:

  1. A->B->C->E
  2. A->B->C->D
  3. B->C->D
  4. C->D
  5. D->E ........ ........


My requirement is to write only unique strings to the output file. If a record's string already appears as a substring of another record, it should not be written to the output file.

Output file should be like below :

  1. A->B->C->E
  2. A->B->C->D
  3. D->E

Skip the 3rd and 4th records, as these strings are already present in the 2nd record.

How can I achieve this with COBOL or a utility program?


Solution

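    1. Reverse each string and sort the input into descending order by the reversed string and its length. Reversing turns a trailing substring into a leading one, so a record sorts directly after the longer records that end with it.
    2. Compare each record with the last one kept; if its reversed text matches the leading characters of that record, drop it as the shorter match.
    3. Sort the remaining records back into their original sequence.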

    For the examples given:

    Input:

    A->B->C->E
    A->B->C->D
    B->C->D
    C->D
    D->E
    

    Released to sort (Phase-1 output):

    00001 10 E>-C>-B>-A
    00002 10 D>-C>-B>-A
    00003 07 D>-C>-B
    00004 04 D>-C
    00005 04 E>-D
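
    As a minimal sketch of how one Phase-1 record is built, the stand-alone program below measures a word and reverses only its text. The program name keydemo and the hard-coded value are illustrative assumptions, and it assumes a compiler that accepts intrinsic functions and *> comments (GnuCOBOL, for example). It should display 07 followed by D>-C>-B, matching record 00003 above without the sequence number.

```cobol
       identification division.
       program-id. keydemo.
       data division.
       working-storage section.
      *> Hypothetical sample record; the real program reads a file.
       01 word-rec  pic x(40) value "B->C->D".
       01 word-len  pic 99    value 0.
       01 sort-word pic x(40) value spaces.
       procedure division.
      *> Length of the text up to the first space.
           inspect word-rec tallying word-len
               for characters before space
      *> Reverse only the text, so the padding stays on the right.
           move function reverse (word-rec (1:word-len))
               to sort-word
           display word-len space sort-word
           stop run
           .
```

    Reversing only the first word-len characters matters: reversing the whole 40-character field would move the trailing spaces to the front and spoil the sort key.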
    

    Returned from sort (Phase-2 input):

    00005 04 E>-D
    00001 10 E>-C>-B>-A
    00002 10 D>-C>-B>-A
    00003 07 D>-C>-B
    00004 04 D>-C
    

    Records 3 and 4 match the trailing characters of record 2 and will be dropped.
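
    The comparison behind that statement can be tried in isolation. The sketch below is an illustration under stated assumptions, not the full program: the name dropdemo and the in-memory table are hypothetical, the five entries are typed in the order the sort returns them, and the original's special case for one-character records is left out. It should report the first three entries as kept and the last two as dropped.

```cobol
       identification division.
       program-id. dropdemo.
       data division.
       working-storage section.
      *> Length (2 digits) plus reversed text, in returned order.
       01 sample-data.
           02 filler pic x(12) value "04E>-D".
           02 filler pic x(12) value "10E>-C>-B>-A".
           02 filler pic x(12) value "10D>-C>-B>-A".
           02 filler pic x(12) value "07D>-C>-B".
           02 filler pic x(12) value "04D>-C".
       01 sample-table redefines sample-data.
           02 sample-entry occurs 5 times.
               03 s-len  pic 9(2).
               03 s-word pic x(10).
       01 comp-word pic x(10) value high-values.
       01 idx       pic 9     value 1.
       procedure division.
           perform varying idx from 1 by 1 until idx > 5
               if s-word (idx) (1:s-len (idx))
                       not = comp-word (1:s-len (idx))
      *>           No leading match: keep it as the new reference.
                   move s-word (idx) to comp-word
                   display "kept    " s-word (idx)
               else
                   display "dropped " s-word (idx)
               end-if
           end-perform
           stop run
           .
```

    Starting comp-word at high-values guarantees that the first record never matches, so it is always kept.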

    Phase-2 output (edited):

    00005 D->E
    00001 A->B->C->E
    00002 A->B->C->D
    

    Output (Resequenced):

    A->B->C->E
    A->B->C->D
    D->E
    

    The display statements in the following code are there only to show the intermediate records, which are normally hidden.

    Code:

       environment division.
       input-output section.
       file-control.
           select word-out assign "E:w2out.txt"
               organization line sequential.
           select word-list assign "E:w2in.txt"
               organization line sequential.
           select ph-2-wrk assign "ph-2.txt"
               organization sequential.
           select sort-work-1 assign "sortwork.dat".
           select sort-work-2 assign "sortwork.dat".
       data division.
       file section.
       fd word-out.
       01 word-out-rec pic x(40).
       fd word-list.
       01 word-rec pic x(40).
       fd ph-2-wrk.
       01 ph-2-rec.
           02 ph-2-seq pic 9(5).
           02 ph-2-word pic x(40).
       sd sort-work-1.
       01 sort-1-rec.
           02 sort-1-seq pic 9(5).
           02 sort-1-len pic 9(2).
           02 sort-1-word pic x(40).
       sd sort-work-2.
       01 sort-2-rec.
           02 sort-2-seq pic 9(5).
           02 sort-2-word pic x(40).
       working-storage section.
       01 word-len pic 99 value 0.
       01 seq-num pic 9(5) value 0.
       01 comp-word pic x(40) value high-values.
       procedure division.
           sort sort-work-1
                   descending sort-1-word sort-1-len
               input procedure phase-1
               output procedure phase-2
           sort sort-work-2
                   ascending sort-2-seq
               using ph-2-wrk
               output procedure write-output-list
           stop run
           .
       phase-1.
           display "Released to sort:"
           open input word-list
           perform until exit
               read word-list
               at end exit perform
               end-read
               perform get-word-len
      *> Skip blank lines: (1:0) is not a valid reference modifier.
               if word-len > 0
                   add 1 to seq-num
                   move seq-num to sort-1-seq
                   move word-len to sort-1-len
                   move function reverse (word-rec (1:word-len))
                       to sort-1-word
                   display sort-1-seq space sort-1-len space
                       sort-1-word
                   release sort-1-rec
               end-if
           end-perform
           close word-list
           .
       get-word-len.
           move 0 to word-len
           inspect word-rec tallying word-len
               for characters before space
           .
       phase-2.
           display "Returned from sort:"
           open output ph-2-wrk
           perform until exit
               return sort-work-1
               at end exit perform
               end-return
               display sort-1-seq space sort-1-len space sort-1-word
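      *> Keep the record unless it matches the leading characters of
      *> the last kept record. One-character records are always kept.
               if sort-1-word (1:sort-1-len)
                       not = comp-word (1:sort-1-len)
                 or sort-1-len = 1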
                   move sort-1-word to comp-word
                   move function reverse (sort-1-word (1:sort-1-len))
                       to ph-2-word
                   move sort-1-seq to ph-2-seq
                   write ph-2-rec
               end-if
           end-perform
           close ph-2-wrk
           .
       write-output-list.
           open output word-out
           perform until exit
               return sort-work-2
               at end exit perform
               end-return
               write word-out-rec from sort-2-word
           end-perform
           close word-out
           .
    

    This program was developed and tested, without the display statements, against a word list of 69,904 words, hence the five-digit sequence number and the 40-character word size. Reversing the text to compare trailing substrings, and reversing it again for output, appears to be the speed bottleneck.