Search code examples
openedgeprogress-4gl

Custom sorting and deduping


I have read data from a csv file containing many duplicate email addresses into a temp table. The format is essentially id, emailtype-description, email.

Here is an example of some data:

id    emailtype-description  email
1     E-Mail                 [email protected]
1     preferred E-mail       [email protected]
2     2nd E-mail             [email protected]
2     preferred-Email        [email protected]
2     family E-Mail          [email protected]
    cInputFile = SUBSTITUTE(cDataDirectory, "Emails").
    
    INPUT STREAM csv FROM VALUE(cInputFile).
    IMPORT STREAM csv DELIMITER "," ^ NO-ERROR.
    
    REPEAT TRANSACTION:
        CREATE ttEmail.
        IMPORT STREAM csv DELIMITER ","
            ttEmail.uniqueid   
            ttEmail.emailTypeDescription 
            ttEmail.emailAddr
            .
    END.
    
    INPUT STREAM csv CLOSE.

I want to dedupe the rows, but I don't want to do this randomly. I want to make sure that certain types take priority over others. For instance, some are marked with the type "preferred E-mail" and those should always remain if they exist, additional types take precedent over others, so "E-mail" will take precedent over "2nd-Email" or "family E-Mail".

I'd like to do in Progress code the equivalent of a custom sort of emailtype-description, then a de-dupe. That way I could define the sort order and then dedupe to retain the emails and the types by priority.

Is there a way to do this to my table in Progress? I want to sort first by uniqueid, then by emailtype-description, but I want a custom sort, not an alphabetical sort. What is the best approach?


Solution

  • When you say that you want a custom sort, not alphabetical do you mean that you want to sort by the emailtype in a non-alphabetical way? If so then I think that you would need to translate the email type into a field that sorts the way that you wish. Something along these lines:

        /* first add a field to your ttEmail called emailTypeSortOrder */
        
        define variable emailTypeSortOrderList as character no-undo.
        
        emailTypeSortOrderList = "preferred E-mail,E-mail,2nd-Email,family E-mail".
        
        cInputFile = SUBSTITUTE(cDataDirectory, "Emails").
            
            INPUT STREAM csv FROM VALUE(cInputFile).
            IMPORT STREAM csv DELIMITER "," ^ NO-ERROR.
            
            REPEAT TRANSACTION:
                CREATE ttEmail.
                IMPORT STREAM csv DELIMITER ","
                    ttEmail.uniqueid   
                    ttEmail.emailTypeDescription 
                    ttEmail.emailAddr
                    .
    
                /* classify the email type sort order
                 */
    
                ttEmail.emailTypeSortOrder = lookup( emailTypeDescription, emailTypeSortOrderList ).
                if ttEmail.emailTypeSortOrder <= 0 then emailTypeSortOrder = 9999999.
    
            END.
            
            INPUT STREAM csv CLOSE.
    

    And now you can sort and de-duplicate using the newly ordered field:

    for each ttEmail break by ttEmail.emailAddr by ttEmail.emailTypeSortOrder:
      if first-of( ttEmail.emailAddr ) then
        next. /* always keep the first one */
       else
        delete ttEmail.  /* remove unwanted duplicates... */
    end.