Search code examples
pythondataframedata-analysis

Remove duplicated value inside in cell in python with multiple row


Hi Im cleaning bunch of emails and need help of removing data on each cell.In my data frame it looks like this

I want to remove the duplicated values on each value on each cell with multiple row duplicates too.

My data

left center
One email1emailcom email2emailcom email3emailcom
Two email1emailcom email15emailcom email12emailcom
Three email15emailcom email16emailcom

Output

left center
One email1emailcom email2emailcom email3emailcom
Two email15emailcom email12emailcom
Three email16emailcom

Solution

  • If you want to keep a running track of emails you have already seen, then you likely want to leverage some global set() construct, say an actual global or a class variable or a closure. I will use the latter here:

    import pandas
    
    df = pandas.DataFrame({
        "left": [
            "One",
            "Two",
            "Three"
        ],
        "center": [
            "email1emailcom email2emailcom email3emailcom",
            "email1emailcom email15emailcom email12emailcom",
            "email15emailcom email16emailcom"
        ]
    })
    
    ## ---------------------
    ## Construct a closure that will use a set() to keep track of distinct emails
    ## and return work function that will filter out duplicates using the set
    ## from the emails string passed to it
    ## ---------------------
    def make_email_filter():
    
        distinct_emails = set()  ## keep track of distinct emails
    
        ## ---------------------
        ## Work function that will filter out duplicates including in-row
        ## and cross-row duplicates
        ## ---------------------
        def _work(emails_string):
            these_emails = []
            for email in emails_string.split():
                if email in distinct_emails:
                    continue
                distinct_emails.add(email)
                these_emails.append(email)
            return " ".join(these_emails)
        ## ---------------------
    
        ## ---------------------
        ## Return our work function back to the caller
        ## ---------------------
        return _work
        ## ---------------------
    ## ---------------------
    
    ## ---------------------
    ## Update our column based on application of the work function
    ## ---------------------
    df["center"] = df["center"].apply(make_email_filter())
    ## ---------------------
    
    print(df)
    

    That will give us a result like:

        left                                        center
    0    One  email1emailcom email2emailcom email3emailcom
    1    Two               email15emailcom email12emailcom
    2  Three                               email16emailcom
    

    If you want to ignore both emails you have already seen AND emails from a bad domain, you can just expand your test. Here I will do two tests just to be explicit, but in practice, I would combine these into one test.

    import pandas
    
    df = pandas.DataFrame({
        "left": [
            "One",
            "Two",
            "Three"
        ],
        "center": [
            "email1emailcom email2emailcom email3emailcom",
            "email1emailcom email15emailcom email12emailcom",
            "email15emailcom email16emailcom foo@amazon.com"
        ]
    })
    
    ## ---------------------
    ## Construct a closure that will use a set to keep track of distinct emails
    ## and return work function that will filter out duplicates
    ## from the emails string passed to it.
    ## Note the addition of an init parameter to allow for filtering out
    ## emails from a specific domain.
    ## ---------------------
    def make_email_filter(discard_domain=""):
    
        distinct_emails = set()  ## keep track of distinct emails
    
        ## ---------------------
        ## Work function that will filter out duplicates including in-row
        ## and cross-row duplicates
        ## ---------------------
        def _work(emails_string):
            these_emails = []
            for email in emails_string.split():
    
                ## ---------------------
                ## Ignore this email if it is from a bad domain
                ## ---------------------
                if discard_domain and email.endswith(discard_domain):
                    continue
                ## ---------------------
    
                ## ---------------------
                ## Ignore this email if we have already seen it
                ## ---------------------
                if email in distinct_emails:
                    continue
                ## ---------------------
    
                distinct_emails.add(email)
                these_emails.append(email)
            return " ".join(these_emails)
        ## ---------------------
    
        ## ---------------------
        ## Return our work function back to the caller
        ## ---------------------
        return _work
        ## ---------------------
    ## ---------------------
    
    ## ---------------------
    ## Update our column based on application of the work function
    ## ---------------------
    df["center"] = df["center"].apply(make_email_filter(discard_domain="@amazon.com"))
    ## ---------------------
    
    print(df)
    

    Again giving us:

    0    One  email1emailcom email2emailcom email3emailcom
    1    Two               email15emailcom email12emailcom
    2  Three                               email16emailcom