
Rename duplicates accordingly in two Pandas DataFrames


I have two Pandas DataFrames: df_header and df_species.

The first contains, for each row, information about a vegetation plot (longitude, latitude, country, ...), and the second contains, for each row, information about a species (name, family, genus, ...). Both have a column named RELEVE_NR, which is the ID of the vegetation plot and indicates which species was found in which plot. Here is a small example of the two DataFrames, with a reduced number of columns for simplicity:

df_header:

RELEVE_NR Longitude Latitude
12345 42.23 23.43
12346 36.85 39.25
... ... ...

df_species:

RELEVE_NR Species Family
12345 Agrostis stolonifera Poaceae
12345 Paspalum vaginatum Poaceae
12345 Scirpus lacustris Cyperaceae
12346 Phragmites australis Poaceae
... ... ...

As you can see, in a perfect world, each row of df_header should have a distinct value in the column RELEVE_NR, while the same value can appear several times in the column RELEVE_NR of df_species (each vegetation plot has a unique ID, but each plot contains several species observations).

However, it happens that the same value of RELEVE_NR appears in several rows of df_header (I won't go into the details of why, but it can and does happen). For these cases, here is what I have to do but can't find how:

  • except for the first time it appears, rename the duplicated values of RELEVE_NR in df_header (with values that don't exist in the dataset, obviously). For example, if three rows have the value 12345 in the column RELEVE_NR, we could rename the second one 5001 and the third one 5002 (if we are sure that no other vegetation plot has the ID 5001 or 5002).

  • duplicate each row of df_species that has this duplicated value, once for every additional time the value appeared in df_header. For example, if the vegetation plot 12345 had 10 rows associated with it in the df_species DataFrame, we should duplicate each of them twice in order to have 30 rows. The first 10 keep the ID 12345, the following 10 get 5001 in the column RELEVE_NR, and the last 10 get 5002.

Quick example: This df_header:

RELEVE_NR Longitude Latitude
12345 42.23 23.43
12346 36.85 39.25
12346 36.85 39.25
12347 29.34 13.45
... ... ...

Should become this df_header (the RELEVE_NR value of the second row with the value 12346 is replaced by a new number that isn't assigned to any other row):

RELEVE_NR Longitude Latitude
12345 42.23 23.43
12346 36.85 39.25
12348 36.85 39.25
12347 29.34 13.45
... ... ...

And this associated df_species:

RELEVE_NR Species Family
12345 Sorbus aucuparia Rosaceae
12346 Agrostis stolonifera Poaceae
12346 Paspalum vaginatum Poaceae
12346 Scirpus lacustris Cyperaceae
12347 Phragmites australis Poaceae
12347 Paspalum vaginatum Adoxaceae
... ... ...

Should become this df_species (we duplicated the 3 rows having the value 12346 in the column RELEVE_NR and we changed the values of the column RELEVE_NR for the duplicated rows to 12348):

RELEVE_NR Species Family
12345 Sorbus aucuparia Rosaceae
12346 Agrostis stolonifera Poaceae
12346 Paspalum vaginatum Poaceae
12346 Scirpus lacustris Cyperaceae
12348 Agrostis stolonifera Poaceae
12348 Paspalum vaginatum Poaceae
12348 Scirpus lacustris Cyperaceae
12347 Phragmites australis Poaceae
12347 Paspalum vaginatum Adoxaceae
... ... ...

I hope it is clear. Thanks for your help!


Solution

  • Example Code

    import pandas as pd
    data1 = {'RELEVE_NR': [12345, 12346, 12346, 12347],
             'Longitude': [42.23, 36.85, 36.85, 29.34],
             'Latitude': [23.43, 39.25, 39.25, 13.45]}
    data2 =  {'RELEVE_NR': [12345, 12346, 12346, 12346, 12347, 12347], 
              'Species': ['Sorbus aucuparia', 'Agrostis stolonifera', 'Paspalum vaginatum', 'Scirpus lacustris', 'Phragmites australis', 'Paspalum vaginatum'], 
              'Family': ['Rosaceae', 'Poaceae', 'Poaceae', 'Cyperaceae', 'Poaceae', 'Adoxaceae']}
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    

    For convenience, I will call df_header df1 and df_species df2.

    Code

    Step1

    Take the cumulative sum of the duplicated mask and add it to the maximum value of RELEVE_NR. Each duplicated row then gets a new ID that is larger than every existing one, so it cannot collide with an ID already in use.

    cond1 = df1['RELEVE_NR'].duplicated()
    s1 = df1['RELEVE_NR'].mask(cond1, cond1.cumsum().add(df1['RELEVE_NR'].max()))
    df1.assign(RELEVE_NR=s1)
    

    output:

       RELEVE_NR  Longitude  Latitude
    0      12345      42.23     23.43
    1      12346      36.85     39.25
    2      12348      36.85     39.25
    3      12347      29.34     13.45
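
To see how Step 1 behaves when an ID appears more than twice, here is a minimal sketch on a hypothetical header in which 12346 occurs three times. The frame and the names is_repeat and new_ids are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical header: the ID 12346 appears three times.
df_header = pd.DataFrame({
    'RELEVE_NR': [12345, 12346, 12346, 12347, 12346],
    'Longitude': [42.23, 36.85, 36.85, 29.34, 36.85],
    'Latitude': [23.43, 39.25, 39.25, 13.45, 39.25],
})

# True for every occurrence of an ID after its first one.
is_repeat = df_header['RELEVE_NR'].duplicated()

# The running count of repeats (1, 2, ...) added to the current maximum ID
# yields values that are larger than every existing ID.
new_ids = df_header['RELEVE_NR'].mask(
    is_repeat, is_repeat.cumsum() + df_header['RELEVE_NR'].max()
)

print(new_ids.tolist())  # [12345, 12346, 12348, 12347, 12349]
```

Each repeat receives its own fresh ID (12348, then 12349), while the first occurrence of every ID is left unchanged.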
    

    Step2

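    First, concatenate df1['RELEVE_NR'] and s1 horizontally to create a lookup DataFrame (df3) that maps each original ID to its new ID. Then merge df2 with df3 on RELEVE_NR: a species row whose ID occurs n times in df1 is matched n times, which duplicates it. The rest of the chain replaces RELEVE_NR with the new ID and sorts the rows into the same order as df1.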

    df3 = pd.concat([df1['RELEVE_NR'], s1], axis=1).set_axis(['RELEVE_NR', 'key'], axis=1)
    df2.merge(df3, how='left')\
       .assign(RELEVE_NR=lambda x: x.pop('key'))\
       .sort_values('RELEVE_NR', key=lambda x: x.map(pd.Series(s1.index, index=s1.values)))\
       .reset_index(drop=True)
    

    output:

       RELEVE_NR               Species      Family
    0      12345      Sorbus aucuparia    Rosaceae
    1      12346  Agrostis stolonifera     Poaceae
    2      12346    Paspalum vaginatum     Poaceae
    3      12346     Scirpus lacustris  Cyperaceae
    4      12348  Agrostis stolonifera     Poaceae
    5      12348    Paspalum vaginatum     Poaceae
    6      12348     Scirpus lacustris  Cyperaceae
    7      12347  Phragmites australis     Poaceae
    8      12347    Paspalum vaginatum   Adoxaceae
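
The two steps can also be wrapped into one function. The following is only a sketch: the function name split_duplicate_plots, its id_col parameter, and the intermediate names are hypothetical, and it assumes that every RELEVE_NR in the species table also exists in the header table. It passes kind='stable' to sort_values, a small departure from the code above, so that species rows sharing an ID keep their original relative order.

```python
import pandas as pd

def split_duplicate_plots(df_header, df_species, id_col='RELEVE_NR'):
    """Hypothetical helper: give repeated plot IDs fresh IDs and copy their species rows."""
    # Step 1: new IDs for every repeated occurrence (max ID + running count).
    is_repeat = df_header[id_col].duplicated()
    new_ids = df_header[id_col].mask(
        is_repeat, is_repeat.cumsum() + df_header[id_col].max()
    )

    # Lookup table: one row per header row, original ID -> ID it ends up with.
    lookup = pd.DataFrame({id_col: df_header[id_col].to_numpy(),
                           'new_id': new_ids.to_numpy()})

    # Step 2: the merge repeats each species row once per matching header row.
    species = df_species.merge(lookup, on=id_col, how='left')
    species[id_col] = species.pop('new_id')

    # Order the species rows like the header; the stable sort keeps the
    # original species order within each plot.
    header_pos = pd.Series(range(len(new_ids)), index=new_ids.to_numpy())
    species = (species
               .sort_values(id_col, key=lambda s: s.map(header_pos), kind='stable')
               .reset_index(drop=True))
    return df_header.assign(**{id_col: new_ids}), species


df1 = pd.DataFrame({'RELEVE_NR': [12345, 12346, 12346, 12347],
                    'Longitude': [42.23, 36.85, 36.85, 29.34],
                    'Latitude': [23.43, 39.25, 39.25, 13.45]})
df2 = pd.DataFrame({'RELEVE_NR': [12345, 12346, 12346, 12346, 12347, 12347],
                    'Species': ['Sorbus aucuparia', 'Agrostis stolonifera',
                                'Paspalum vaginatum', 'Scirpus lacustris',
                                'Phragmites australis', 'Paspalum vaginatum'],
                    'Family': ['Rosaceae', 'Poaceae', 'Poaceae', 'Cyperaceae',
                               'Poaceae', 'Adoxaceae']})

new_header, new_species = split_duplicate_plots(df1, df2)
print(new_header)
print(new_species)
```

On the example data this should reproduce the two outputs shown above. If some species rows refer to an ID that is missing from the header, the left merge leaves their RELEVE_NR empty (NaN), so that case would need separate handling.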