I have two Pandas DataFrames: df_header and df_species.
The first contain for each row the information about a vegetation plot (longitude, latitude, country, ...) and the second one contain for each row the information about a species (name, family, genus, ...). They both have a column named RELEVE_NR, which is the ID of the vegetation plot and indicates which species was found in which vegetation plot. Here is a small example of the two DataFrames to see what they look like, with a reduced number of column for simplification:
df_header:
RELEVE_NR | Longitude | Latitude |
---|---|---|
12345 | 42.23 | 23.43 |
12346 | 36.85 | 39.25 |
... | ... | ... |
df_species:
RELEVE_NR | Species | Family |
---|---|---|
12345 | Agrostis stolonifera | Poaceae |
12345 | Paspalum vaginatum | Poaceae |
12345 | Scirpus lacustris | Cyperaceae |
12346 | Phragmites australis | Poaceae |
... | ... | ... |
As you can see, in a perfect world, I should have one different value in the column RELEVE_NR per row of df_header, but several time the same value in the column RELEVE_NR per row of df_species (we can imagine that each vegetation plot has a unique ID, but that each vegetation plot contains several observations of species).
However, it happens that we have several time the same value in the column df_header for several rows (I won't go into the details of why, but it can and it does happen). For these cases, here is what I have to do but can't find how:
except for the first time it appears, rename the duplicated values of RELEVE_NR in df_header (with values that don't exist in the dataset obviously). For example, if we have three rows that have the value 12345 in the column RELEVE_NR, we could rename the second one 5001 and the second 5002 (if we are sure that no other vegetation plots have the IDs 5001 nor 5002).
duplicate each row that has this duplicated value in df_species the same number of time the value appeared. For example, if the vegetation plot 12345 had 10 values associated in the df_species DataFrame, we should duplicate each of them twice in order to have 30 values. We should let the first 10 with the ID 12345, then put the number 5001 in the column RELEVE_NR for the following ten and finally the number 5002 for the last ten.
Quick example: This df_header:
RELEVE_NR | Longitude | Latitude |
---|---|---|
12345 | 42.23 | 23.43 |
12346 | 36.85 | 39.25 |
12346 | 36.85 | 39.25 |
12347 | 29.34 | 13.45 |
... | ... | ... |
Should become this df_header (we changed the value of the column RELEVE_NR of the second row having the value 12346 by a new number that isn't attributed to another row):
RELEVE_NR | Longitude | Latitude |
---|---|---|
12345 | 42.23 | 23.43 |
12346 | 36.85 | 39.25 |
12348 | 36.85 | 39.25 |
12347 | 29.34 | 13.45 |
... | ... | ... |
And this associated df_species:
RELEVE_NR | Species | Family |
---|---|---|
12345 | Sorbus aucuparia | Rosaceae |
12346 | Agrostis stolonifera | Poaceae |
12346 | Paspalum vaginatum | Poaceae |
12346 | Scirpus lacustris | Cyperaceae |
12347 | Phragmites australis | Poaceae |
12347 | Paspalum vaginatum | Adoxaceae |
... | ... | ... |
Should become this df_species (we duplicated the 3 rows having the value 12346 in the column RELEVE_NR and we changed the values of the column RELEVE_NR for the duplicated rows to 12348):
RELEVE_NR | Species | Family |
---|---|---|
12345 | Sorbus aucuparia | Rosaceae |
12346 | Agrostis stolonifera | Poaceae |
12346 | Paspalum vaginatum | Poaceae |
12346 | Scirpus lacustris | Cyperaceae |
12348 | Agrostis stolonifera | Poaceae |
12348 | Paspalum vaginatum | Poaceae |
12348 | Scirpus lacustris | Cyperaceae |
12347 | Phragmites australis | Poaceae |
12347 | Paspalum vaginatum | Adoxaceae |
... | ... | ... |
I hope it is clear. Thanks for your help!
Example Code
import pandas as pd
data1 = {'RELEVE_NR': [12345, 12346, 12346, 12347],
'Longitude': [42.23, 36.85, 36.85, 29.34],
'Latitude': [23.43, 39.25, 39.25, 13.45]}
data2 = {'RELEVE_NR': [12345, 12346, 12346, 12346, 12347, 12347],
'Species': ['Sorbus aucuparia', 'Agrostis stolonifera', 'Paspalum vaginatum', 'Scirpus lacustris', 'Phragmites australis', 'Paspalum vaginatum'],
'Family': ['Rosaceae', 'Poaceae', 'Poaceae', 'Cyperaceae', 'Poaceae', 'Adoxaceae']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
For convenience, I will set df1 and df2.
Code
Step1
If you take the cumulative sum of the duplicated part and add it to the maximum value, the duplicated number will not be assigned.
cond1 = df1['RELEVE_NR'].duplicated()
s1 = df1['RELEVE_NR'].mask(cond1, cond1.cumsum().add(df1['RELEVE_NR'].max()))
df1.assign(RELEVE_NR=s1)
output:
RELEVE_NR Longitude Latitude
0 12345 42.23 23.43
1 12346 36.85 39.25
2 12348 36.85 39.25
3 12347 29.34 13.45
Step2
First, concatenate df1['RELEVE_NR'] and s1 horizontally to create a single DataFrame (df3). Then, merge df2 with df3. The remaining code is minor code to create an accurate output.
df3 = pd.concat([df1['RELEVE_NR'], s1], axis=1).set_axis(['RELEVE_NR', 'key'], axis=1)
df2.merge(df3, how='left')\
.assign(RELEVE_NR=lambda x: x.pop('key'))\
.sort_values('RELEVE_NR', key=lambda x: x.map(pd.Series(s1.index, index=s1.values)))\
.reset_index(drop=True)
output:
RELEVE_NR Species Family
0 12345 Sorbus aucuparia Rosaceae
1 12346 Agrostis stolonifera Poaceae
2 12346 Paspalum vaginatum Poaceae
3 12346 Scirpus lacustris Cyperaceae
4 12348 Agrostis stolonifera Poaceae
5 12348 Paspalum vaginatum Poaceae
6 12348 Scirpus lacustris Cyperaceae
7 12347 Phragmites australis Poaceae
8 12347 Paspalum vaginatum Adoxaceae