Search code examples
pythonpandasstringfunctionconcatenation

Concatenate rows based on a condition in Python


In this table, all columns are strings. If there is a string == "nan" in column Title, I would like to concatenate all the rows with nan string with the last values in Title != "nan." For example

From this:

Title Content Measure
Background text1 Measure 1
Method abc Measure 1
nan dfg Measure 1
Background text2 Measure 2
Method abcdfg Measure 2
Background text3 Measure 3
Method ab Measure 3
nan cd Measure 3
nan fg Measure 3
# code to create original table
data = {'Title': ['Background', 'Method', 'nan', 'Background', 'Method', 'Background', 'Method', 'nan', 'nan'], 'Content': ['text1', 'abc', 'dfg', 'text2', 'abcdfg', 'text3', 'ab', 'cd', 'fg'], 'Measure': ['Measure1', 'Measure1', 'Measure1', 'Measure2', 'Measure2', 'Measure3', 'Measure3', 'Measure3', 'Measure3',]} 

df = pd.DataFrame(data)  

df

To this:

Title Content Measure
Background text1 Measure 1
Method abcdfg Measure 1
Background text2 Measure 2
Method abcdfg Measure 2
Background text3 Measure 3
Method abcdfg Measure 3

I need a function because my real data set has multiple rows that need to be fixed.


Solution

  • you can use:

    df['last_nan_value']=df['Title'].fillna(method='ffill') #fill nans with last nan value in new column
    
    df['id'] = ((df.last_nan_value != df.last_nan_value.shift())).cumsum() #assing an id for each title until title change
    
    print(df)
    '''
        Title       Content Measure     last_nan_value  id
    0   Background  text1   Measure 1   Background      1
    1   Method      abc     Measure 1   Method          2
    2   nan         dfg     Measure 1   Method          2
    3   Background  text2   Measure 2   Background      3
    4   Method      abcdfg  Measure 2   Method          4
    5   Background  text3   Measure 3   Background      5
    6   Method      ab      Measure 3   Method          6
    7   nan         cd      Measure 3   Method          6
    8   nan         fg      Measure 3   Method          6
    
    
    '''
    
    #group by id and use agg functions
    dfx=df.groupby('id').agg({'Title':'first','Content':'sum','Measure':'first'})
    dfx
    
    id  Title       Content Measure
    1   Background  text1   Measure 1
    2   Method      abcdfg  Measure 1
    3   Background  text2   Measure 2
    4   Method      abcdfg  Measure 2
    5   Background  text3   Measure 3
    6   Method      abcdfg  Measure 3