Search code examples
pythonpandasnumpyopenstreetmap

Create column based on row data when column doesn't exist or column is NaN in pandas


I have a dataframe from OSM data. In this I got everything but the colour column in my area. However in other areas the column may exist. Now I want to create the column if it is missing by providing calculated colors and also want to replace any NaN values with a color code when the column exists but a row has no color value yet.

TLDR: How do I create a colum if needed and otherwise map NaN otherwise?

I already tried just doing:

import random
def setColor(_):
    r = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r(),r(),r())



lines.loc[lines['colour'].isnull(),'colour'] = lines["colour"].map(setColor)

However this fails if colour doesnt exist initially.

I could run lines["colour"] = np.nan first but while that works for empty colums this doesn't work for the case when the column already partially exists. So I wonder if there is a better way.


Solution

  • It's not fully clear what you want, but maybe this is close.

    Given df1 and df2:

    import pandas as pd
    import numpy as np
    import random
    
    df1 = pd.DataFrame({'Col_01': ['x', 'y', 'z']})
    df2 = pd.DataFrame({'Col_01': ['x', 'y', 'z'], 'colour': ['#D30000', '#C21807', '']})
    
    print("df1:\n", df1)
    print("df2:\n", df2)
    

    Console output:

    df1:
       Col_01
    0      x
    1      y
    2      z
    df2:
       Col_01   colour
    0      x  #D30000
    1      y  #C21807
    2      z
    

    With a slight change to your function (removing argument) and looping through all dataframes:

    def setColor(): # change: remove the "_" here
        r = lambda: random.randint(0, 255)
        return '#%02X%02X%02X' % (r(),r(),r())
    
    for df in [df1, df2]:
        if "colour" not in df:
            df["colour"] = df.apply(lambda x: setColor(), axis=1)
        else:
            df["colour"] = np.where(df["colour"] == '', setColor(), df["colour"])
    
    print("df1:\n", df1)
    print("df2:\n", df2)
    

    Console output:

    df1:
       Col_01   colour
    0      x  #C0ACB3
    1      y  #1FA09E
    2      z  #4A35FF
    df2:
       Col_01   colour
    0      x  #D30000
    1      y  #C21807
    2      z  #D97652
    

    It's probably self-explanatory, but the loop first looks to see if the colour column exists; if not, it adds it and creates a hex code for each row. Otherwise, if the column exists, it uses np.where() to create a hex code for blank rows, otherwise keeping hex code if it's there.