Vectorizing a function to use an entire dataframe column instead of a single value


I have a function that sets colors. Currently, I loop through a dataframe, pass a single value to the function, look up the color that corresponds to that value, and return the color. I now want to pass the entire column from the dataframe (instead of looping through the dataframe) and return an array of color values.

Here is a simplified version of the function that currently works passing a single value (I just set the single value instead of showing the entire loop through the dataframe):

    import pandas as pd

    def set_LineQualityColor(LineQ):
        # Lookup table pairing each color name (CR) with its LineQuality code
        data = [['grey', 0], ['cornflowerblue', 1], ['lightgreen', 2], ['seagreen', 3],
                ['mistyrose', 4], ['lightcoral', 4.1], ['rosybrown', 5], ['indianred', 5.1],
                ['lightgray', 9]]
        df = pd.DataFrame(data, columns=['CR', 'LineQuality'])
        # Select the row whose LineQuality equals LineQ and take its color
        c = df[df['LineQuality'] == LineQ]['CR'].values[0]
        return c

    LQ = 4
    c = set_LineQualityColor(LQ)

How can I get this to work correctly when LineQ is a column from a dataframe? i.e.

    c = set_LineQualityColor(df.LQ)

Or is there a more efficient way to go about doing this? I'm new to Python. Thanks.


Solution

  • Set LineQuality as the index.

    data = [['grey', 0], ['cornflowerblue', 1], ['lightgreen', 2],['seagreen', 3], 
                ['mistyrose', 4], ['lightcoral', 4.1],['rosybrown', 5], ['indianred', 5.1], 
                ['lightgray', 9]]
    
    df = pd.DataFrame(data, columns = ['CR', 'LineQuality'])
    df.set_index(['LineQuality'], drop=True, inplace=True)
    

    Which gives this dataframe:

                             CR
    LineQuality                
    0.0                    grey
    1.0          cornflowerblue
    2.0              lightgreen
    3.0                seagreen
    4.0               mistyrose
    4.1              lightcoral
    5.0               rosybrown
    5.1               indianred
    9.0               lightgray
    

    Then look up the values using loc.

    LQ_df = pd.DataFrame([1, 5, 4, 9, 4.1, 0, 4.0], columns=['LQ'])
    
    LQ = LQ_df['LQ']
    
    df.loc[LQ, 'CR']
    

    Which gives this series:

    LineQuality
    1.0    cornflowerblue
    5.0         rosybrown
    4.0         mistyrose
    9.0         lightgray
    4.1        lightcoral
    0.0              grey
    4.0         mistyrose
    

    Rebuilding the df dataframe every time you call the function is wasted work, so it's better to create it once, outside the function. Then you can define the function to use .loc on that lookup table, like we did before:

    data = [['grey', 0], ['cornflowerblue', 1], ['lightgreen', 2],['seagreen', 3], 
                ['mistyrose', 4], ['lightcoral', 4.1],['rosybrown', 5], ['indianred', 5.1], 
                ['lightgray', 9]]
    
    lineq_color_lookup = pd.DataFrame(data, columns = ['CR', 'LineQuality'])
    lineq_color_lookup.set_index(['LineQuality'], drop=True, inplace=True)
    
    def get_LineQualityColor(LineQ):
        return lineq_color_lookup.loc[LineQ, 'CR'] # .tolist() if you want it as a list
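
As a usage sketch for get_LineQualityColor, here is one way to write the colors back onto the dataframe that holds the LQ column. The column name 'color' is an assumption made for illustration. Note that the Series returned by .loc is indexed by LineQuality, not by the row labels of LQ_df, so assigning it directly would trigger index alignment instead of a positional assignment. Converting to a plain array first (or a list, via .tolist()) avoids that.

```python
import pandas as pd

data = [['grey', 0], ['cornflowerblue', 1], ['lightgreen', 2], ['seagreen', 3],
        ['mistyrose', 4], ['lightcoral', 4.1], ['rosybrown', 5], ['indianred', 5.1],
        ['lightgray', 9]]

# Build the lookup table once, indexed by LineQuality
lineq_color_lookup = pd.DataFrame(data, columns=['CR', 'LineQuality'])
lineq_color_lookup = lineq_color_lookup.set_index('LineQuality')

def get_LineQualityColor(LineQ):
    # Label-based lookup of one color per entry in LineQ
    return lineq_color_lookup.loc[LineQ, 'CR']

LQ_df = pd.DataFrame({'LQ': [1, 5, 4, 9, 4.1, 0, 4.0]})

# Drop the LineQuality index so the values are assigned by position.
# 'color' is a hypothetical column name.
LQ_df['color'] = get_LineQualityColor(LQ_df['LQ']).to_numpy()
```

Every value in LQ has to exist in the lookup index for this to work; a value with no matching LineQuality makes .loc raise a KeyError.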
    

    I also changed the function name to get_LineQualityColor because the function doesn't set anything -- it only returns the color corresponding to the given LineQuality.
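
A different technique that may also fit here is Series.map with a plain dictionary, rather than the .loc lookup used above. This is only a sketch: the dictionary name color_by_quality, the column name 'color', and the unmapped value 7 are assumptions added for illustration. The result keeps the original row index, and values without a matching key come back as NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical dictionary form of the same lookup table
color_by_quality = {0: 'grey', 1: 'cornflowerblue', 2: 'lightgreen', 3: 'seagreen',
                    4: 'mistyrose', 4.1: 'lightcoral', 5: 'rosybrown', 5.1: 'indianred',
                    9: 'lightgray'}

# The last value (7) has no entry in the dictionary
LQ_df = pd.DataFrame({'LQ': [1, 5, 4, 9, 4.1, 0, 7]})

# map looks up every element of the column; unknown keys become NaN
LQ_df['color'] = LQ_df['LQ'].map(color_by_quality)
```

Whether NaN or a KeyError is the better response to an unknown LineQuality depends on how strict the lookup should be.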