Getting the edgecolors to be dependent on the size of the circle in the scatter() plot


I am trying to make a scatter plot in which the edge colour of each bubble depends on the bubble's size. To do this I build a list of colours (outliner_colors) from a pandas.core.series.Series of sizes and pass it to the edgecolors argument of the scatter() function (sorry if some terminology is wrong, I am quite new to this!).

The issue is that even where bubble_size_filtered is less than 700,000, I still get outlines on the circles in my graph, despite the fourth value in the RGBA tuple being 0, which I believe should make the outline transparent. Interestingly, when I set the limit to 7,000,000 (no value in bubble_size_filtered is above this), none of the bubbles have outlines. So I think the (0, 0, 0, 0) tuple does work to erase the outlines, but for some reason the selection by bubble size doesn't. bubble_size_filtered is a pandas.core.series.Series (checked with type()).

I have done print(bubble_size_filtered) and get:

0        120.0
1       2000.0
2       5000.0
3       3000.0
4        360.0
         ...
2042      21.0
2044      15.0
2045      85.0
2046     100.0
2047      36.0

So clearly some of the values are lower than 700,000... I don't know what is happening.

See here what (I think!) are the relevant pieces of code:

    outliner_colors = []
    for size in bubble_size_filtered:
        if size > 700000:
            # Append "black" if size > 700000
            outliner_colors.append("black")
        else:
            # Append RGBA tuple (0, 0, 0, 0) if size <= 700000
            outliner_colors.append((0, 0, 0, 0))

    ax.scatter(xvar_filtered, yvar_filtered, s=scaled_dot_size, c=dot_colors, edgecolors=outliner_colors, alpha=0.5-large_ghosting_scale)

And here is my full code, if useful (or at least the code for the plotting function):

def plot_cdr_graph_without_mcsps(dataframe, xvar, yvar, bubble_size, removed_categories, threshold_bubble_size):
    # Convert 'bubble_size' to numeric
    dataframe[bubble_size] = pd.to_numeric(dataframe[bubble_size], errors="coerce")
    
    # Converting the yvar to numeric
    dataframe[yvar] = pd.to_numeric(dataframe[yvar], errors="coerce")
    
    # Generate data for the prices per credit, if the user has inputted the price as their yvar and tons purchased as the bubble size.
    if yvar == "price_usd":    
        dataframe[yvar] = dataframe[yvar]/dataframe[bubble_size]
        # I am now replacing the inf values with a large number, before rounding them to that large number. When I say large number, I mean one with lots of digits.
        # We need to change this - I think I was wrong in thinking that the .33333333333s were being represented as infs. perhaps there is another reason we have an inf value.
        # Currently, I am just replacing the inf value with 1e10 which ruins the graph, (although it at least allows the graph to be plotted).
        # dataframe[yvar].replace([np.inf, -np.inf], 1e10, inplace=True)
        # dataframe[yvar] = dataframe[yvar].round()
    else:
        print("This function does not facilitate a scenario in which the yvar is not price!")

    # Filter out rows with 'None' values in 'xvar'
    dataframe = dataframe.dropna(subset=[xvar])
    
    # Filter out rows with 'None' values in 'yvar'
    dataframe = dataframe.dropna(subset=[yvar])

    # If the xvar variable is announcement date, then do this to format it correctly.
    if xvar == "announcement_date":
        dataframe[xvar] = pd.to_datetime(dataframe[xvar]) # change to datetime format.
        dataframe[xvar] = dataframe[xvar].dt.strftime('%Y-%m-%d %H:%M:%S')  # Convert datetime to string
        dataframe[xvar] = dataframe[xvar].apply(lambda x: x[:10])
        print(f"These are dates: {dataframe[xvar]}")
    
    # This is done to remove the nan values and replace with "Unspecified", which will be colour coded as grey in the colour dict.
    mask = pd.isnull(dataframe["method"])
    dataframe.loc[mask, "method"] = "Unspecified"

    #Reset the index to make it equal to the rows in the dataframe.
    dataframe.reset_index(drop=True, inplace=True)

    # Filter out rows with excluded ('BECCS') method and non-None values in bubble_size and non-None values in yvar. This is designed to make all the columns the same length.
    mask = (~dataframe[xvar].isin(removed_categories)) & (~dataframe[xvar].isnull()) & (~dataframe[bubble_size].isnull()) & (dataframe[bubble_size] != 0) & (dataframe[bubble_size] >= threshold_bubble_size)
    xvar_filtered = dataframe.loc[mask, xvar]
    bubble_size_filtered = dataframe.loc[mask, bubble_size]
    yvar_filtered = dataframe.loc[mask, yvar]
    # Some info on what we are doing here -----------------------------------
        # mask is a pandas Series of booleans, indexed like the dataframe. Above I have reset the index of the dataframe to make it
        # correspond to the rows of the dataframe. The mask series has roughly 320 True rows (21/02/2024). This is the number of rows with info on
        # xvar, yvar and bubble_size that are neither in removed_categories nor below threshold_bubble_size.
        # The ~ sign may look confusing. It is the element-wise logical NOT operator in pandas, so it FLIPS the True and False values.
    # -----------------------------------------------------------------------

    # Scale the size of all the dots by the number of tonnes purchased.
    scaling_factor = 0.12
    scaled_dot_size = bubble_size_filtered * scaling_factor

    # Dot colours by CDR Methodology. This is designed to always correspond to CDR methodology, and not change as the axes of the graph change.
    cdr_method = dataframe.loc[mask, "method"]
    cdr_colors = {"Biochar": "black", 
                  "Enhanced Weathering": "blue", 
                  "Mineralization": "#987F18", 
                  "Biomass Removal": "#0a7d29", 
                  "DAC": "purple", 
                  "Biooil": "orange", 
                  "Direct Ocean Removal": "#55B7B4", 
                  "Microalgae": "#589F39",
                  "Macroalgae": "lime",
                  "Ocean Alkalinity Enhancement": "navy",
                  "BECCS": "sienna",
                  "Unspecified": "dimgrey"} # We need this at the end, because sometimes we change xvar to announcement_date or something. Therefore the mask won't work on rows with no listed "method". I will colour this the same as Unspecified.
    dot_colors = [cdr_colors[method] for method in cdr_method]


    # This clever bit of code is designed to scale the transparency of the bubble to the size of the bubble - larger bubbles that cover others will therefore be less obstructive. We have hard-coded 3000000 as the upper limit for tons_purchased, as we know the largest Microsoft one in the database is under this. May need to change in future!
    large_ghosting_scale = 0.4*(bubble_size_filtered/3000000)
    print(f"ifdhbjsdkf: {type(large_ghosting_scale)}")

    # This bit of code is supposed to define whether a bubble has an outline or not.
    # Still more work needed here. I don't think the size number here corresponds to the bubble size really.
    #outliner_colors = bubble_size_filtered.apply(lambda size: "black" if size > 700000 else (0, 0, 0, 0)) # This is a very obscure bit of code to find. So you need to actually use an RGBA colour code. This is a tuple of three or four numbers. (R, G, B, A). Red, Green, Blue components along with alpha for transparency. For some reason they were not letting me use "none" for no outline.
    

    # Assuming bubble_size_filtered is your pandas.core.series.Series
    outliner_colors = []

    # Iterate through the values in bubble_size_filtered
    for size in bubble_size_filtered:
        if size > 700000:
            # Append "black" if size > 700000
            outliner_colors.append("black")
        else:
            # Append RGBA tuple (0, 0, 0, 0) if size <= 700000
            outliner_colors.append((0, 0, 0, 0))


        
    # This clever little function is responsible for taking in strings which correspond to the human input (which are the columns in the csv, and changing them to labels)
    def human_input_to_labels(input):
        if input == "method":
            return "CDR Method"
        elif input == "tons_purchased":
            return "No. tCO2c in order"
        elif input == "price_usd":
            return "Price per credit (USD/tCO2c)"
        elif input == "announcement_date":
            return "Date of Purchase Order Announcement"
        else:
            pass
        
    # PLOTTING FUNCTION Plot the graph only if both series have the same length
    if len(xvar_filtered) == len(yvar_filtered) == len(bubble_size_filtered):
        
        text_color = "#E5E5E5"
        background_colour = "#565656"
        chart_colour = "#C8C8C8"
        axes_widths = 1.2
        
        fig, ax = plt.subplots()
        ax.scatter(xvar_filtered, yvar_filtered, s=scaled_dot_size, c=dot_colors, edgecolors=outliner_colors, alpha=0.5-large_ghosting_scale) # Note that the xvar_filtered, yvar_filtered, scaled_dot_size are pandas.core.series.Series's while the dot_colors is a list. The large_ghosting_scale is also a pandas.core.series.Series.
        ax.set_xlabel(human_input_to_labels(xvar), fontweight="bold", fontname="Gill Sans MT", color=text_color)
        ax.set_ylabel(human_input_to_labels(yvar), fontweight="bold", fontname="Gill Sans MT", color=text_color)
        ax.set_title("CDR Graph: Market Carbon Credit Prices vs. CyanoCapture\nMinimum Credit Selling Prices", fontweight="bold", fontname="Gill Sans MT", color=text_color, fontsize=15)
        ax.tick_params(axis="x", colors=text_color, labelrotation=0, labelsize=8) # Note that color will only change tick colour, while colors will change both tick and label colours.
        ax.tick_params(axis="y", colors=text_color)
        ax.grid(True, color="black", alpha=0.2)
        ax.spines['bottom'].set_linewidth(axes_widths)  # Set thickness of the bottom axis
        ax.spines["bottom"].set_color(text_color)  
        ax.spines['left'].set_linewidth(axes_widths)    # Set thickness of the left axis
        ax.spines["left"].set_color(text_color) 
        ax.spines['top'].set_linewidth(0)     # Set thickness of the top axis
        ax.spines['right'].set_linewidth(0)   # Set thickness of the right axis
        ax = plt.gca()
        for tick in ax.get_xticklabels():
            tick.set_fontweight('bold')
        for tick in ax.get_yticklabels():
            tick.set_fontweight('bold')
        ax.invert_xaxis() # For some reason when time on x axis, wrong way round. This fixes that
        ax.axhline(y=0, color=background_colour, linestyle='--', linewidth=1) # Adding a line at the y axis
        max_credit_price = np.nanmax(yvar_filtered)
        ax.set_ylim(-100, max_credit_price)
        ax.set_facecolor(chart_colour)
        ax.xaxis.labelpad = 55 # This is designed to space out the x axis label from the x axis data labels:
        plt.subplots_adjust(bottom=0.4)
        fig.patch.set_facecolor(background_colour) # HexDec code for dark grey.
        
        
        
        # This is designed to space out the labels along the x axis:
        if xvar == "announcement_date":
            custom_tick_positions = range(0, len(xvar_filtered), 8)  # Tick at every 8th position
            ax.xaxis.set_major_locator(FixedLocator(custom_tick_positions))
        else:
            pass

        # LEGEND FORMATTING ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        for method, color in cdr_colors.items():
            plt.scatter([], [], color=color, label=method)
        # Customize legend
        plt.legend(title='CDR Method', loc='upper left', fontsize='small')
        # This puts the legend as a horizontal rectangle at the bottom of the plot. Change the second bbox_to_anchor value to change how far below the plot the legend sits.
        plt.legend(loc='lower right', bbox_to_anchor=(0.5, -0.60), ncol=3, fancybox=True)
        # ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        plt.show()

    else:
        print("Error: Length of 'xvar_filtered', 'bubble_size_filtered' and 'yvar_filtered' are not the same.")

Solution

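  • Your alpha= argument is most likely overriding the alpha values in edgecolors=. When alpha= is set, matplotlib applies it to the edge colours as well as the face colours, so the 0 in (0, 0, 0, 0) gets replaced and the outline becomes visible again. Try removing the alpha= argument or setting it to None - that should allow edgecolors= to control the outline alphas instead.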
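
One way to act on this, as a minimal sketch rather than a drop-in fix: leave alpha= out of the scatter() call and bake each point's transparency into its face colour with matplotlib.colors.to_rgba, so that the edge colours keep their own alpha channel. The helper names (faces_with_alpha, edges_by_size), the sample data and the 0.001 size scale are made up for illustration; the 700000 threshold and the 0.5 - 0.4 * size / 3000000 ghosting formula are taken from the question.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

OPAQUE_BLACK = (0.0, 0.0, 0.0, 1.0)
TRANSPARENT = (0.0, 0.0, 0.0, 0.0)


def faces_with_alpha(colors, alphas):
    """Return one RGBA tuple per point, with that point's alpha baked in."""
    return [to_rgba(color, alpha) for color, alpha in zip(colors, alphas)]


def edges_by_size(sizes, threshold):
    """Opaque black outline above the threshold, fully transparent otherwise."""
    return [OPAQUE_BLACK if size > threshold else TRANSPARENT for size in sizes]


# Hypothetical sample data standing in for the filtered columns.
sizes = [120.0, 2000.0, 360.0, 900000.0, 1500000.0]
x = [1, 2, 3, 4, 5]
y = [10, 40, 25, 30, 15]
dot_colors = ["purple", "blue", "orange", "black", "navy"]

# Same ghosting idea as in the question: larger bubbles are more transparent.
alphas = [0.5 - 0.4 * (size / 3000000) for size in sizes]
face_colors = faces_with_alpha(dot_colors, alphas)
edge_colors = edges_by_size(sizes, 700000)

fig, ax = plt.subplots()
# No alpha= here: transparency comes from the RGBA values themselves.
collection = ax.scatter(
    x,
    y,
    s=[size * 0.001 for size in sizes],  # arbitrary scale for the demo
    c=face_colors,
    edgecolors=edge_colors,
)

if __name__ == "__main__":
    plt.show()
```

With real data, the same two helpers should work when given bubble_size_filtered, dot_colors and the per-point alpha values directly, since they only iterate over their inputs.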