Color Regions in a Scatter Plot


I recently found out that you can create color regions for scatter plots in Orange. I know Orange sits on top of Python, so I figured I'd be able to recreate this, but I'm having a hard time. I haven't figured out how to convert a pandas DataFrame for use in Orange. More importantly, I'm working in a Spark environment, so going straight from PySpark to Orange would be even better.

I've set up a basic scatter plot in both seaborn and matplotlib to see if I could figure it out.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset from Seaborn
iris = sns.load_dataset("iris")

# Create a scatter plot
sns.scatterplot(x="sepal_length", y="petal_width", hue="species", data=iris)

# Add labels and title
plt.xlabel("Sepal Length")
plt.ylabel("Petal Width")
plt.title("Scatter Plot of Sepal Length vs. Petal Width")

# Show the plot
plt.legend()
plt.show()



Solution

  • According to the Orange documentation:

    If a categorical variable is selected in the Color section, the score is computed as follows. For each data instance, the method finds 10 nearest neighbors in the projected 2D space, that is, on the combination of attribute pairs. It then checks how many of them have the same color. The total score of the projection is then the average number of same-colored neighbors.

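    You can get similar results using scikit-learn's k-nearest neighbours classifier (KNeighborsClassifier): fit it on the two plotted features, predict a class for every point of the plane, and colour the plane by the predicted class. There is an example in their docs that uses the iris dataset as well.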

    I've modified this example to be more similar to the screenshot you shared:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib.colors import ListedColormap
    
    from sklearn import datasets, neighbors
    from sklearn.inspection import DecisionBoundaryDisplay
    
    n_neighbors = 10
    
    # import iris dataset
    iris = datasets.load_iris()
    
    # Select features
    features = [2, 3]
    X = iris.data[:, features]
    y = iris.target
    
    # Create color maps
    cmap_light = ListedColormap(["blue", "red", "green"])
    cmap_bold = ["blue", "red", "green"]
    
    # Create a k-nearest neighbours classifier and fit it to the two selected features
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights="distance")
    clf.fit(X, y)
    
    # Plot boundaries
    _, ax = plt.subplots()
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=cmap_light,
        ax=ax,
        response_method="predict",
        plot_method="pcolormesh",
        xlabel=iris.feature_names[features[0]],
        ylabel=iris.feature_names[features[1]],
        shading="auto",
        alpha=0.3,
    )
    
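    # Plot training points on the same axes as the decision regions
    sns.scatterplot(
        x=X[:, 0],
        y=X[:, 1],
        hue=iris.target_names[y],
        palette=cmap_bold,
        alpha=1.0,
        edgecolor="black",
        ax=ax,
    )

    # Display the figure when running as a script
    plt.show()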
    

    This is the result:

    [Image: Iris dataset coloured by nearest neighbours]
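
The projection score quoted from the Orange documentation can also be sketched directly. This is only an approximation under stated assumptions: it uses Euclidean distance on the raw feature values via scikit-learn's NearestNeighbors, and projection_score is a hypothetical helper name, not part of Orange or scikit-learn. Orange's own implementation may scale features or break ties differently.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors


def projection_score(X2d, labels, k=10):
    """Average number of same-class points among each point's k nearest neighbours.

    Hypothetical helper: an approximation of the score described in the
    Orange documentation, assuming Euclidean distance on unscaled features.
    """
    X2d = np.asarray(X2d)
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=k).fit(X2d)
    # Calling kneighbors() without arguments returns the neighbours of each
    # training point and does not count the point as its own neighbour.
    _, idx = nn.kneighbors()
    neighbour_labels = labels[idx]  # shape (n_samples, k)
    same_colour = (neighbour_labels == labels[:, None]).sum(axis=1)
    return float(same_colour.mean())


iris = datasets.load_iris()
X, y = iris.data, iris.target

# Compare two attribute pairs: sepal length/width and petal length/width
for i, j in [(0, 1), (2, 3)]:
    score = projection_score(X[:, [i, j]], y)
    print(f"{iris.feature_names[i]} vs {iris.feature_names[j]}: {score:.2f}")
```

A higher score means the classes are better separated in that pair of attributes, so the coloured regions produced by the classifier above will tend to be cleaner for that pair.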