
Scatter plot multiple features against one specific feature in a dataset


Edited:

I have a dataset that has 10 features, and a binary classification column.

The dataset looks as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      100 non-null    float64
 1   x2      100 non-null    float64
 2   x3      100 non-null    float64
 3   x4      100 non-null    float64
 4   x5      100 non-null    float64
 5   x6      100 non-null    float64
 6   x7      100 non-null    float64
 7   x8      100 non-null    float64
 8   x9      100 non-null    float64
 9   x10     100 non-null    float64
 10  y       100 non-null    int64  
dtypes: float64(10), int64(1)
memory usage: 8.7 KB
time: 41.6 ms (started: 2021-12-27 10:30:27 +00:00)

I have already plotted these features against one specific feature x10 in a pair plot. It is shown below:

[Image: pair plot of x10 against the other features]

However, I want to split these into separate scatter plots, one per feature: x10 against each of the other 9 features in turn.

I have written the code below:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Generate some data
df = pd.DataFrame({
    'x1': np.random.normal(0, 1, 100),
    'x2': np.random.normal(0, 1, 100),
    'x3': np.random.normal(0, 1, 100),
    'x4': np.random.normal(0, 1, 100),
    'x5': np.random.normal(0, 1, 100),
    'x6': np.random.normal(0, 1, 100),
    'x7': np.random.normal(0, 1, 100),
    'x8': np.random.normal(0, 1, 100),
    'x9': np.random.normal(0, 1, 100),
    'x10': np.random.normal(0, 1, 100),
    'y': np.random.choice([0, 1], 100)})


# split data into X and y
X = df.iloc[:, :10]

# specifying columns and rows for the plot
X_cols = X.columns
y_rows = ['x10']

# # pair plot
# sns_plot = sns.pairplot(data = df, x_vars=X_cols, y_vars=y_rows, hue = 'y', palette='RdBu')

# multiple scatter plots
for feature in X_cols:
   sns.scatterplot(data = df[feature], x=feature , y='x10', hue = 'y', palette='RdBu')
   plt.show()

I'm getting this error:

ValueError                                Traceback (most recent call last)
<ipython-input-24-ad3cca752a2e> in <module>()
     26 # multiple scatter plots
     27 for feature in X_cols:
---> 28    sns.scatterplot(data = df[feature], x=feature , y='x10', hue = 'y', palette='RdBu')
     29    plt.show()
     30 

5 frames
/usr/local/lib/python3.7/dist-packages/seaborn/_core.py in _assign_variables_longform(self, data, **kwargs)
    901 
    902                 err = f"Could not interpret value `{val}` for parameter `{key}`"
--> 903                 raise ValueError(err)
    904 
    905             else:

ValueError: Could not interpret value `x1` for parameter `x`

What am I doing wrong, and how can I fix this to get my desired output?


Solution

  • Addressing the original problem and question, there are three mistakes:

    • indexing a list with a list item, instead of an index (integer)
    • using a list for the y parameter in scatterplot, instead of the column name
    • using a specific column for the data parameter, instead of the full dataframe

    In addition, the columns attribute was needlessly converted to a list and that list iterated over, when the columns attribute can be iterated over directly. Note that in the edited code shown in the question, only the third mistake remains.
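To make the first and third mistakes concrete, here is a small pandas-only sketch. The tiny three-column frame and the variable names (cols, single, list_error) are made up for illustration and are not taken from the original post; it only shows why a list cannot be indexed by one of its items, and why a single selected column no longer carries the other columns that x, y and hue refer to.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: two features and a label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=5),
    "x10": rng.normal(size=5),
    "y": [0, 1, 0, 1, 1],
})

# Mistake 1: a list is indexed by integer position, not by one of its items.
cols = df.columns.to_list()
try:
    cols["x1"]
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Mistake 3: selecting one column returns a Series. It holds only that
# column's values, so names such as "x10" or "y" cannot be looked up in it.
single = df["x1"]
is_series = isinstance(single, pd.Series)
has_columns = hasattr(single, "columns")

print(list_error)
print(is_series, has_columns)
```

Passing the whole DataFrame as data is what lets seaborn resolve the x, y and hue names.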

    The correct code removes the assignments for cols_X and rows_y, and simplifies the loop to the following:

    for feature in X.columns:
        sns.scatterplot(data=normalized_df, x=feature, y='time', hue='binary result', palette='RdBu')
        plt.show()
    

    (Note that X has to be a column-wise subset of normalized_df that excludes the "time" column, to avoid creating a scatter plot of "time" versus "time". Alternatively, that case can be skipped with a quick if feature == "time": continue just above the sns.scatterplot line.)
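Applied to the data in the edited question (df, with x10 as the y axis and y as the hue), the same fix might look like the sketch below. The helper name scatter_each_feature, the per-plot titles and the seeded random generator are illustrative assumptions, not part of the original post.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def scatter_each_feature(df, target="x10", hue="y"):
    """Draw one scatter plot per feature column, each against `target`."""
    axes = []
    # Every column except the target and the class label counts as a feature.
    features = [col for col in df.columns if col not in (target, hue)]
    for feature in features:
        fig, ax = plt.subplots()  # a separate figure for each feature
        # Pass the full DataFrame so x, y and hue can be resolved by name.
        sns.scatterplot(data=df, x=feature, y=target, hue=hue,
                        palette="RdBu", ax=ax)
        ax.set_title(f"{target} vs {feature}")
        axes.append(ax)
    return axes


# Synthetic data shaped like the question's: x1..x10 plus a binary y.
rng = np.random.default_rng(0)
df = pd.DataFrame({f"x{i}": rng.normal(0, 1, 100) for i in range(1, 11)})
df["y"] = rng.choice([0, 1], 100)

axes = scatter_each_feature(df)
plt.show()
```

Filtering out the target column up front plays the same role as the if feature == "time": continue guard mentioned above, so no plot of x10 against itself is produced.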


    For comparison, this was the original code:

    # irrelevant code above omitted
    
    cols_X = X.columns.to_list()
    rows_y = ['time']
    
    for feature in cols_X:
      sns.scatterplot(data = normalized_df[feature], x= cols_X[feature], y= rows_y , hue = 'binary result', palette='RdBu')
      plt.show()