python, pandas, readability

How to write a plt.scatter(x, y) call in one line where y is a function of x


I was plotting a scatter plot to show null values in a DataFrame. As you can see, the plt.scatter() call is not expressive enough: the relation between list(range(0,1200)) and 'a' is not clear unless you read the previous lines. Can plt.scatter(x, y) be written more explicitly, so that it is easy to see how x and y are related? Ideally, somebody who sees only the plt.scatter(x, y) line would understand what it plots.

a = []
for i in range(0,1200):
  feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum()>i]
  a.append(len(feature_with_na))
plt.scatter(list(range(0,1200)), a)

Solution

  • On the x-axis you have a number n, and on the y-axis you want the number of columns in your DataFrame that have more than n null values.

    Instead of your loop, you can count the null values in each column and use NumPy broadcasting (the [:, None] indexing) to compare those counts with an array of your numbers. This lets you define the array xarr of numbers once, then use that same array both as the x values and in the comparison.
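
To make the broadcasting step concrete, here is a minimal sketch on a tiny hand-made array. The values in null_counts are hypothetical stand-ins for what df.isnull().sum().to_numpy() would return; they are not taken from the question's data.

```python
import numpy as np

# Hypothetical null counts for three columns (illustrative values only)
null_counts = np.array([0, 2, 5])

# Thresholds to compare against (the x values of the plot)
xarr = np.arange(0, 4)

# xarr[:, None] has shape (4, 1) and null_counts has shape (3,), so the
# comparison broadcasts to a (4, 3) boolean matrix: row i holds
# "null_counts > xarr[i]" for every column
mask = null_counts > xarr[:, None]

# Summing along axis=1 counts how many columns exceed each threshold
columns_above = mask.sum(axis=1)

print(mask.shape)     # (4, 3)
print(columns_above)  # [2 2 1 1]
```

Each row of mask corresponds to one x value, which is why the sum is taken along axis=1.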

    Sample Data

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    df = pd.DataFrame(np.random.choice([1,2,3,4,5,np.nan], (100,10)))
    

    Code

    # Range of 'x' values to consider
    xarr = np.arange(0, 100)
    
    plt.scatter(xarr, (df.isnull().sum().to_numpy()>xarr[:, None]).sum(axis=1))
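
If the goal is for the plt.scatter line to explain itself, one option is to wrap the comparison in a small named function. This is only a sketch: the function name n_columns_with_more_nulls_than and the three-column frame below are made up for illustration and are not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd

def n_columns_with_more_nulls_than(df, thresholds):
    """For each threshold t, count the columns of df with more than t nulls."""
    # Hypothetical helper: same broadcasting comparison as the one-liner above
    null_counts = df.isnull().sum().to_numpy()
    thresholds = np.asarray(thresholds)
    return (null_counts > thresholds[:, None]).sum(axis=1)

# Small illustrative frame: column "a" has 2 nulls, "b" has 1, "c" has 0
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

thresholds = np.arange(0, 3)
result = n_columns_with_more_nulls_than(df, thresholds)
print(result)  # [2 1 0]
```

With such a helper the plotting call would read plt.scatter(thresholds, n_columns_with_more_nulls_than(df, thresholds)), so the relation between x and y is visible on the line itself. Axis labels via plt.xlabel and plt.ylabel would help further.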
    
