python matplotlib scipy

Gaussian curve stays almost unchanged when outliers are removed


I have a program that plots my data (126 values) as a bar chart and fits a Gaussian curve to it:

from matplotlib.pyplot import *
from numpy import *
from scipy.stats import norm
f=open("data.txt")
items=[]
for i in range(0, 500000, 20000):
    items.append(0)
arr=f.readlines()
for i in range(len(arr)):
    arr[i] = int(arr[i])
for i in range(len(arr)):
    for j in range(0, 500000, 20000):
        if arr[i] < j:
            items[j//20000] += 1
            break
yticks(range(32), fontsize = 5)
xticks(list(range(len(items))), list(range(0, 500, 20)), fontsize = 5)
xlabel("Price, thousand rubles")
ylabel("Number of apartments")
bar(list(range(25)), items, 1, edgecolor = "black")
t_mean = mean(items)
t_variance = var(items)
t_sigma = t_variance ** 0.5
t_x = linspace(-5, 25,50)
plot(t_x,norm.pdf(t_x,t_mean,t_sigma) *500)
show()

Here's the result: [image]

Then I remove the 6 largest values from the dataset, and here's the new result:

[image]

You can see that the outliers are gone from the diagram, but the Gaussian curve is almost unchanged (if you open the two graphs side by side and look REALLY closely you can spot a difference, but otherwise the curves look identical). Is this expected?

data.txt

450000
325000
325000
320000
315000
300000
170000
160000
152000
150000
150000
148000
145000
130000
130000
129000
125000
125000
125000
115000
111000
110000
108000
105000
105000
103000
100000
100000
100000
100000
100000
100000
100000
100000
100000
100000
100000
100000
100000
95000
94000
90000
90000
90000
90000
80000
79000
78000
77000
77000
75000
75000
75000
75000
75000
75000
75000
75000
75000
75000
74000
72000
70000
70000
70000
60000
60000
60000
60000
60000
55000
55000
55000
55000
54000
54000
53000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
49000
48000
47000
45000
45000
45000
45000
45000
40000
40000
40000
40000
40000
38000
38000
37000
35000
35000
35000
32000
32000
32000
30000
30000
30000
30000
30000
30000
30000
28000
28000
25000
23000
23000
21000
18500
18500
18500
18000

Solution

  • I'm sorry to say this, but I think you should start over. Your code is very hard to follow (at least for me). I recommend starting from a working example that uses norm.pdf, such as the first answer to this question, and modifying it to fit your problem. Your code contains many numbers that, to someone who did not write it, seem to come from nowhere, especially because we do not have your data. In larger programs this makes debugging impossible.

    I am also pretty sure that what you are currently doing, taking the mean and standard deviation of items (the per-bin counts), is not what you should be doing. The Gaussian should be fitted to the data values themselves.

    Edit: Using the data you provided, here is a short example of how you could fit a normal distribution to it:

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt
    
    
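    # Load the raw prices, one value per line
    arr = np.loadtxt("data.txt")
    
    # Fit the Gaussian to the data itself, not to the histogram counts
    x_lin = np.linspace(0, 5e5, 1000)
    normal_pdf = norm.pdf(x_lin, np.mean(arr), np.std(arr))
    plt.hist(arr, bins=30)
    # Rescale the curve so its peak (25) is comparable to the bar heights
    plt.plot(x_lin, normal_pdf / np.max(normal_pdf) * 25)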
    plt.show()
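
To see why the curve in the question barely moves, here is a minimal sketch comparing the two sets of statistics: those of the bin counts (what mean(items) and var(items) measure) and those of the prices themselves. The sample is made up for illustration and is not the data from the question. The function names are hypothetical, and the Gaussian density is written out with NumPy instead of calling norm.pdf so the snippet needs nothing but NumPy.

```python
import numpy as np

BIN_WIDTH = 20_000
# 26 edges -> 25 bins covering 0 .. 500000, similar to the question's 20000-wide bins
BIN_EDGES = np.arange(0, 500_001, BIN_WIDTH)
BIN_CENTERS = BIN_EDGES[:-1] + BIN_WIDTH / 2


def stats_of_bin_counts(values):
    """Mean and std of the bar heights (what mean(items) / var(items) measure)."""
    counts, _ = np.histogram(values, bins=BIN_EDGES)
    return counts.mean(), counts.std()


def stats_of_values(values):
    """Mean and std of the prices themselves (what a Gaussian fit needs)."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std()


def expected_counts(values):
    """Gaussian fitted to the values, evaluated at the bin centers.

    pdf(x) * N * bin_width approximates the number of values per bin,
    so the curve can be drawn directly on top of a count histogram.
    """
    mu, sigma = stats_of_values(values)
    pdf = np.exp(-0.5 * ((BIN_CENTERS - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return pdf * len(values) * BIN_WIDTH


# Made-up sample, NOT the data from the question: many cheap flats, a few expensive ones
sample = np.array([30_000] * 40 + [50_000] * 40 + [75_000] * 30 + [100_000] * 10
                  + [300_000, 320_000, 325_000, 450_000])
trimmed = np.sort(sample)[:-4]  # drop the 4 largest values

for name, data in (("all values", sample), ("outliers removed", trimmed)):
    print(name)
    print("  mean/std of bin counts:", stats_of_bin_counts(data))
    print("  mean/std of values:    ", stats_of_values(data))
```

As long as every value falls inside the binned range, the mean of the bin counts is just the number of values divided by the number of bins, wherever those values lie. In the question that is 126/25 before and 120/25 after removing the six largest values, which is why the curve hardly changes. The mean and standard deviation of the values themselves do react to the outliers.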