To be clear, I don't want to remove an entire bin from the histogram. I just want to drop enough data from each over-full bin that its count is brought down to a desired maximum frequency. The line in the image shows the maximum frequency I would like.
For context, I have a dataset containing a number of angles. In terms of the data used, my question is very similar to the one asked in "Remove data above threshold in histogram", but unlike that question, I don't wish to get rid of the data in a bin entirely, just reduce it.
Can I do this directly from the histogram or will I need to just delete some of the data in the dataset?
Edit (sorry, I am new to coding and to formatting here): here is a solution I tried:
    bns = 30
    hist, bins = np.histogram(dataset['Steering'], bins=bns)
    removeddata = []
    spb = 700
    for j in range(bns):
        rdata = []
        for i in range(len(dataset['Steering'])):
            if dataset['Steering'][i] >= bins[j] and dataset['Steering'][i] <= bins[j+1]:
                rdata.append(i)
        rdata = shuffle(rdata)
        rdata = rdata[spb:]
        removeddata.extend(rdata)
    print('removed:', len(removeddata))
    dataset.drop(dataset.index[removeddata], inplace=True)
    print('remaining:', len(dataset))
    center = (bins[:-1] + bins[1:]) * 0.5
    plt.bar(center, hist, width=0.05)
    plt.show()
This is somebody else's solution, and it seemed to work for them, but even when I copy it directly it throws errors. The error I got was "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()". I tried changing 'and' to & and got the error "TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]". I am unsure what exactly this refers to, but the traceback points to the line with the if statement. I checked the dtype of everything and they are all float64, so I am unsure of my next step.
This solution takes into account the clarified requirement that the original input data in excess of the frequency threshold be dropped from the dataset itself. I left my other answer in place because it is simpler and different enough that it may be useful to another user.
To clarify, this answer produces a new 1D array of data with fewer elements and then plots a histogram from that new data. The data are shuffled before the elements are removed (in case the input data were pre-sorted) in order to prevent bias in dropping data from either the low or high side of each bin.
    import numpy as np
    import matplotlib.pyplot as plt
    from random import shuffle

    def remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst):
        # Remove one value from bin `idx` per call until its excess is zero.
        # The recursion depth grows with the excess, so a very large excess
        # could hit Python's recursion limit.
        if to_gate_lst[idx] == 0:
            return data_lst
        else:
            bin_min, bin_max = bins_lst[idx], bins_lst[idx + 1]
            for i in range(len(data_lst)):
                if bin_min <= data_lst[i] < bin_max:
                    del data_lst[i]
                    to_gate_lst[idx] -= 1
                    break
            return remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst)

    threshold = 80
    fig, ax1 = plt.subplots()
    ax1.set_title("Some data")
    np.random.seed(30)
    data = np.random.randn(1000)
    num_bins = 23
    raw_hist, raw_bins = np.histogram(data, num_bins)

    # For each bin, record how many values exceed the threshold (0 if none).
    to_gate = []
    for i in range(len(raw_hist)):
        if raw_hist[i] > threshold:
            to_gate.append(raw_hist[i] - threshold)
        else:
            to_gate.append(0)

    # Shuffle first so that removal is not biased toward one side of a bin.
    data_lst = list(data)
    shuffle(data_lst)
    for idx in range(len(raw_hist)):
        remove_gated_val_recursive(idx, to_gate, raw_bins, data_lst)

    # Build the histogram from the reduced data.
    new_data = np.array(data_lst)
    hist, bins = np.histogram(new_data, num_bins)
    width = 0.7 * (bins[1] - bins[0])
    center = (bins[:-1] + bins[1:]) * 0.5
    ax1.bar(center, hist, align='center', width=width)
    plt.show()
This gives the following histogram, plotted from the new_data array.
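If the data live in a pandas DataFrame, as in the question, the same idea can be sketched without an element-by-element loop. This is only a minimal sketch under some assumptions: the column name 'Steering' is taken from the question, while cap_per_bin and its parameters are hypothetical names introduced here for illustration. It selects rows by position rather than by label, which may also sidestep the "truth value of a Series is ambiguous" error if that error comes from duplicate index labels (a guess, since the index of the original dataset is not shown).

```python
import numpy as np
import pandas as pd


def cap_per_bin(df, column, num_bins, max_per_bin, seed=0):
    """Return a copy of df with at most max_per_bin rows in each histogram bin.

    Hypothetical helper for illustration; not part of NumPy or pandas.
    """
    values = df[column].to_numpy()
    _, edges = np.histogram(values, bins=num_bins)
    # Digitizing against the interior edges yields bin indices 0..num_bins-1
    # and puts the maximum value in the last bin, as np.histogram does.
    bin_idx = np.digitize(values, edges[1:-1])
    rng = np.random.default_rng(seed)
    keep = []
    for b in range(num_bins):
        positions = np.flatnonzero(bin_idx == b)
        if len(positions) > max_per_bin:
            # Random choice without replacement avoids bias within the bin.
            positions = rng.choice(positions, size=max_per_bin, replace=False)
        keep.append(positions)
    keep = np.sort(np.concatenate(keep))
    # iloc selects by position, so the result does not depend on index labels.
    return df.iloc[keep].reset_index(drop=True)


# Synthetic stand-in for the question's dataset (assumed, not the real data).
rng = np.random.default_rng(30)
dataset = pd.DataFrame({"Steering": rng.normal(size=1000)})
counts, edges = np.histogram(dataset["Steering"], bins=23)
capped = cap_per_bin(dataset, "Steering", num_bins=23, max_per_bin=80)
capped_counts, _ = np.histogram(capped["Steering"], bins=edges)
print("rows before:", len(dataset), "rows after:", len(capped))
```

Passing the original edges when re-computing the histogram keeps the bins identical before and after, so capped_counts can be compared directly against counts or plotted with the same bar positions.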