I have multiple arrays that correspond to data parameters of a time-series. The data parameters include things like speed, hour of occurrence, day of occurrence, month of occurrence, elapsed hour of occurrence, etc.
I am trying to find the indices that correspond to a grouping of a specified data parameter from highest to lowest frequency of occurrence.
As a simple example, consider the following:
import numpy as np
speed = np.array([4, 6, 8, 3, 6, 9, 7, 6, 4, 3])*100
elap_hr = np.sort(np.random.randint(low=1, high=40, size=10))  # sorted array of random elapsed hours
## ... other time parameter arrays
print(speed)
# [400 600 800 300 600 900 700 600 400 300]
print(elap_hr)
# [ 1 2 6 7 13 19 21 28 33 38]
So the observed speed = 400 (2 occurrences) corresponds to elapsed hours = 1, 33; speed = 600 (3 occurrences) corresponds to elapsed hours = 2, 13, 28.
For this example, say I am interested in grouping speed by frequency of occurrence. Once I have the indices that group speed from highest to lowest frequency, I can apply the same indices to the other data parameter arrays (like elap_hr).
I first sort and argsort speed; then I find the unique elements of sorted speed. I combine these to find the indices of sorted speed that correspond to each sorted unique speed value, grouped as one sub-array per unique value.
def get_sorted_data(data, sort_type='default'):
    if sort_type == 'default':
        res = sorted(data)
    elif sort_type == 'argsort':
        res = np.argsort(data)
    elif sort_type == 'by size':
        res = sorted(data, key=len)
    else:
        ## fail loudly instead of hitting an unbound `res` below
        raise ValueError("unknown sort_type: {}".format(sort_type))
    return res

def sort_data_by_frequency(data):
    uniq_data = np.unique(data)
    sorted_data = get_sorted_data(data)
    ## positions (in the sorted data) of each unique value
    res = [np.where(sorted_data == uniq_data[i])[0] for i in range(len(uniq_data))]
    ## largest group first
    res = get_sorted_data(res, 'by size')[::-1]
    return res

sorted_speed = get_sorted_data(speed)
argsorted_speed = get_sorted_data(speed, 'argsort')
freqsorted_speed = sort_data_by_frequency(speed)
print(sorted_speed)
# [300, 300, 400, 400, 600, 600, 600, 700, 800, 900]
print(argsorted_speed)
# [3 9 0 8 1 4 7 6 2 5]
print(freqsorted_speed)
# [array([4, 5, 6]), array([2, 3]), array([0, 1]), array([9]), array([8]), array([7])]
In freqsorted_speed, the first sub-array [4, 5, 6] corresponds to the indices of elements [600, 600, 600] in the sorted array.
This works up to this point, but I want the indices to apply to all of the data parameter arrays. So I need to map the positions in the sorted array back to the original array indices.
def get_dictionary_mapping(keys, values):
    ## since all indices are unique, there is no worry about identical keys
    return dict(zip(keys, values))
idx_orig = np.array([i for i in range(len(argsorted_speed))], dtype=int)
index_to_index_map = get_dictionary_mapping(idx_orig, argsorted_speed)
print(index_to_index_map)
# {0: 3, 1: 9, 2: 0, 3: 8, 4: 1, 5: 4, 6: 7, 7: 6, 8: 2, 9: 5}
print(speed[idx_orig])
# [400 600 800 300 600 900 700 600 400 300]
print(speed[argsorted_speed])
# [300 300 400 400 600 600 600 700 800 900]
print([index_to_index_map[idx_orig[i]] for i in range(len(idx_orig))])
# [3, 9, 0, 8, 1, 4, 7, 6, 2, 5]
I have all the necessary pieces to accomplish what I want, but I'm not quite sure how to put this all together. Any advice would be appreciated.
EDIT:
As an end result, I would like to have the original indices of speed grouped by frequency like so:
res = [[1, 4, 7], [3, 9], [0, 8], ...]
## corresponds to 3 600's, 2 300's, 2 400's, etc.
## for values of equal frequency, the secondary grouping is from min-to-max
This way, I can choose the values by the nth most frequent value or by the frequency itself.
Your desired result can be obtained as follows:
>>> idx = np.argsort(speed)
>>> res = sorted(np.split(idx, np.flatnonzero(np.diff(speed[idx])) + 1), key=len, reverse=True)
>>> res
[array([1, 4, 7]), array([3, 9]), array([0, 8]), array([6]), array([2]), array([5])]
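To make the one-liner easier to follow, here is a minimal step-by-step sketch of the same idea. The function name group_indices_by_frequency and the fixed elap_hr values (copied from the printed sample in the question) are only illustrative. One change from the snippet above, which is my own assumption about what is wanted: it passes kind="stable" to np.argsort, so that the indices inside each group are guaranteed to come out in ascending order.

```python
import numpy as np

def group_indices_by_frequency(values):
    """Return original indices grouped by value, most frequent group first."""
    values = np.asarray(values)
    # A stable sort keeps equal values in their original order, so the
    # indices inside each group come out ascending.
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]
    # Positions where the sorted value changes mark the group boundaries.
    boundaries = np.flatnonzero(np.diff(sorted_vals)) + 1
    groups = np.split(order, boundaries)
    # sorted() is stable, so groups of equal size stay in min-to-max value order.
    return sorted(groups, key=len, reverse=True)

speed = np.array([4, 6, 8, 3, 6, 9, 7, 6, 4, 3]) * 100
# Sample elapsed hours taken from the question's printed output.
elap_hr = np.array([1, 2, 6, 7, 13, 19, 21, 28, 33, 38])

groups = group_indices_by_frequency(speed)
most_frequent = groups[0]
print(most_frequent)           # [1 4 7]
print(speed[most_frequent])    # [600 600 600]
print(elap_hr[most_frequent])  # [ 2 13 28]
```

Because each group holds indices into the original array, the same group can be used to index every other parameter array directly. Those arrays need to be NumPy arrays (for example via np.asarray) for this to work, since a plain Python list does not accept an index array.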