I'll try to make this question simple so that hopefully the concept is applicable to other fields.
I work with seismic datasets, where N earthquakes are recorded by M seismic stations. There is earthquake-specific information (earthquake ID, magnitude, etc.), and station-specific information (station ID, instrument sampling rate, etc.). Not all seismic stations record all earthquakes, so there are a variable number of stations that record each earthquake.
Currently, I organize this data in a Pandas DataFrame object, with one line per earthquake. Since there are a variable number of stations for each earthquake, I store the station IDs as a numpy array. Here is a simple example, with earthquakes 0001 and 0002 of magnitude 2.1 and 3.1, respectively. Four seismic stations, a, b, c, and d, have sampling rates of 100, 100, 60, and 60, respectively.
import numpy as np
import pandas as pd
events = {
'event_id': ['0001', '0002'],
'mag': [2.1, 3.1],
'station_ids': [np.array(['a', 'c']), np.array(['a', 'b', 'c', 'd'])],
'station_sampling_rate': [np.array([100, 60]), np.array([100, 100, 60, 60])]
}
events = pd.DataFrame(events)
print(events)
which results in
event_id mag station_ids station_sampling_rate
0 0001 2.1 [a, c] [100, 60]
1 0002 3.1 [a, b, c, d] [100, 100, 60, 60]
Now, when I need station-specific information from this DataFrame, I either need to:

1. loop through each row of events, perform some check, and compare against an existing catalog of stations (slow), or
2. modify events so that it holds increasingly more station-specific information for each station (events gets really big), and still loop through each row of events.

Method 1 has worked for me so far, but is slow. Method 2 seems like a waste of resources.
The desired structure of the above example is this:
station_id sampling_rate event_ids mag
0 a 100 [0001, 0002] [2.1, 3.1]
1 b 100 [0002] [3.1]
2 c 60 [0001, 0002] [2.1, 3.1]
3 d 60 [0002] [3.1]
Is there an efficient way to structure, store, and transform this data? It seems there has to be something existing already to deal with this type of data.
I tried searching, but I don't know what this data structure is called, or if it has a name.
Looping through rows is intrinsically very slow in Pandas and (almost) never needed. For the example given:
events2 = events.explode(['station_ids', 'station_sampling_rate'])
results = events2.groupby(['station_ids', 'station_sampling_rate'], as_index=False)[['mag', 'event_id']].agg(list)
print(results)
gives the DF:
station_ids station_sampling_rate mag event_id
0 a 100 [2.1, 3.1] [0001, 0002]
1 b 100 [3.1] [0002]
2 c 60 [2.1, 3.1] [0001, 0002]
3 d 60 [3.1] [0002]
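For what it's worth, if the exploded event-station pairs are kept as the working table, the same data can also be stored in a fully normalized layout: one table per entity (events, stations) plus a link table recording which station recorded which event. Joining the link table against either entity table then gives either orientation without storing lists of arrays in cells. A minimal sketch, reusing the example values from the question (the table names here are illustrative):

```python
import pandas as pd

# Normalized layout: one table per entity plus a link table for the
# many-to-many event<->station relationship.
events = pd.DataFrame({'event_id': ['0001', '0002'],
                       'mag': [2.1, 3.1]})
stations = pd.DataFrame({'station_id': ['a', 'b', 'c', 'd'],
                         'sampling_rate': [100, 100, 60, 60]})
recordings = pd.DataFrame({'event_id': ['0001', '0001',
                                        '0002', '0002', '0002', '0002'],
                           'station_id': ['a', 'c',
                                          'a', 'b', 'c', 'd']})

# Join the link table to both entity tables, then group by station to
# reproduce the station-oriented view from the answer above.
per_station = (recordings
               .merge(events, on='event_id')
               .merge(stations, on='station_id')
               .groupby(['station_id', 'sampling_rate'], as_index=False)
               [['event_id', 'mag']].agg(list))
print(per_station)
```

Grouping by 'event_id' instead would reproduce the original event-oriented view, so neither orientation has to be treated as the canonical one.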