Search code examples
pythonpandasdataframemulti-index

Using arrays in order to select values from multiindex


I wish to extract values from a multiindex DataFrame, this df has two indexes, a_idx and b_idx. The values to be extracted are i.e. (1,1)

[in] df.loc[(1, 1), :]
[out] 0
Name: (1, 1), dtype: int64

which is as intended. But then if I want to obtain two values (1,2) and (2,3):

[in] df.loc[([1, 2], [2, 3]), :]
[out]
        value
a_idx   b_idx   
1   2   1
3   6
2   2   3
3   9

Which is not what I wanted, I needed the specific pairs, not the 4 values.

Furthermore, I wish to select elements from this database with two arrays select_a and select_b: .loc[[, that have the same length as eachother, but not as the dataframe. So for

select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]

My gist was that I should do this using:

df.loc[(select_a, select_b), :]

and then receive a list of all items with a_idx==select_a[i] and b_idx==select_b[i] for all i in len(select_a).

I have tried xs and slice indexing, but this did not return the desired results. My main reason for going to the indexing method is because of computational speed, as the real dataset is actually 4.3 million lines and the dataset that has to be created will have even more.

If this is not the best way to achieve this result, then please point me in the right direction. Any sources are also welcome, what I found in the pandas documentation was not geared towards this kind of indexing (or at least I have not been able to find it)

The dataframe is created using the following code:

numbers = pd.DataFrame(np.random.randint(0,10,10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
print("before adding the index to the dataframe")
print(numbers)
index_cols = pd.MultiIndex.from_arrays(
        [numbers["a"].values, numbers["b"].values],
        names=["a_idx", "b_idx"])
df = pd.DataFrame(numbers.values,
       index=index_cols,
       columns=numbers.columns.values)
df = df.sort_index()
df.drop(columns=["a","b"],inplace=True)
print("after adding the indexes to the dataframe")
print(df)

Solution

  • You were almost there. To get the pair for those indexes, you need to have the syntax like this:

    df.loc[[(1, 2), (2, 3)], :]
    

    You can also do this using select_a and select_b. Just make sure that you pass the pairs to df.loc as tuples.