python pandas dataframe python-xarray

How to properly access Pandas DataFrame generated from xarray Dataset


I have an xarray dataset created and converted to pandas like so:

import xarray as xr

arr = xr.Dataset(
    coords={
        "test1": range(20000,60000+1,2500),
        "test2": range(10, 100+1),
        "test3": range(1,10+1),
        "count_at_1": 0,
        "count_at_5": 0,
        "count_at_10": 0,
    }
)

df = arr.to_dataframe()

The dataframe looks like this, which seems to be exactly what I want:

                   count_at_1  count_at_5  count_at_10
test1 test2 test3                                     
20000 10    1               0           0            0
            2               0           0            0
            3               0           0            0
            4               0           0            0
            5               0           0            0
...                       ...         ...          ...
60000 100   6               0           0            0
            7               0           0            0
            8               0           0            0
            9               0           0            0
            10              0           0            0

However, while reading a specific value from this dataframe works, assigning to it fails:

print(df["count_at_1"][50000][70][5]) # works fine, prints 0 as it should

df.loc["count_at_1"][50000][70][5] = 10 # does not work, KeyError: 'count_at_1'
df.at["count_at_1"][50000][70][5] = 10 # does not work, gives TypeError

I would also like to print out all the count_at_x values for a given combination of test1, test2, and test3. The output should look something like this:

print(df[50000][70][5])
count_at_1  count_at_5  count_at_10
         0           0            0

Solution

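  • You just have the wrong indexing syntax. Given a single key, .loc looks it up among the row labels, not the columns, which is why df.loc["count_at_1"] raises a KeyError. .at always needs both a row label and a column label. Pass the row label first, as one tuple covering all three index levels, and the column name second: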

    df.loc[(50000, 70, 5), "count_at_1"] = 11
    df.at[(50000, 70, 5), "count_at_1"] = 12
    

    The same syntax works for reading the value back, with either accessor:

    print(df.loc[(50000, 70, 5), "count_at_1"])
    print(df.at[(50000, 70, 5), "count_at_1"])
    

    To get all the values on this row, you can use either:

    >>> df.loc[(50000, 70, 5)]  # Single row = Series
    count_at_1     12
    count_at_5      0
    count_at_10     0
    Name: (50000, 70, 5), dtype: int64
    
    >>> df.loc[[(50000, 70, 5)]]  # Selection of one row = df
                       count_at_1  count_at_5  count_at_10
    test1 test2 test3                                     
    50000 70    5              12           0            0
    

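    I'm not terribly familiar with xarray, but part of your confusion might stem from the fact that Pandas DataFrames are fundamentally 2D: one axis of rows and one of columns. The three index levels (test1, test2, test3) together form a single row label, so a row is addressed with one tuple rather than with a separate pair of brackets per level.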

    For more info, see the Pandas user guide.
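
To try the tuple-based indexing without installing xarray, here is a minimal sketch that rebuilds a smaller stand-in for the question's frame using pandas alone. The `MultiIndex.from_product` construction, the narrower `test2` range, and the names `index`, `key`, `row`, and `one_row_df` are assumptions made for illustration and do not come from the original post.

```python
import pandas as pd

# Assumption: a reduced stand-in for the frame in the question, built
# directly in pandas instead of via xarray's Dataset.to_dataframe().
index = pd.MultiIndex.from_product(
    [range(20000, 60000 + 1, 2500), range(60, 80 + 1), range(1, 10 + 1)],
    names=["test1", "test2", "test3"],
)
df = pd.DataFrame(
    0, index=index, columns=["count_at_1", "count_at_5", "count_at_10"]
)

# One row label made of all three index levels.
key = (50000, 70, 5)

# Row label first, column label second.
df.loc[key, "count_at_1"] = 11  # general label-based assignment
df.at[key, "count_at_5"] = 12   # scalar-only accessor, same argument order

row = df.loc[key]           # single row, returned as a Series
one_row_df = df.loc[[key]]  # list of one key, returned as a DataFrame

print(row)
print(one_row_df)
```

Wrapping the key in a list is what switches the result from a Series to a one-row DataFrame, which is the layout the question asks for when printing all count_at_x values.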