dataframe julia multi-index

Multi-level indexing of data frames in Julia?


How can I apply multi-level indexing to data frames in Julia? Or is there another method, approach, or package that achieves this?

Update

Example Python code:

import numpy as np
import pandas as pd
arrays = [np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
          np.array(["one", "two", "one", "two", "one", "two", "one", "two"]), ]

df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

Output: [image not available: the resulting data frame with a two-level row index]

Thanks!!


Solution

  • I understand your question, but the answer depends on what you need the index for.

    Here is how groupby works:

    julia> using DataFrames
    
    julia> df = DataFrame(x=repeat(["bar", "baz"], inner=3), y=repeat(["one", "two"], outer=3), z=1:6)
    6×3 DataFrame
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     one         1
       2 │ bar     two         2
       3 │ bar     one         3
       4 │ baz     two         4
       5 │ baz     one         5
       6 │ baz     two         6
    
    julia> groupby(df, :x) # 1-level index
    GroupedDataFrame with 2 groups based on key: x
    First Group (3 rows): x = "bar"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     one         1
       2 │ bar     two         2
       3 │ bar     one         3
    ⋮
    Last Group (3 rows): x = "baz"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ baz     two         4
       2 │ baz     one         5
       3 │ baz     two         6
    
    julia> groupby(df, :y) # 1-level index
    GroupedDataFrame with 2 groups based on key: y
    First Group (3 rows): y = "one"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     one         1
       2 │ bar     one         3
       3 │ baz     one         5
    ⋮
    Last Group (3 rows): y = "two"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     two         2
       2 │ baz     two         4
       3 │ baz     two         6
    
    julia> groupby(df, [:x, :y]) # 2-level index
    GroupedDataFrame with 4 groups based on keys: x, y
    First Group (2 rows): x = "bar", y = "one"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     one         1
       2 │ bar     one         3
    ⋮
    Last Group (1 row): x = "baz", y = "one"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ baz     one         5
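
    If the goal is a summary with one row per index entry, the grouped frame can be aggregated directly. This is a minimal sketch, assuming the same `df` as above; the output column names `z_sum` and `count` are illustrative choices, not anything the library requires:

    ```julia
    using DataFrames

    # Same example data as above.
    df = DataFrame(x=repeat(["bar", "baz"], inner=3),
                   y=repeat(["one", "two"], outer=3),
                   z=1:6)

    # Group by both columns, then compute one row per (x, y) pair:
    # the sum of z and the number of rows in each group.
    per_group = combine(groupby(df, [:x, :y]),
                        :z => sum => :z_sum,   # hypothetical output name
                        nrow => :count)        # hypothetical output name
    ```

    The result has one row per (x, y) pair, which is roughly what a pandas frame with a two-level index would show after an aggregation.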
    

    Now an example of indexing with a 2-level index:

    julia> gdf = groupby(df, [:x, :y]) # 2-level index
    GroupedDataFrame with 4 groups based on keys: x, y
    First Group (2 rows): x = "bar", y = "one"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     one         1
       2 │ bar     one         3
    ⋮
    Last Group (1 row): x = "baz", y = "one"
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ baz     one         5
    
    julia> gdf[("bar", "two")]
    1×3 SubDataFrame
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ bar     two         2
    
    julia> gdf[("baz", "two")]
    2×3 SubDataFrame
     Row │ x       y       z
         │ String  String  Int64
    ─────┼───────────────────────
       1 │ baz     two         4
       2 │ baz     two         6
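
    Beyond plain tuples, a few other lookups may be useful. The sketch below is not exhaustive and assumes the same `df` as above; the variable names (`sub`, `has_bar_two`, and so on) are only for illustration:

    ```julia
    using DataFrames

    # Same example data as above.
    df = DataFrame(x=repeat(["bar", "baz"], inner=3),
                   y=repeat(["one", "two"], outer=3),
                   z=1:6)
    gdf = groupby(df, [:x, :y])

    # A NamedTuple key names the levels explicitly
    # (field names are expected to match the grouping columns, in the same order).
    sub = gdf[(x="baz", y="two")]

    # haskey checks whether a group exists without throwing on a missing key.
    has_bar_two = haskey(gdf, ("bar", "two"))
    has_qux_one = haskey(gdf, ("qux", "one"))

    # get returns a default instead of raising a KeyError.
    fallback = get(gdf, ("qux", "one"), nothing)

    # keys(gdf) lists the available group keys; each key can index gdf directly.
    group_sizes = [(k.x, k.y) => nrow(gdf[k]) for k in keys(gdf)]

    # To select by the first level only, group by that single column.
    all_bar = groupby(df, :x)[("bar",)]
    ```

    Grouping by a single column for the partial lookup is a workaround: a `GroupedDataFrame` key covers all grouping columns at once, so there is no direct counterpart of selecting only the outer level of a pandas MultiIndex.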
    

    Note that DataFrames.jl and pandas differ in how index lookup performs. For pandas you have (see here for benchmarks):

    When the index is unique, pandas uses a hash table to map keys to values: O(1). When the index is non-unique and sorted, pandas uses binary search: O(log N). When the index is in random order, pandas needs to check all the keys in the index: O(N).

    while for DataFrames.jl, no matter which source columns you use for indexing, lookup is always O(1).
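
    The sketch below does not measure timing. It only illustrates that key lookup on a `GroupedDataFrame` behaves the same whether the source rows are sorted or shuffled, and whether or not the keys are unique. The seed and the variable names are arbitrary choices for illustration:

    ```julia
    using DataFrames, Random

    # Same example data as above; the (x, y) keys are non-unique.
    df = DataFrame(x=repeat(["bar", "baz"], inner=3),
                   y=repeat(["one", "two"], outer=3),
                   z=1:6)

    # One copy sorted by the key columns, one copy in shuffled row order.
    sorted_df = sort(df, [:x, :y])
    shuffled_df = df[shuffle(MersenneTwister(1), 1:nrow(df)), :]

    gdf_sorted = groupby(sorted_df, [:x, :y])
    gdf_shuffled = groupby(shuffled_df, [:x, :y])

    # The same key returns the same set of rows in both cases
    # (row order within the group follows the source, so compare sorted values).
    z_sorted = sort(gdf_sorted[("baz", "two")].z)
    z_shuffled = sort(gdf_shuffled[("baz", "two")].z)
    same_rows = z_sorted == z_shuffled
    ```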