python machine-learning patsy

Unmodified column name index in patsy


I am using patsy to prepare categorical data for regression and want to map from a column name to its index in the DesignMatrix. I have tried using the column_name_indexes attribute of the DesignInfo object but the column names have been modified to reflect the encoding.

Example using data from the docs:

>>> from patsy import demo_data, dmatrix
>>> data = demo_data("a", nlevels=3)
>>> data
{'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}

>>> x = dmatrix("a", data)
>>> x
DesignMatrix with shape (6, 3)
  Intercept  a[T.a2]  a[T.a3]
          1        0        0
          1        1        0
          1        0        1
          1        0        0
          1        1        0
          1        0        1
  Terms:
    'Intercept' (column 0)
    'a' (columns 1:3)

>>> x.design_info.column_name_indexes
OrderedDict([('Intercept', 0), ('a[T.a2]', 1), ('a[T.a3]', 2)])

I would like to be able to access the column index of e.g. 'a2' by calling:

x.design_info.column_name_indexes['a2']

But of course that raises KeyError: 'a2'. So instead I have to construct the modified key myself in order to obtain the desired column index 1:

x.design_info.column_name_indexes['a[T.a2]']

Is there a way to access the column index by referring to the unmodified feature/column name, i.e. 'a2' rather than having to construct the modified key, i.e. 'a[T.a2]'?


Solution

  • In general, there is no one-to-one mapping between categorical values like a2 and design matrix columns. The column you're talking about is already more complicated than that -- it's a treatment contrast between the a2 and a1 values -- and things can be arbitrarily more complicated than that (e.g. consider Helmert or polynomial coding).

    If you know that you want to look up the treatment contrast associated with the level a2 of the variable a, then you can use:

    def column_for_treatment(design_info, factor, value):
        column_name = "{}[T.{}]".format(factor, value)
        return design_info.column_name_indexes[column_name]
    
    column_for_treatment(x.design_info, "a", "a2")
    

    It's a bit silly-looking, but it should work, and I'm not sure what would be better given the general problems mentioned above.
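If you need to look up several levels, one option is to build the reverse lookup once by parsing the column names instead of formatting each key by hand. The following is only a sketch: `treatment_level_indexes` is a hypothetical helper, not part of patsy, and it assumes the default treatment-coding names of the form `factor[T.level]` shown in the question. Columns from other codings (Helmert, polynomial) or from interaction terms are skipped, for the reasons given above. A plain `OrderedDict` stands in for `x.design_info.column_name_indexes` so the snippet runs without patsy installed.

```python
import re
from collections import OrderedDict

# Matches names such as "a[T.a2]" (default treatment-coding names).
# Names containing ":" (interaction columns) or no "[T....]" part do not match.
_TREATMENT_RE = re.compile(r"^(?P<factor>[^\[\]:]+)\[T\.(?P<level>[^\[\]]+)\]$")


def treatment_level_indexes(column_name_indexes):
    """Map (factor, level) -> column index for treatment-coded columns only.

    Hypothetical helper: `column_name_indexes` is any mapping from column
    name to index, such as x.design_info.column_name_indexes.
    """
    lookup = {}
    for name, index in column_name_indexes.items():
        match = _TREATMENT_RE.match(name)
        if match:
            lookup[(match.group("factor"), match.group("level"))] = index
    return lookup


# Stand-in for x.design_info.column_name_indexes from the question.
column_name_indexes = OrderedDict(
    [("Intercept", 0), ("a[T.a2]", 1), ("a[T.a3]", 2)]
)

lookup = treatment_level_indexes(column_name_indexes)
print(lookup[("a", "a2")])  # prints 1
```

Note that the reference level a1 has no entry: under treatment coding it has no column of its own, so `lookup[("a", "a1")]` raises a KeyError just as the direct lookup would.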