
Python Pandas: how to turn a DataFrame with "factors" into a design matrix for linear regression?


If memory serves me, R has a data type called factor which, when used within a DataFrame, can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:

1 0 0
0 1 0
or
0 0 1

for the purpose of using lower level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within pandas, but since I have my own customised regression routines, I am really interested in constructing the design matrix (a 2D numpy array or matrix) from heterogeneous data, with support for mapping back and forth between the columns of the numpy object and the pandas DataFrame from which it is derived.

Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):

>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
       a  b         c
0    one  x  0.000343
1    one  y -0.055651
2    two  y  0.249194
3  three  x -1.486462
4    two  y -0.406930
5    one  x -0.223973
6    six  x -0.189001
>>> 

The 'a' column should be converted into 4 floating point columns (despite what the labels suggest, there are only four unique values), the 'b' column can be converted into a single floating point column, and the 'c' column should pass through unmodified as the final column of the design matrix.
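
To make the target layout concrete, here is a rough sketch of it built by hand with `pandas.get_dummies`. This is only an illustration of the shape described above, not necessarily the best way to get there; the names `a_cols`, `b_cols`, `design` and `column_names` are made up for the example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": np.random.randn(7),
})

# One indicator column per unique value of 'a' (four columns).
a_cols = pd.get_dummies(df2["a"], prefix="a", dtype=float)

# 'b' has two values, so a single indicator column is enough;
# drop_first=True drops the column for the first level ('x').
b_cols = pd.get_dummies(df2["b"], prefix="b", drop_first=True, dtype=float)

# 'c' is already numeric and is appended unchanged as the last column.
design = pd.concat([a_cols, b_cols, df2[["c"]]], axis=1)

X = design.to_numpy()                # 2D float array for the regression code
column_names = list(design.columns)  # maps array columns back to the DataFrame
```

Position `i` in `column_names` names column `i` of `X`, which gives a crude way to map between the array and the original DataFrame.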

Thanks,

SetJmp


Solution

  • There is a module called patsy that solves this problem. The quickstart in the patsy documentation covers exactly the problem described above in a couple of lines of code.

    Here is an example usage:

    import pandas
    import patsy
    
    dataFrame = pandas.read_csv("salary2.txt")
    # salary2.txt is a re-formatted data set from the textbook
    # Introductory Econometrics: A Modern Approach
    # by Jeffrey Wooldridge
    y, X = patsy.dmatrices("sl ~ 1+sx+rk+yr+dg+yd", dataFrame)
    # X.design_info provides the metadata behind the X columns
    print(X.design_info)
    

    generates:

    > DesignInfo(['Intercept',
    >             'sx[T.male]',
    >             'rk[T.associate]',
    >             'rk[T.full]',
    >             'dg[T.masters]',
    >             'yr',
    >             'yd'],
    >            term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
    > (Term([EvalFactor('rk')]), slice(2, 4, None)),
    > (Term([EvalFactor('dg')]), slice(4, 5, None)),
    > (Term([EvalFactor('yr')]), slice(5, 6, None)),
    > (Term([EvalFactor('yd')]), slice(6, 7, None))]),
    >            builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
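
The same approach can be applied to the `df2` frame from the question. The following is a minimal sketch, assuming patsy is installed. The formula `"0 + a + b + c"` is my own choice: it drops the intercept so that 'a' gets one indicator column per level, which matches the layout the question asks for. The variable names `X_array`, `names`, `a_slice` and `a_block` are illustrative only.

```python
import numpy as np
import pandas as pd
import patsy

df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": np.random.randn(7),
})

# One-sided formula: build only the design matrix, with no response variable.
# "0 +" removes the intercept, so the first categorical term ('a') is coded
# with a full set of indicator columns.
X = patsy.dmatrix("0 + a + b + c", df2)

# Plain 2D numpy array for custom regression routines.
X_array = np.asarray(X)

# Column names, in the same order as the columns of X_array.
names = X.design_info.column_names

# Slice of columns generated by the term 'a', for mapping back to the DataFrame.
a_slice = X.design_info.slice("a")
a_block = X_array[:, a_slice]
```

`design_info.column_names` and `design_info.slice(...)` together give the mapping from DataFrame columns to design matrix columns and back.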