How to prepare large datasets with Patsy's API?


I'm running a logistic regression and having trouble using Patsy's API to prepare the data once the dataset is larger than a small sample.

Calling the dmatrices function directly on a DataFrame fails with the error below. (Note: I spun up an EC2 instance with 300 GB of RAM after hitting this on my laptop, and got the same error.)

Traceback (most recent call last):
  File "My_File.py", line 22, in <module>
    df, return_type="dataframe")
  File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
    NA_action, return_type)
  File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 156, in do_highlevel_design
    return_type=return_type)
  File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 989, in build_design_matrices
    results.append(builder._build(evaluator_to_values, dtype))
  File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 821, in _build
    m = DesignMatrix(np.empty((num_rows, self.total_columns), dtype=dtype),
MemoryError

So, I combed through Patsy's docs and found this gem:

patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0)
    Construct a design matrix builder incrementally from a large data set.

However, the method is sparsely documented, and the source code is largely uncommented.

I have arrived at this code:

import csv
from patsy import dmatrix, incr_dbuilders

def iter_maker():
    with open("test.tsv", "r") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield(row)


y, dta = incr_dbuilders("s ~ C(x) + C(y):C(rgh) + \
C(z):C(f) + C(r):C(p) + C(q):C(w) + \
C(zr):C(rt) + C(ff):C(djjj) + C(hh):C(tt) + \
C(bb):lat + C(jj):lng + C(ee):C(bb) + C(qq):C(uu)",
        iter_maker)

df = dmatrix(dta, {}, 0, "drop", return_type="dataframe")

but I receive PatsyError: Error evaluating factor: NameError: name 'ff' is not defined

This is being thrown because _try_incr_builders (called from dmatrix) is returning None on line 151 of highlevel.py

What is the correct way to use these Patsy functions to prepare my data? Any examples or guidance you may have will be helpful.


Solution

  • y and dta are DesignInfo objects -- they encode all the information needed to take a row of a data frame and convert it to a row of a design matrix. They do not, however, contain your actual data -- to get a piece of your design matrix, you have to give them a piece of your data. That is why passing an empty dict to dmatrix fails with a NameError: there is nothing to look the variables up in. To use them, you need to do something like:

    for data_chunk in iter_maker():
      y_chunk, design_chunk = dmatrices((y, dta), data_chunk,
                                        NA_action="drop", return_type="dataframe")
      # do something with y_chunk and design_chunk
      # ...
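
For a fuller picture, here is a minimal sketch of the same loop with the data fed in batches rather than one row at a time. It rests on some assumptions that are not from the question: the helper names make_iter_maker and design_chunks, the chunk_size parameter, and the choice of which columns are numeric are all hypothetical. It also assumes that each item yielded by the iterator can be a dict mapping column names to lists of values (the same shape dmatrices accepts as its data argument), and that pandas is installed for return_type="dataframe".

```python
import csv
from itertools import islice


def make_iter_maker(path, chunk_size=10000, numeric=("s", "lat", "lng")):
    """Return a zero-argument callable that yields column-oriented chunks.

    `numeric` is an assumption about which columns hold numbers; every
    other column is left as strings so Patsy can treat it as categorical.
    """
    def iter_maker():
        with open(path, "r", newline="") as f:
            reader = csv.DictReader(f, delimiter="\t")
            while True:
                # Pull up to chunk_size rows without loading the whole file.
                rows = list(islice(reader, chunk_size))
                if not rows:
                    break
                chunk = {}
                for name in reader.fieldnames:
                    values = [row[name] for row in rows]
                    if name in numeric:
                        values = [float(v) for v in values]
                    chunk[name] = values
                yield chunk
    return iter_maker


def design_chunks(formula, iter_maker):
    """Yield (y_chunk, design_chunk) pairs, one pair per data chunk."""
    # Imported here so the chunking helper above works without Patsy.
    from patsy import dmatrices, incr_dbuilders

    # First pass(es) over the data: learn the category levels and
    # column layout. No design matrix is allocated at this point.
    y_info, x_info = incr_dbuilders(formula, iter_maker)

    # Second pass: build the matrices one chunk at a time.
    for chunk in iter_maker():
        yield dmatrices((y_info, x_info), chunk,
                        NA_action="drop", return_type="dataframe")
```

Usage would look roughly like `for y_chunk, design_chunk in design_chunks("s ~ C(x) + C(bb):lat", make_iter_maker("test.tsv")): ...`. Patsy calls the iterator maker itself, possibly more than once, which is why a callable is passed instead of an iterator. Because the columns come back from csv as strings, anything not listed in numeric will be treated as categorical, so the response column needs to be converted if it should be numeric.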