Changing Schemas with Hadoop Cascading

I'm trying to figure out how to use Cascading against an archive of data whose schema is additive over time. What I mean by additive is that it might start out with 3 columns, for example, and then have 5 columns in the next release. The files follow standard CSV layouts. My understanding is that if I specify a schema that is 5 columns long and the old data has only 3, then Cascading will fail.

Is there a way to tell Cascading to fill in the missing columns? Something like a default = null?

Solution

It turns out that, in the case of delimited text, there is a special constructor for the scheme. The Cascading JavaDoc for that constructor says that we can adjust the strictness of the parse. If you set strict to false, Cascading will load rows that have fewer columns than the declared schema and append null for each missing trailing column. The confusion on this seems understandable, since there are two threads about how to do this in the Cascading user group.
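
To make the effect concrete, here is a minimal plain-Java sketch of what a non-strict parse does to a short row. It does not use Cascading and is not how Cascading implements it; the class name LenientCsv and the helper parseLenient are hypothetical, and the naive comma split ignores quoting. It only shows the padding behaviour described above, assuming missing columns are always at the end of the row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LenientCsv {

    // Hypothetical helper (not part of Cascading): splits a simple
    // comma-separated line and pads missing trailing columns with null,
    // mimicking what a non-strict parse is described as doing.
    static List<String> parseLenient(String line, int expectedColumns) {
        // A limit of -1 keeps trailing empty strings instead of dropping them.
        List<String> values = new ArrayList<>(Arrays.asList(line.split(",", -1)));
        while (values.size() < expectedColumns) {
            values.add(null); // old 3-column rows gain nulls for the newer columns
        }
        return values;
    }

    public static void main(String[] args) {
        // Row written under the old 3-column schema, read with the 5-column schema.
        System.out.println(parseLenient("a,b,c", 5));     // prints [a, b, c, null, null]
        // Row written under the current 5-column schema is unchanged.
        System.out.println(parseLenient("a,b,c,d,e", 5)); // prints [a, b, c, d, e]
    }
}
```

Downstream steps then have to tolerate null in the newer columns for rows that came from the older files.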