I'm looking for guidance on the best way to structure an extensive ETL process. My pipeline has a reasonably clean extract section and loads into a designated file succinctly, but the only way I can think to express the transformation steps is a series of variable assignments:
a = ['some','form','of','petl','data']
b = petl.addfield(a, 'NewStrField', str(a))
c = petl.addrownumbers(b)
d = petl.rename(c, 'row', 'ID')
.......
Rewriting it to reassign the same variable name makes some sense, but doesn't aid readability:
a = ['some','form','of','petl','data']
a = petl.addfield(a, 'NewStrField', str(a))
a = petl.addrownumbers(a)
a = petl.rename(a, 'row', 'ID')
.......
I've read about chaining multiple method calls like this:
a = ['some','form','of','data']
result = (petl.addfield(a, 'NewStrField', str(a))
          .addrownumbers(a)
          .rename(a, 'row', 'ID'))
.......
but that won't work, since each function requires the table as its first parameter.
Is there some fundamental concept I'm missing? I'm loath to believe that the right way of doing this commercially involves 1000+ LOC.
Create a list of partially applied functions, then loop over that list.
transforms = [
    lambda x: petl.addfield(x, 'NewStrField', str(x)),
    petl.addrownumbers,
    lambda x: petl.rename(x, 'row', 'ID'),
]

a = ['some', 'form', 'of', 'petl', 'data']
for f in transforms:
    a = f(a)
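The same loop can be written as a fold with the standard library's functools.reduce. Here's a minimal sketch of the pattern; the two transforms are hypothetical stand-ins for the petl calls above, so it runs without petl installed:

```python
from functools import reduce

# Hypothetical stand-in transforms: each takes a table-like value
# and returns a new one, just like the petl functions above.
transforms = [
    lambda x: x + ['extra'],       # analogous to petl.addfield
    lambda x: list(enumerate(x)),  # analogous to petl.addrownumbers
]

a = ['some', 'form', 'of', 'data']

# Thread the accumulator through each transform in order.
result = reduce(lambda acc, f: f(acc), transforms, a)
# result == [(0, 'some'), (1, 'form'), (2, 'of'), (3, 'data'), (4, 'extra')]
```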
Your "total" transformation is the composition of the transformations in the list transforms. You can compose them up front (at the cost of some additional function calls) using a library that provides function composition, or by rolling your own:
def compose(*f):
    if not f:
        return lambda x: x  # The identity function, the identity for function composition
    return lambda x: f[0](compose(*f[1:])(x))
# Note the reversed order of the functions compared to
# the list above.
transform = compose(
    lambda x: petl.rename(x, 'row', 'ID'),
    petl.addrownumbers,
    lambda x: petl.addfield(x, 'NewStrField', str(x)),
)
a = ['some', 'form', 'of', 'petl', 'data']
result = transform(a)
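To see that compose applies its arguments right-to-left, here's a quick check with plain arithmetic functions in place of the petl calls, reusing the compose definition above:

```python
def compose(*f):
    if not f:
        return lambda x: x  # identity
    return lambda x: f[0](compose(*f[1:])(x))

add_one = lambda x: x + 1
double = lambda x: x * 2

# The rightmost function runs first, so double applies before add_one:
assert compose(add_one, double)(5) == 11  # (5 * 2) + 1
assert compose(double, add_one)(5) == 12  # (5 + 1) * 2
```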