Tags: python · scikit-learn · pipeline · automl · neuraxle

AutoML Pipelines: Label extraction from input data and sampling within Neuraxle or SKLearn Pipelines


I am working on a project that calls for a lean Python AutoML pipeline implementation. As per the project definition, data entering the pipeline comes in the form of serialised business objects, e.g. (artificial example):

property.json:
{
   "area": "124",
   "swimming_pool": "False",
   "rooms" : [
      ... some information on individual rooms ...
   ]
}

Machine learning targets (e.g. predicting whether a property has a swimming pool based on other attributes) are stored within the business object rather than delivered in a separate label vector, and business objects may contain observations that should not be used for training.

What I am looking for

I need a pipeline engine that supports initial (or later) pipeline steps that i) dynamically change the targets of the machine learning problem (e.g. extract them from the input data, threshold real values) and ii) resample the input data (e.g. upsampling or downsampling of classes, filtering of observations).

The pipeline ideally should look as follows (pseudocode):

swimming_pool_pipeline = Pipeline([
    ("label_extractor", SwimmingPoolExtractor()),  # skipped in prediction mode
    ("sampler", DataSampler()),  # skipped in prediction mode
    ("featurizer", SomeFeaturization()),
    ("my_model", FitSomeModel())
])

swimming_pool_pipeline.fit(training_data)  # not passing in any labels
preds = swimming_pool_pipeline.predict(test_data)

The pipeline execution engine needs to fulfill/allow for the following:

  • During model training (.fit()), SwimmingPoolExtractor extracts target labels from the input training data and passes the labels on (alongside the independent variables);
  • In training mode, DataSampler() uses the target labels extracted in the previous step to sample observations (e.g. it could do minority upsampling or filter observations);
  • In prediction mode, SwimmingPoolExtractor() does nothing and just passes on the input data;
  • In prediction mode, DataSampler() does nothing and just passes on the input data.

Example

For example, assume that the data looks as follows:

property.json:
"properties" = [
    { "id_": "1",
      "swimming_pool": "False",
      ..., 
    },
    { "id_": "2",
      "swimming_pool": "True",
      ..., 
    },
    { "id_": "3",
      # swimming_pool key missing
      ..., 
    }
]}

The application of SwimmingPoolExtractor() would extract something like:

"labels": [
    {"id_": "1", "label": "0"}, 
    {"id_": "2", "label": "1"}, 
    {"id_": "3", "label": "-1"}
]

from the input data and set these as the machine learning pipeline's "targets".

The application of DataSampler() could, for example, further include logic that removes from the entire set of training data any instance that did not contain a swimming_pool key (label = -1).

Subsequent steps should use the modified training data (filtered, not including the observation with id_=3) to fit the model. As stated above, in prediction mode, DataSampler and SwimmingPoolExtractor would just pass the input data through.
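In plain Python, the intended extraction and filtering semantics could be sketched roughly as follows (purely illustrative helper functions, not part of any pipeline library):

def extract_labels(properties):
    # Map each business object to a label: "1"/"0" when the
    # swimming_pool key is present, "-1" when it is missing.
    labels = []
    for prop in properties:
        if "swimming_pool" not in prop:
            label = "-1"
        elif prop["swimming_pool"] == "True":
            label = "1"
        else:
            label = "0"
        labels.append({"id_": prop["id_"], "label": label})
    return labels

def drop_unlabelled(properties, labels):
    # Remove observations whose label is "-1" (no swimming_pool key).
    keep = {lab["id_"] for lab in labels if lab["label"] != "-1"}
    return [prop for prop in properties if prop["id_"] in keep]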

How To

To my knowledge, neither neuraxle nor sklearn (for the latter I am certain) offers pipeline steps that meet the required functionality (from what I have gathered so far, neuraxle must at least support slicing data, given that it implements cross-validation meta-estimators).

Am I missing something, or is there a way to implement such functionality in either of the pipeline models? If not, are there alternatives to the listed libraries within the Python ecosystem that are reasonably mature and support such use cases (leaving aside issues that might arise from designing pipelines in this manner)?


Solution

  • "Am I missing something, or is there a way to implement such functionality"

    Yes, all you want to do can be done rather easily with Neuraxle:

    1. You're missing out on the output handlers, which transform output data! With these, you can move part of x into y within the pipeline (and thus effectively avoid passing any labels to fit, just as you want to do).
    2. You're also missing out on the TrainOnlyWrapper, which applies its wrapped step only at train time! This is useful for deactivating any pipeline step at test time (and also at validation time). Note that this way, the wrapped steps won't do the data filtering or resampling when evaluating the validation metrics.
    3. You could also use the AutoML object to do the training loop.

    All of this works provided that the input data you pass to fit is an iterable of something (e.g. don't pass the whole JSON at once; at least make it something that can be iterated over). At worst, pass a list of IDs and add a step that converts the IDs into the business objects, for instance using an object that can go fetch the JSON by itself and do whatever it needs with the passed IDs.
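
    Such an ID-to-object step could be sketched as follows (a hypothetical illustration: the one-file-per-ID layout is an assumption, and import paths may vary between Neuraxle versions):

    import json

    from neuraxle.base import BaseStep, NonFittableMixin

    class IdsToBusinessObjects(NonFittableMixin, BaseStep):
        """Hypothetical step: turns a list of IDs into deserialised business objects.

        Assumes each business object is stored as "<id>.json" inside json_folder.
        """
        def __init__(self, json_folder: str):
            BaseStep.__init__(self)
            self.json_folder = json_folder

        def transform(self, data_inputs):
            # data_inputs is a list of IDs; load the matching JSON files.
            objects = []
            for id_ in data_inputs:
                with open(f"{self.json_folder}/{id_}.json") as f:
                    objects.append(json.load(f))
            return objects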

    Here is your updated code:

    from neuraxle.base import BaseStep, NonFittableMixin
    from neuraxle.pipeline import Pipeline
    from neuraxle.steps.flow import TrainOnlyWrapper
    from neuraxle.steps.output_handlers import InputAndOutputTransformerMixin
    # Note: exact import paths may vary slightly between Neuraxle versions.
    
    class SwimmingPoolExtractor(NonFittableMixin, InputAndOutputTransformerMixin, BaseStep):
        # Note: if you encounter problems, you may need to remove NonFittableMixin
        # from the list here and define "fit" yourself, rather than having the
        # mixin provide it by default.
        def transform(self, data_inputs):
            # Here, the InputAndOutputTransformerMixin will pass 
            # a tuple of (x, y) rather than just x. 
            x, _ = data_inputs
    
            # Please note that you should pre-split your json into 
            # lists before the pipeline so as to have this assert pass: 
            assert hasattr(x, "__iter__"), "input data must be iterable at least."
            x, y = self._do_my_extraction(x)  # TODO: implement this as you wish!
    
            # Note that InputAndOutputTransformerMixin expects you 
            # to return a (x, y) tuple, not only x.
            outputs = (x, y) 
            return outputs
    
    class DataSampler(NonFittableMixin, BaseStep):
        def transform(self, data_inputs):
            # TODO: implement this as you wish!
            data_inputs = self._do_my_sampling(data_inputs)
    
            assert hasattr(data_inputs, "__iter__"), "data must stay iterable at least."
            return data_inputs
    
    swimming_pool_pipeline = Pipeline([
        TrainOnlyWrapper(SwimmingPoolExtractor()),  # skipped in `.predict(...)` call
        TrainOnlyWrapper(DataSampler()),  # skipped in `.predict(...)` call
        SomeFeaturization(),
        FitSomeModel()
    ])
    
    swimming_pool_pipeline.fit(training_data)  # not passing in any labels!
    preds = swimming_pool_pipeline.predict(test_data)
    
    
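    As a quick aside, the wrapped steps are skipped because TrainOnlyWrapper checks the step's train flag; assuming the set_train method of recent Neuraxle versions, you can also toggle that flag explicitly:

    # Force evaluation behaviour without calling .predict(...):
    swimming_pool_pipeline.set_train(False)  # TrainOnlyWrapper steps now pass data through
    preds = swimming_pool_pipeline.transform(test_data)
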

    Note that you could also do as follows to replace the call to fit:

    from sklearn.metrics import mean_squared_error
    from neuraxle.metaopt.auto_ml import AutoML, InMemoryHyperparamsRepository, ValidationSplitter
    from neuraxle.metaopt.callbacks import ScoringCallback

    auto_ml = AutoML(
        swimming_pool_pipeline,
        # You can create your own splitter class if needed to replace this one.
        # Dig into the source code of Neuraxle and see how it's done there to
        # create your own replacement.
        validation_splitter=ValidationSplitter(0.20),
        refit_trial=True,
        n_trials=10,
        epochs=1,
        cache_folder_when_no_handle=str(tmpdir),  # tmpdir: any cache folder path you choose
        scoring_callback=ScoringCallback(mean_squared_error, higher_score_is_better=False),  # mean_squared_error from sklearn
        hyperparams_repository=InMemoryHyperparamsRepository(cache_folder=str(tmpdir))
    )
    
    best_swimming_pool_pipeline = auto_ml.fit(training_data).get_best_model()
    preds = best_swimming_pool_pipeline.predict(test_data)
    

    Side note if you want to use the advanced data caching features

    If you want to use caching, you should not define any transform methods; instead, you should define handle_transform methods (or related handler methods) so as to keep the order of the data IDs when you resample the data. Neuraxle is made to process iterable data, and this is why I've added some asserts above, so as to ensure your JSON is already preprocessed into some kind of list of something.
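
    For example, a handler-based sampler might look roughly like this (a sketch assuming the DataContainer API of Neuraxle around version 0.5; attribute and method names may differ in your version):

    from neuraxle.base import BaseStep, ExecutionContext, NonFittableMixin
    from neuraxle.data_container import DataContainer

    class DataSamplerWithHandlers(NonFittableMixin, BaseStep):
        def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:
            # Filter while keeping current_ids, data_inputs and expected_outputs
            # aligned, so that data caching can still track each observation.
            kept = [
                (id_, x, y)
                for id_, x, y in zip(data_container.current_ids,
                                     data_container.data_inputs,
                                     data_container.expected_outputs)
                if y != "-1"  # drop unlabelled observations
            ]
            ids, xs, ys = map(list, zip(*kept)) if kept else ([], [], [])
            return DataContainer(current_ids=ids, data_inputs=xs, expected_outputs=ys)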

    Other useful code references: