AzureML: experiment working for a subset and not for the whole dataset

Some time ago I wrote some code in AzureML that ran into "out of memory" issues. I tried splitting it into three separate scripts, and that partially worked. One part remains that (I think) is also affected by memory issues.

I have created an experiment and published it at this link.

One module takes only a sample of my dataset, and with it the experiment runs. This suggests that the code itself is correct. If you remove the sampling module (the second module from the top)

and connect the original dataset directly, you get the following situation

which produces the following error:

Does anyone know a way to find out where Azure crashes?

Thank you,

Andrea

Solution

Thanks so much for publishing the example -- this really helped to understand the issue. I suspect that you want to modify the gsub() calls in your script by adding the argument "fixed=TRUE" to each. (The documentation for this function is here.)

What appears to have happened is that somewhere in your full dataset -- but not in the subsampled dataset -- there is some text that winds up being included in df[i, "names"] as "(art.". Your script pads this into "\\b(art.\\b". The gsub() function tries to interpret this as a regular expression instead of a simple string, then throws an error because it is not a valid regular expression: it contains an opening parenthesis but no closing parenthesis. I believe that you actually did not want gsub() to interpret the input as a regular expression in the first place, and specifying gsub(..., fixed=TRUE) will correct that.
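To make the failure concrete, here is a minimal sketch in R. The `name` and `text` values are made up for illustration and are not taken from your dataset; only the "(art." string and the "\\b" padding come from the script. One thing to keep in mind: with fixed=TRUE the "\\b" padding is also treated as literal text, so the padding would need to be dropped (or the metacharacters escaped instead) for the replacement to match anything.

```r
# Hypothetical stand-ins for df[i, "names"] and the text being cleaned.
name <- "(art."
text <- "see (art. 5) of the decree"

# Padded pattern as built by the script: \b(art.\b
pattern <- paste0("\\b", name, "\\b")

# Interpreted as a regular expression, the pattern is invalid because
# the opening parenthesis is never closed, so gsub() raises an error.
regex_result <- tryCatch(
  gsub(pattern, "", text),
  error = function(e) "regex error"
)

# With fixed = TRUE the unpadded name is matched as a plain string.
fixed_result <- gsub(name, "", text, fixed = TRUE)

# With fixed = TRUE the padded pattern is also literal, so the "\b"
# characters would have to appear in the text; here nothing matches.
padded_fixed_result <- gsub(pattern, "", text, fixed = TRUE)

print(regex_result)
print(fixed_result)
print(padded_fixed_result)
```

If the word-boundary behaviour matters for your use case, escaping the regex metacharacters in each name before padding is an alternative to fixed=TRUE.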

I believe this error disappears when you add the sample/partition module because, by chance, the problematic input value was dropped on subsampling. I do not think it is an issue of available resources on Azure ML. (Caveat: I cannot confirm the fix works yet; I made the suggested update and started running the experiment, but it has not yet completed successfully.)
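If you want to check which values trigger the error before running the whole experiment, something along these lines might help. It is only a sketch: `names_col` is a hypothetical stand-in for your df[, "names"] column, and `is_valid_regex` is a helper I made up for this example, not something from your script.

```r
# Hypothetical stand-in for the "names" column of the full dataset.
names_col <- c("art", "(art.", "comma")

# Returns TRUE if the pattern compiles as a regular expression,
# FALSE if grepl() raises an error while compiling it.
is_valid_regex <- function(p) {
  tryCatch({
    grepl(p, "")
    TRUE
  }, error = function(e) FALSE)
}

# Pad each name the same way the script does, then test it.
padded <- paste0("\\b", names_col, "\\b")
valid <- vapply(padded, is_valid_regex, logical(1), USE.NAMES = FALSE)

# Names that would make gsub() fail when used as a regex.
bad_names <- names_col[!valid]
print(bad_names)
```

Running this on the full column and on the sampled one should show whether the offending values are present only in the full dataset.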