Tags: python, dataframe, csv, rapidminer

ValueError when returning pandas DataFrame from Execute Python processor in RapidMiner Studio


In RapidMiner Studio 9.5.1, after my Python script completes, I can print the resulting DataFrame and see that it is produced as expected, with the proper columns. Yet the RapidMiner processor fails with the message:

Exception: com.rapidminer.operator.OperatorException
Message: Script terminated abnormally: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Stack trace:
  com.rapidminer.extension.pythonscripting.operator.scripting.AbstractScriptRunner.run(AbstractScriptRunner.java:137)
  com.rapidminer.extension.pythonscripting.operator.scripting.AbstractScriptingLanguageOperator.doWork(AbstractScriptingLanguageOperator.java:210)
  com.rapidminer.extension.pythonscripting.operator.scripting.python.PythonScriptingOperator.doWork(PythonScriptingOperator.java:434)
  com.rapidminer.operator.Operator.execute(Operator.java:1032)
  com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:77)
  com.rapidminer.operator.ExecutionUnit$2.run(ExecutionUnit.java:812)
  com.rapidminer.operator.ExecutionUnit$2.run(ExecutionUnit.java:807)
  java.security.AccessController.doPrivileged(Native Method)
  com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:807)
  com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:423)
  com.rapidminer.operator.Operator.execute(Operator.java:1032)
  com.rapidminer.Process.executeRoot(Process.java:1378)
  com.rapidminer.Process.lambda$executeRootInPool$5(Process.java:1357)
  com.rapidminer.studio.concurrency.internal.AbstractConcurrencyContext$AdaptedCallable.exec(AbstractConcurrencyContext.java:328)
  java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
  java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
  java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
  java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

It gives no further insight and does not reference any line of my script. I updated the numpy library in case it was a compatibility problem with older versions, but that did not solve it.

numpy                     1.14.5                   pypi_0    pypi
numpy-base                1.16.4           py36hc3f5095_0    defaults
numpydoc                  0.9.1                      py_0    defaults
pandas                    0.25.3           py36ha925a31_0    defaults

Also, when I test the Python environment (an Anaconda env) from Settings > Preferences > Python Scripting in RapidMiner, all tests pass.

The processor xml from the .rmp file is:

  <operator activated="true" class="python_scripting:execute_python" compatibility="9.5.000" expanded="true" height="103" name="Execute Python" width="90" x="313" y="34">
    <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    print('Hello, world!')&#10;    # output can be found in Log View&#10;    print(type(data))&#10;&#10;    #your code goes here&#10;&#10;    #for example:&#10;    data2 = pandas.DataFrame([3,5,77,8])&#10;&#10;    # connect 2 output ports to see the results&#10;    return data, data2"/>
    <parameter key="script_file" value="%{ResourcePath}\detect_aggressive_language.py"/>
    <parameter key="notebook_cell_tag_filter" value=""/>
    <parameter key="use_default_python" value="true"/>
    <parameter key="package_manager" value="conda (anaconda)"/>
    <description align="center" color="transparent" colored="false" width="126">Detect Script</description>
  </operator>

So far, I have tried to:
1. Update the initial DataFrame (data) with my computed columns and return it.
2. Create a new DataFrame with my columns and return it, either alone or as the second return value after data.
3. Create a function (within the script) that accepts the initial DataFrame data as an argument, modifies it, and then returns it.
4. Pickle the new DataFrame, save it, load it and return it.
All these attempts resulted in the same error shown above.

My guess is that RapidMiner runs some kind of check when the processor completes, and that this check contains the code which raises the error above, so the check fails and the processor terminates.

Is there a proper way to handle and return DataFrames in RapidMiner that avoids this error, or is there anything else I could examine to find out where the problem lies?


Solution

  • To debug the problem further, I started adding the new columns to my resulting DataFrame one by one. This led me to the following discovery:

    The problem occurs when the DataFrame contains a column (a pandas.Series) whose cells are numpy.ndarrays or lists, and at least one of those arrays or lists consists entirely of zeroes (integers or floats). When the "Execute Python" processor returns, RapidMiner tries to determine whether each cell of the returned DataFrame is null or has a value. Judging by the exception message, that code presumably evaluates the truth value of the cell's contents (for instance while comparing them with None), which is not a valid check when the contents are lists or numpy ndarrays. Hence the exception message, which tells us that the truth value (and therefore whether the cell is null or not) cannot be determined when the array holds more than one element, even though the elements are all zero.
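The message itself can be reproduced outside RapidMiner. The sketch below is only an illustration of how NumPy raises this error when a multi-element array is used as a boolean; it is not RapidMiner's actual check, which is not visible in the stack trace. The helper name truth_value_error is made up for this example.

```python
import numpy as np


def truth_value_error(cell):
    """Return the ValueError message raised by bool(cell), or None if bool(cell) works.

    Hypothetical helper for illustration only; it mimics a naive
    "does this cell have a value?" test, not RapidMiner's real code.
    """
    try:
        bool(cell)
    except ValueError as exc:
        return str(exc)
    return None


# A multi-element ndarray cannot be reduced to a single True/False.
print(truth_value_error(np.zeros(3)))

# A plain Python list or a one-element array does not raise.
print(truth_value_error([0, 0, 0]))
print(truth_value_error(np.zeros(1)))
```

Note that in plain NumPy the error appears for any array with more than one element, not only for all-zero ones, so the all-zero condition observed above must come from something specific to how RapidMiner inspects the returned cells.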

    So the solution, in this case, is to ensure that when the returned DataFrame has a column containing lists or arrays, none of them consists entirely of zeroes. One could also avoid putting lists or arrays in the returned DataFrame at all. Another option is to make the proper check within the script (using array.any(), which is False for an all-zero array) and, when an array or list with all-zero elements is found, replace the whole cell's contents with None or another value which the receiver of the result will interpret as null. Of course, one could also wait for the next version of RapidMiner Studio, which might do the check properly.
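The third option (replacing all-zero cells with None before returning) could look roughly like the sketch below. It is a minimal example, not a tested RapidMiner recipe: the function names is_all_zero_sequence and sanitize_for_return are hypothetical, and the code assumes that list cells are flat, regular lists of numbers.

```python
import numpy as np
import pandas as pd


def is_all_zero_sequence(cell):
    """True if cell is a non-empty numeric list/ndarray whose elements are all zero."""
    if not isinstance(cell, (list, np.ndarray)):
        return False
    arr = np.asarray(cell)  # assumption: lists are regular (not ragged) and numeric
    if arr.size == 0 or not np.issubdtype(arr.dtype, np.number):
        return False
    return not arr.any()


def sanitize_for_return(df):
    """Return a copy of df in which all-zero list/array cells are replaced by None."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # only object columns can hold lists or arrays
        # Fill an object array element by element so pandas keeps each cell as-is.
        values = np.empty(len(out), dtype=object)
        for i, cell in enumerate(out[col]):
            values[i] = None if is_all_zero_sequence(cell) else cell
        out[col] = values
    return out
```

In the script, this would be called just before returning from rm_main, for example `return sanitize_for_return(result)`, where `result` stands for whatever DataFrame the script has built. Whether None is then shown as a missing value on the RapidMiner side is something to verify in your own process.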