Find a word in an Excel file in RapidMiner


I have a process that reads a text file and passes it to a Process Documents from Data operator, which contains a Tokenize operator.

It works as expected, but when I change the source of Process Documents from Data to Read Excel, the output is empty. I suspect I am making a mistake: perhaps Read Excel cannot be connected to Process Documents from Data directly, and I have to read each column of the Excel file first and then connect it to Process Documents from Data.

Can anybody help me connect an Excel file to Process Documents from Data?

PS: My goal is to read an Excel file and show the words that are repeated more than 3 times in a column of the file.

Sample file: [image not included]


Solution

  • Since you didn't include your process or input data, may I suggest an alternative that doesn't use Documents at all?

    If your goal is to find repeated entries in a specific column of an Excel file, three operators are enough: Read Excel, Aggregate and Filter Examples.

    Use Read Excel to extract the column as an example set with a single attribute (e.g. words). Then Aggregate the words attribute with the count function, grouping by words, which gives you the count per word. Finally, use Filter Examples to keep only the words with a count of 3 or more.

    Example process (re-run the import configuration wizard for your specific setup):

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
            <parameter key="excel_file" value="D:\words.xlsx"/>
            <parameter key="imported_cell_range" value="A1:A100"/>
            <list key="annotations"/>
            <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="words.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="9.0.003" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
            <list key="aggregation_attributes">
              <parameter key="words" value="count"/>
            </list>
            <parameter key="group_by_attributes" value="words"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="count(words).ge.3"/>
            </list>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
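
For readers who want to check the count-then-filter logic outside RapidMiner, here is a minimal Python sketch of what the Aggregate and Filter Examples steps compute. It is only an illustration, not part of the process above: the function name `frequent_words`, the `min_count` parameter and the sample column are made up for this example, and the Excel import step is left out.

```python
from collections import Counter


def frequent_words(words, min_count=3):
    """Return {word: count} for words occurring at least min_count times.

    Hypothetical helper: Counter plays the role of Aggregate (count,
    grouped by word) and the comprehension plays the role of
    Filter Examples (count >= min_count).
    """
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up sample standing in for the single "words" column.
column = ["apple", "pear", "apple", "kiwi", "apple",
          "pear", "kiwi", "kiwi", "kiwi"]

print(frequent_words(column))  # prints {'apple': 3, 'kiwi': 4}
```

The default threshold mirrors the "3 or more" filter in the process. For strictly more than 3 occurrences, as the question asks, call `frequent_words(column, min_count=4)`.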