
From the output of the Cross Distances operator in RapidMiner, how do I find the row number(s) of the 'Request set' in the 'Reference set'?


I am new to RapidMiner Studio and am still learning its operators. While working with it I got stuck on the following issue:

  1. I have a data set of 100 rows, which I pass to the 'Filter Example Range' operator.
  2. The 'Filter Example Range' operator outputs an 'Example set' and the 'Original set'.
  3. Both outputs of 'Filter Example Range' are used as inputs to the 'Cross Distances' operator. The 'Request set' is the 'Example set' output, filtered with first example: 5 and last example: 5 (that is, row 5 of the actual data). The 'Reference set' is the 'Original set' output, containing all 100 rows.
  4. The 'Cross Distances' operator produces three outputs: the 'result set', plus the 'Request set' and the 'Reference set' (these two are the inputs that were supplied).

Now, given the output of the 'Cross Distances' operator, I want to know which row number of the supplied 'Reference set' the 'Request set' corresponds to.

Is there a way to compare the two sets in the 'Execute R' operator? If not, could someone please suggest an alternative?


Solution

  • The Cross Distances operator needs an id attribute and will add one if this is not present in the input example sets. The id attribute is a special attribute and is not used to calculate distances; only regular attributes are used for this. If the input example set contains an attribute called id that is regular, the operator changes this to be special thereby excluding it from the distance calculation.

    The output lists one distance per pair of examples, and each pair is identified by the ids from the two inputs.

    Suppose the output looks like this (using the iris data set, with the fifth example selected as the request input and all the rest as the document input):

    request document distance
    id_5    id_5     0.0
    id_5    id_1     0.141
    

    This means that id_5 in the request and id_5 in the document are a distance of 0 apart, and that id_5 in the request and id_1 in the document are 0.141 apart.
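As a rough illustration of how the ids lead back to a row number, here is a minimal Python sketch. It assumes the ids have the form id_N, where N is the 1-based row position in the original data, and that the ids were assigned before filtering, so the request example keeps its original id (as in the output above). The function name `row_number_from_id` and the `result_rows` list are hypothetical and not part of RapidMiner; the rows are simply copied from the example output.

```python
def row_number_from_id(example_id: str) -> int:
    """Extract N from an id of the assumed form 'id_N'."""
    prefix, _, number = example_id.rpartition("_")
    if prefix != "id" or not number.isdigit():
        raise ValueError(f"unexpected id format: {example_id!r}")
    return int(number)


# Rows copied from the example output: (request, document, distance).
result_rows = [
    ("id_5", "id_5", 0.0),
    ("id_5", "id_1", 0.141),
]

# The request example's row in the reference set is the pair whose
# document id equals the request id (that pair has distance 0).
matches = [row_number_from_id(doc) for req, doc, _ in result_rows if req == doc]
print(matches)  # [5]
```

If the ids were instead generated separately for each input, the request id would no longer identify the original row, and looking for the pair with distance 0 would be the fallback (with the caveat that duplicate rows also have distance 0).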

    For id_1 and id_5 in the iris data set, the data is as follows.

    id    a1    a2    a3    a4
    id_1  5.1   3.5   1.4   0.2
    id_5  5.0   3.6   1.4   0.2
    

    The Euclidean distance between them is

    sqrt((5.1-5.0)^2 + (3.5-3.6)^2 + (1.4-1.4)^2 + (0.2-0.2)^2)

    which is sqrt(0.01 + 0.01 + 0 + 0)

    that is, sqrt(0.02), which comes to approximately 0.141.
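The arithmetic above can be checked with a short Python sketch. It assumes plain Euclidean distance over the four regular attributes, which is what the formula above computes; the helper name `euclidean_distance` is only illustrative.

```python
import math

# Attribute values a1..a4 for the two iris examples shown above.
id_1 = [5.1, 3.5, 1.4, 0.2]
id_5 = [5.0, 3.6, 1.4, 0.2]


def euclidean_distance(x, y):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


distance = euclidean_distance(id_1, id_5)
print(round(distance, 3))  # 0.141
```

The id column is left out of the lists on purpose, mirroring the point that special attributes are not used in the distance calculation.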