Suppose I have the following columns for a csv that I read through a 'File Reader' node:
id, name, city, income
After reading it, I notice that the column 'city' contains a huge number of unique values. I want to:
Example:
id, name, city, income
1, Person 1, New York, 100.000
2, Person 2, Toronto, 90.000
3, Person 3, New York, 50.000
4, Person 4, Seattle, 60.000
Choosing k to be 1, I want to produce the following table:
id, name, city, income
1, Person 1, New York, 100.000
2, Person 2, Other, 90.000
3, Person 3, New York, 50.000
4, Person 4, Other, 60.000
It happens because 'New York' is the '1' most frequent value for 'city' in the original table.
Do you know how I can do that using Knime?
Thanks a lot!
You can use the CSV Reader to read the data. With the Statistics and Row Filter nodes you can find the k most frequent values. From those, you can create a collection cell using GroupBy. With that collection value, you can use Rule Engine with a similar ruleset:
$city$ IN $most frequent cities$ => $city$
TRUE => "Other"