Suppose I have a dataset I want to run a Mahout clustering job on. I want each data point to have a unique identifier, such as an ID number. I don't want to append the ID to the vector as this way it will be included in the clustering calculations. How can I include an identifier in the data without the algorithm including the ID number in its calculations? Is there a way to have the input be a key-value pair where the key is the ID and the value is the Vector I want to run the algorithm on?
Alison before worrying about this, see the output first. Many times, you have lines of assignedCLusterIDs, where line orders in input and output files are the same. For example, the node in the first line of your input file will be in the first line of the output file. So you can keep ids in a separate file, their vectors in the input file. Then you can combine the separate file and the output file to see which node is assigned which cluster.