I am trying to cluster educations (degree programmes). Each data entry has a name and a description, like this:
MSc Aeronautical Engineering
The master´s programme in Aeronautical Engineering at Linköping University offers a holistic view on aircraft design. An aircraft is a complex, integrated, closely connected system of various technologies and disciplines such as: aerodynamics, structure, propulsion, actuation systems and other on-board systems. All these disciplines need to be optimized in order to achieve the function and the efficiency required in an aircraft. The programme involves a project in the later part when all these disciplines come together and challenge students to design, build and fly an aircraft, or a subscale version of it.
I create sequence files from my Lucene index:
LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(conf,
Arrays.asList(indexFilesPath), sequenceFilesPath, "name",
Arrays.asList("name", "description"));
SequenceFilesFromLuceneStorage sequenceFilefromLuceneStorage = new SequenceFilesFromLuceneStorage();
sequenceFilefromLuceneStorage.run(luceneStorageConf);
I then generate sparse vectors. I set my args to the correct paths, MaxDFSigma to 5, and sequential to true. I don't know whether these parameters are correct for my purpose.
ToolRunner.run(new SparseVectorsFromSequenceFiles(), args);
I then run the CanopyDriver to generate the input clusters for K-Means. I chose the Tanimoto distance since I read that it is good for text clustering. I set T1 to 3.1, T2 to 2.1, run clustering to false, the cluster classification threshold to 0, and run sequential to true.
CanopyDriver.run(conf,
tfidfVectorsPath,
outputPath,
new TanimotoDistanceMeasure(),
3.1,
2.1,
false,
0.0,
true);
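For reference, the Tanimoto distance itself can be sketched in a few lines. This is a plain-Java illustration, not Mahout's `TanimotoDistanceMeasure`; the class name `TanimotoSketch` and the sample vectors are made up for the example. With non-negative weights such as TF-IDF values, the result stays between 0 (identical vectors) and 1 (no shared terms), which is worth keeping in mind when choosing T1 and T2.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
final class TanimotoSketch {

    // Tanimoto distance between two sparse vectors (term index -> weight):
    // 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        double denominator = normA + normB - dot;
        if (denominator == 0.0) {
            return 0.0; // both vectors are empty: treat them as identical
        }
        return 1.0 - dot / denominator;
    }

    public static void main(String[] args) {
        // Made-up sample vectors in the same index:weight style as TF-IDF output.
        Map<Integer, Double> x = new HashMap<>();
        x.put(1, 2.735);
        x.put(4, 4.441);
        Map<Integer, Double> y = new HashMap<>();
        y.put(4, 4.441);
        y.put(30, 1.754);
        Map<Integer, Double> z = new HashMap<>();
        z.put(99, 1.0);

        System.out.println(distance(x, x)); // identical vectors: 0.0
        System.out.println(distance(x, y)); // partial overlap: strictly between 0 and 1
        System.out.println(distance(x, z)); // no shared terms: 1.0
    }
}
```

Since no pair of such vectors can be further apart than 1, thresholds above 1 never distinguish between points under this measure.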
Finally, I run K-Means with the convergence delta set to 0.001, max iterations to 10, run clustering to true, the cluster classification threshold to 0, and run sequential to true:
KMeansDriver.run(conf,
tfidfVectorsPath,
new Path(outputPath,"clusters-0-final"),
kmeansOutput,
0.001,
10,
true,
0.0,
false);
I can print out my clusters like this:
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
System.out.println("Cluster " + key.toString() + " got the following vector " + value.toString());
}
reader.close();
And this is what it prints:
Cluster 0 got the following vector wt: 1.0 distance: 0.861373565304716 vec: Acting = [1:2.735, 4:4.441, 5:4.441, 13:2.165, 25:2.224, 26:2.224, 30:1.754, 35:2.447, 36:2.735, 51:2.447, 116:2.735, 118:1.887, 174:2.480, 178:2.447, 179:2.735, 187:2.735, 205:2.447, 224:2.735, 240:2.735, 242:3.460, 248:2.447, 260:2.041, 267:1.887]
Cluster 0 got the following vector wt: 1.0 distance: 0.868019533374171 vec: Adult Learning and Global Change = [30:1.754, 34:2.447, 43:2.735, 56:2.447, 72:2.447, 80:2.735, 105:3.460, 106:2.735, 117:2.735, 142:2.447, 143:2.447, 148:2.447, 173:2.735, 176:2.447, 181:2.735, 199:2.735, 203:2.224, 214:2.447, 233:2.447, 247:2.735, 262:2.735, 268:3.460]
Cluster 0 got the following vector wt: 1.0 distance: 0.8630506879479874 vec: Agricultural Economics and Management = [8:5.469, 9:4.736, 21:2.447, 28:2.735, 29:2.735, 31:2.447, 33:2.735, 34:2.447, 39:2.735, 60:2.447, 70:2.735, 71:5.439, 94:2.447, 108:2.447, 111:2.447, 136:2.447, 149:3.460, 152:1.754, 167:2.735, 171:2.735, 189:2.447, 203:2.224, 206:2.224, 210:4.441, 242:2.447, 249:3.460, 257:2.735, 273:2.480]
Cluster 0 got the following vector wt: 1.0 distance: 0.8382953832498294 vec: Agroecology = [2:4.441, 3:2.447, 8:2.735, 9:4.736, 10:4.441, 12:2.735, 25:3.852, 26:3.145, 27:2.041, 30:1.754, 32:4.441, 44:2.447, 56:2.447, 61:2.735, 64:3.460, 69:2.735, 70:2.735, 79:2.447, 82:2.735, 83:4.441, 85:2.735, 86:4.441, 87:2.041, 93:2.447, 94:2.447, 105:2.447, 110:2.447, 118:1.887, 121:2.224, 128:2.735, 131:2.735, 133:1.887, 137:2.735, 139:1.636, 143:3.460, 144:4.441, 148:3.460, 152:1.754, 155:2.447, 165:2.735, 166:2.447, 167:2.735, 170:2.447, 171:2.735, 178:2.447, 182:2.735, 187:3.867, 189:4.894, 192:3.814, 198:3.867, 199:2.735, 202:1.636, 203:3.852, 206:2.224, 214:2.447, 215:2.447, 216:4.441, 226:3.775, 227:2.447, 228:2.041, 229:4.441, 230:2.224, 231:3.145, 237:2.224, 243:3.460, 248:2.447, 252:2.447, 254:2.735, 260:2.041, 263:2.224, 264:2.735, 267:1.887, 269:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8546104020199703 vec: Analytical Finance = [14:2.447, 15:3.867, 65:2.735, 72:2.447, 78:2.447, 89:5.439, 90:4.441, 97:3.145, 100:3.145, 133:2.669, 142:2.447, 149:2.447, 151:2.735, 183:2.447, 184:2.735, 192:2.335, 212:2.735, 268:2.447, 273:3.038]
Cluster 0 got the following vector wt: 1.0 distance: 0.8525453440258359 vec: Animal Science = [13:1.531, 16:7.021, 17:7.021, 27:2.041, 30:1.754, 42:2.735, 74:2.041, 85:2.735, 94:2.447, 98:2.447, 107:2.447, 108:2.447, 110:2.447, 111:2.447, 112:2.735, 119:4.441, 121:2.224, 133:1.887, 174:1.754, 175:5.471, 183:2.447, 189:2.447, 191:2.224, 202:2.314, 205:2.447, 207:2.669, 208:2.735, 217:4.441, 243:2.447, 251:4.441, 255:2.735, 259:3.460, 260:2.041, 261:2.447, 266:2.735, 267:1.887, 273:1.754]
Cluster 0 got the following vector wt: 1.0 distance: 0.8691568007982957 vec: Animated Film = [0:2.735, 13:1.531, 18:6.280, 35:2.447, 48:2.735, 52:3.867, 63:2.735, 117:2.735, 129:2.735, 164:2.224, 190:2.447, 232:2.735, 242:2.447, 247:2.735, 252:2.447, 253:2.735, 257:2.735, 259:3.460, 264:2.735, 267:3.269, 273:2.480]
Cluster 0 got the following vector wt: 1.0 distance: 0.845250503777627 vec: Applied English linguistics = [6:2.447, 13:1.531, 20:2.887, 23:1.754, 29:2.735, 53:2.224, 54:2.735, 63:2.735, 74:2.041, 78:5.994, 81:2.735, 88:2.447, 93:2.447, 101:2.735, 103:2.735, 129:2.735, 138:2.735, 139:1.636, 140:6.115, 146:5.439, 154:2.735, 159:2.041, 164:2.224, 170:2.447, 174:1.754, 192:2.335, 196:2.447, 200:2.735, 202:1.636, 214:2.447, 215:2.447, 223:4.441, 228:2.041, 246:2.735, 254:2.735, 263:3.145, 269:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8441577500264077 vec: Applied Mathematics Programme = [13:1.531, 20:2.887, 23:1.754, 26:2.224, 37:2.735, 47:2.447, 53:3.852, 59:5.439, 68:2.735, 72:3.460, 77:2.041, 78:2.447, 87:2.041, 88:2.447, 103:2.735, 104:4.441, 107:2.447, 114:2.735, 116:2.735, 139:2.314, 142:2.447, 152:1.754, 156:4.441, 157:8.881, 158:2.735, 159:2.887, 161:2.735, 163:2.735, 168:2.735, 182:2.735, 184:2.735, 191:4.973, 192:1.348, 193:3.460, 207:1.887, 218:2.735, 221:4.441, 227:4.894, 228:2.887, 241:2.735, 252:3.460, 260:2.041, 272:2.735, 273:4.960]
Cluster 0 got the following vector wt: 1.0 distance: 0.8214250552767353 vec: Applied Mechanics = [6:2.447, 13:1.531, 14:2.447, 20:2.887, 31:2.447, 42:2.735, 43:2.735, 46:2.735, 64:2.447, 66:2.447, 74:2.887, 77:3.536, 84:4.238, 93:2.447, 97:2.224, 107:2.447, 120:2.735, 121:2.224, 125:2.735, 127:2.735, 134:2.735, 136:2.447, 139:1.636, 155:2.447, 158:3.867, 162:3.145, 168:2.735, 174:1.754, 183:2.447, 186:2.447, 190:2.447, 192:1.907, 196:2.447, 202:1.636, 204:4.736, 213:2.735, 226:1.887, 235:4.441, 243:2.447, 256:2.735, 267:1.887]
Cluster 0 got the following vector wt: 1.0 distance: 0.8705019419490072 vec: Applied Physics = [20:2.041, 23:1.754, 25:2.224, 46:2.735, 57:2.447, 76:2.735, 77:2.041, 82:2.735, 84:2.447, 100:2.224, 162:2.224, 177:2.735, 180:6.280, 181:2.735, 192:1.348, 219:3.867, 232:2.735, 237:2.224, 238:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8549884296474971 vec: Aquatic Ecology Master Programme = [3:2.447, 6:3.460, 13:1.531, 19:2.735, 21:2.447, 22:7.692, 23:1.754, 27:2.041, 30:1.754, 33:2.735, 41:4.441, 44:2.447, 54:2.735, 55:4.441, 57:2.447, 65:2.735, 68:2.735, 69:6.698, 73:2.735, 74:2.887, 79:2.447, 87:2.887, 88:2.447, 91:2.735, 96:4.441, 97:3.145, 98:2.447, 100:2.224, 114:3.867, 118:1.887, 123:4.238, 132:2.735, 133:1.887, 135:4.441, 139:2.314, 149:3.460, 150:7.021, 152:2.480, 154:2.735, 160:4.441, 162:2.224, 164:2.224, 169:2.735, 172:2.735, 186:2.447, 192:2.335, 194:2.224, 200:2.735, 202:2.834, 203:4.448, 207:2.669, 209:2.735, 213:2.735, 219:2.735, 226:2.669, 231:3.145, 236:2.735, 239:4.441, 246:2.735, 250:2.735, 258:4.441, 271:2.735, 272:3.867]
Cluster 0 got the following vector wt: 1.0 distance: 0.8441981452499265 vec: Astronomy: Master's Degree Project = [3:2.447, 13:1.531, 27:2.041, 30:1.754, 37:2.735, 52:3.867, 57:2.447, 60:2.447, 102:4.441, 139:1.636, 147:2.735, 153:2.224, 159:2.041, 165:2.735, 172:2.735, 185:5.439, 194:3.145, 202:1.636, 204:3.867, 224:2.735, 228:2.041, 237:2.224, 245:2.735, 248:2.447, 263:2.224, 271:2.735, 273:3.508, 274:3.867]
Cluster 0 got the following vector wt: 1.0 distance: 0.8312263297298608 vec: Atmospheric Science, Master's Programme = [13:1.531, 15:2.735, 23:2.480, 24:7.021, 26:2.224, 34:4.238, 39:3.867, 56:2.447, 58:2.447, 61:2.735, 62:2.224, 64:2.447, 66:2.447, 73:3.867, 75:4.441, 79:2.447, 81:2.735, 84:2.447, 87:2.041, 99:2.735, 105:3.460, 110:2.447, 112:2.735, 118:2.669, 121:2.224, 122:2.447, 123:4.238, 130:2.447, 131:2.735, 133:1.887, 136:2.447, 139:1.636, 147:2.735, 153:2.224, 155:2.447, 159:2.041, 161:2.735, 174:2.480, 175:2.447, 179:2.735, 188:4.441, 191:2.224, 192:1.348, 195:2.735, 202:1.636, 207:3.775, 220:2.735, 225:2.735, 228:2.041, 230:2.224, 244:4.441, 249:4.238, 253:2.735, 260:2.041, 266:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8410581023430623 vec: Computer Science, Master's Programme = [1:3.867, 14:2.447, 19:2.735, 20:2.041, 23:3.038, 27:2.041, 28:3.867, 30:1.754, 31:2.447, 35:2.447, 36:2.735, 38:4.441, 40:2.735, 45:4.441, 47:7.341, 48:3.867, 49:5.469, 50:2.735, 51:2.447, 53:3.145, 58:2.447, 60:2.447, 62:4.973, 67:4.441, 74:2.887, 76:2.735, 77:2.041, 92:4.441, 95:4.441, 97:3.852, 99:2.735, 100:2.224, 101:2.735, 106:2.735, 108:3.460, 109:4.441, 111:3.460, 113:4.441, 115:4.441, 118:1.887, 122:2.447, 123:2.447, 124:3.140, 125:3.867, 126:2.735, 130:2.447, 132:2.735, 133:1.887, 134:2.735, 137:2.735, 139:1.636, 141:2.735, 143:2.447, 145:5.439, 148:2.447, 151:2.735, 152:1.754, 153:3.145, 159:2.041, 162:2.224, 164:4.448, 169:2.735, 175:3.460, 177:2.735, 178:2.447, 186:2.447, 190:3.460, 191:2.224, 192:3.015, 193:5.471, 194:2.224, 195:2.735, 196:2.447, 197:4.441, 198:2.735, 201:2.735, 202:3.272, 206:2.224, 207:5.968, 211:2.735, 212:2.735, 215:3.460, 218:3.867, 220:2.735, 225:2.735, 226:1.887, 227:2.447, 230:2.224, 231:4.973, 234:2.735, 236:4.736, 240:2.735, 249:2.447, 259:3.460, 261:2.447, 262:2.735, 263:3.145, 265:5.439, 267:2.669, 270:5.439, 273:5.546, 274:3.867]
Cluster 0 got the following vector wt: 1.0 distance: 0.8527476806601426 vec: MSc Aeronautical Engineering = [7:4.441, 11:6.280, 12:3.867, 40:2.735, 44:2.447, 50:2.735, 62:3.145, 66:4.238, 77:2.887, 98:2.447, 128:2.735, 152:1.754, 166:2.447, 170:2.447, 173:2.735, 174:1.754, 176:2.447, 192:1.907, 194:2.224, 201:2.735, 206:2.224, 226:1.887, 230:2.224, 231:3.145, 233:2.447, 245:2.735, 250:2.735, 255:2.735, 256:2.735, 261:2.447]
Cluster 0 got the following vector wt: 1.0 distance: 0.8189710533774026 vec: Master Programme in Computer Science = [0:2.735, 21:2.447, 23:1.754, 25:2.224, 47:7.341, 49:3.867, 51:2.447, 53:3.145, 58:2.447, 62:3.145, 80:2.735, 87:2.041, 91:2.735, 118:1.887, 120:2.735, 122:2.447, 126:2.735, 127:2.735, 130:2.447, 138:2.735, 140:2.735, 141:2.735, 152:2.480, 153:2.224, 163:2.735, 166:2.447, 174:1.754, 176:2.447, 192:2.335, 193:3.460, 205:2.447, 207:4.220, 208:2.735, 209:2.735, 211:2.735, 222:3.140, 226:1.887, 233:2.447, 234:4.736, 237:3.852, 238:2.735, 241:2.735, 268:2.447]
All educations end up in the same cluster. I have tried different distance measures, but it does not help, and I have tried different max iterations (up to 100) with the same result. As we can see, they all have a distance between roughly 0.82 and 0.87; is this why they end up in the same cluster? With other distance measures the span is bigger, but they are still all in the same cluster. Do I need to remove more terms that add nothing to the characteristics of a course? How do I look at my top terms?
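On the top-terms question, one possible approach is to sort each vector's entries by weight and map the indices back to words. The sketch below is plain Java and assumes the index-to-term dictionary has already been loaded into a `Map` (the vectorizer writes a dictionary next to the vectors; loading it is not shown). The class name `TopTermsSketch` and the terms in the dictionary are hypothetical placeholders, since the real dictionary is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
final class TopTermsSketch {

    // Returns the n highest-weighted terms of a sparse vector (term index -> weight).
    // Ties are broken by the lower term index so the order is deterministic.
    static List<String> topTerms(Map<Integer, Double> vector,
                                 Map<Integer, String> dictionary,
                                 int n) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(vector.entrySet());
        entries.sort((e1, e2) -> {
            int byWeight = Double.compare(e2.getValue(), e1.getValue()); // descending weight
            return byWeight != 0 ? byWeight : Integer.compare(e1.getKey(), e2.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            // Fall back to the raw index if the dictionary has no entry for it.
            result.add(dictionary.getOrDefault(e.getKey(), "#" + e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        // A few index:weight pairs in the style of the printed vectors.
        Map<Integer, Double> vector = new HashMap<>();
        vector.put(1, 2.735);
        vector.put(4, 4.441);
        vector.put(5, 4.441);
        vector.put(242, 3.460);

        // Placeholder terms; the real ones would come from the dictionary file.
        Map<Integer, String> dictionary = new HashMap<>();
        dictionary.put(1, "termA");
        dictionary.put(4, "termB");
        dictionary.put(5, "termC");
        dictionary.put(242, "termD");

        System.out.println(topTerms(vector, dictionary, 3)); // [termB, termC, termD]
    }
}
```

Terms that show up near the top of almost every document are candidates for pruning, because they contribute little to telling the courses apart.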
This is a long post with a lot of questions; I would be really glad for some help on this.
Thank you!
After reading up on my methods, I figured out that I had not tuned the parameters for the CanopyDriver enough. In the book Mahout in Action I read:
"Canopy clustering doesn’t require you to specify the number of cluster centroids as a parameter. The number of centroids formed depends only on the choice of the distance measures, T1 and T2"
Adjusting T1 and T2 turned out to give me really good results!
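To make the effect of the thresholds concrete, here is a simplified sketch of the canopy idea in plain Java. It is not Mahout's `CanopyDriver`: the class name `CanopySketch` is made up, the points are one-dimensional toy values with absolute difference as the distance, and only T2 is modelled, because in this simplified version T2 alone decides whether a point starts a new canopy (T1, the looser threshold, would only decide which points also count as members of an existing canopy).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper for illustration only; not part of Mahout.
final class CanopySketch {

    // Picks canopy centers in a single pass: a point becomes a new center
    // unless it lies within t2 of a center that was chosen earlier.
    static <P> List<P> canopyCenters(List<P> points,
                                     ToDoubleBiFunction<P, P> distance,
                                     double t2) {
        List<P> centers = new ArrayList<>();
        for (P point : points) {
            boolean stronglyBound = false;
            for (P center : centers) {
                if (distance.applyAsDouble(point, center) < t2) {
                    stronglyBound = true;
                    break;
                }
            }
            if (!stronglyBound) {
                centers.add(point);
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Toy points whose pairwise distances never exceed 1.
        List<Double> points = Arrays.asList(0.0, 0.1, 0.5, 0.9);
        ToDoubleBiFunction<Double, Double> absDiff = (a, b) -> Math.abs(a - b);

        // T2 larger than any possible distance: every point binds to the first center.
        System.out.println(canopyCenters(points, absDiff, 2.1).size()); // 1

        // T2 inside the range of observed distances: several canopies form.
        System.out.println(canopyCenters(points, absDiff, 0.3).size()); // 3
    }
}
```

In this toy setup a single canopy means a single initial centroid for K-Means, which matches the symptom of every document landing in cluster 0.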