I am trying to calculate tf-idf values for a set of files and save them in a matrix. Before saving, I want to normalize the values to the range 0 to 1. The problem is that the first value calculated after normalization is NaN. How can I fix this?
This is what I did
double tf; //term frequency
double idf; //inverse document frequency
double tfidf = 0; //term frequency inverse document frequency
double minValue=0.0;
double maxValue=0;
File output = new File("E:/hsqldb-2.3.2/hsqldb-2.3.2/hsqldb/hsqldb/matrix.txt");
FileWriter out = new FileWriter(output);
mat= new String[termsDocsArray.size()][allTerms.size()];
int c=0; //for files
for (String[] docTermsArray : termsDocsArray) {
int count = 0;//for words
for (String terms : allTerms) {
tf = new TfIdf().tfCalculator(docTermsArray, terms);
idf = new TfIdf().idfCalculator(termsDocsArray, terms);
tfidf = tf * idf;
//System.out.print(terms+"\t"+tfidf+"\t");
//System.out.print(terms+"\t");
tfidf = Math.round(tfidf*10000)/10000.0d;
tfidfList.add(tfidf);
maxValue=Collections.max(tfidfList);
tfidf=(tfidf-minValue)/(maxValue-minValue); //Normalization here
mat[c][count]=Double.toString(tfidf);
count++;
}
c++;
}
This is the output I got
NaN 1.0 0.0 0.021
0.0 1.0 0.0 0.365 ... and others
Only the first number is NaN. The same original value occurs many times elsewhere in the matrix, and there it does not become NaN.
Please give me some ideas to fix this issue.
Thanks
You're dividing by zero. This happens when the first value added to tfidfList is 0.0: at that point maxValue and minValue are both 0.0, so the normalization evaluates 0.0/0.0, which is NaN for doubles.
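To make the cause concrete, here is a small self-contained sketch of what the first loop iteration computes. The class and method names (NaNDemo, normalize) are made up for illustration; only the formula mirrors the code in the question, and it assumes the first tf-idf value is 0.0.

```java
public class NaNDemo {
    // Same formula as in the question: (value - min) / (max - min)
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        double minValue = 0.0;
        // After the first add, the list holds only 0.0, so max == min == 0.0
        double maxValue = 0.0;
        double first = normalize(0.0, minValue, maxValue); // 0.0 / 0.0
        System.out.println(Double.isNaN(first)); // true

        // Once a larger value has been seen, the same input normalizes fine
        maxValue = 0.5;
        System.out.println(normalize(0.0, minValue, maxValue)); // 0.0
    }
}
```

This is also why later occurrences of the same value come out as 0.0 rather than NaN: by then the running maximum is greater than the minimum.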
In order to perform a real normalization, you'll probably have to compute all possible values first, then compute the min/max of these values, and afterwards, normalize all values based on these min/max values. Roughly:
// First collect all values and compute min/max on the fly
double minValue=Double.MAX_VALUE;
double maxValue=-Double.MAX_VALUE;
double[][] values = new double[termsDocsArray.size()][allTerms.size()];
int c=0; //for files
for (String[] docTermsArray : termsDocsArray) {
int count = 0;//for words
for (String terms : allTerms) {
double tf = new TfIdf().tfCalculator(docTermsArray, terms);
double idf = new TfIdf().idfCalculator(termsDocsArray, terms);
double tfidf = tf * idf;
tfidf = Math.round(tfidf*10000)/10000.0d;
minValue = Math.min(minValue, tfidf);
maxValue = Math.max(maxValue, tfidf);
values[c][count]=tfidf;
count++;
}
c++;
}
// Then, create the matrix containing the strings of the normalized
// values (although using strings here seems like a bad idea)
c=0; //for files
for (String[] docTermsArray : termsDocsArray) {
int count = 0;//for words
for (String terms : allTerms) {
double tfidf = values[c][count];
tfidf=(tfidf-minValue)/(maxValue-minValue); //Normalization here
mat[c][count]=Double.toString(tfidf);
count++;
}
c++;
}
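Note that the division can still produce NaN if every value in the matrix is identical, because then maxValue equals minValue. A minimal standalone sketch of the same two-pass idea with a guard for that case might look like the following. The class name MinMaxNormalizer and the sample numbers are invented for illustration, and mapping a constant matrix to 0.0 is just one possible convention.

```java
import java.util.Arrays;

public class MinMaxNormalizer {
    // Scales every entry into [0, 1] using the global min and max.
    // Returns a new matrix; the input is left unchanged.
    static double[][] normalize(double[][] values) {
        // Pass 1: find the global minimum and maximum
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double[] row : values) {
            for (double v : row) {
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }
        double range = max - min;

        // Pass 2: normalize each entry against the fixed min and range
        double[][] result = new double[values.length][];
        for (int i = 0; i < values.length; i++) {
            result[i] = new double[values[i].length];
            for (int j = 0; j < values[i].length; j++) {
                // Assumption: if all values are equal, map them to 0.0
                result[i][j] = (range == 0.0) ? 0.0 : (values[i][j] - min) / range;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up tf-idf values: two documents, four terms
        double[][] tfidf = { {0.0, 1.0, 0.0, 0.5}, {0.0, 2.0, 0.0, 1.0} };
        System.out.println(Arrays.deepToString(normalize(tfidf)));
        // prints [[0.0, 0.5, 0.0, 0.25], [0.0, 1.0, 0.0, 0.5]]
    }
}
```

Keeping the result as double[][] and converting to strings only when writing the file avoids parsing the values back later.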