Tags: java, double, normalization

NaN when normalizing double values


I am trying to calculate TF-IDF values for files and save them into a matrix. I want to normalize the TF-IDF values between 0 and 1 first, but I have a problem: the first value calculated after normalization is NaN. How can I fix this?

This is what I did:

    double tf; //term frequency
    double idf; //inverse document frequency
    double tfidf = 0; //term frequency inverse document frequency 
    double minValue=0.0;
    double maxValue=0;
    File output = new File("E:/hsqldb-2.3.2/hsqldb-2.3.2/hsqldb/hsqldb/matrix.txt");
    FileWriter out = new FileWriter(output); 
    mat= new String[termsDocsArray.size()][allTerms.size()];
    int c=0; //for files
    for (String[] docTermsArray : termsDocsArray) {
        int count = 0;//for words
        for (String terms : allTerms) {
            tf = new TfIdf().tfCalculator(docTermsArray, terms);
            idf = new TfIdf().idfCalculator(termsDocsArray, terms);
            tfidf = tf * idf;           
            //System.out.print(terms+"\t"+tfidf+"\t");
            //System.out.print(terms+"\t");

            tfidf = Math.round(tfidf*10000)/10000.0d;
            tfidfList.add(tfidf);
            maxValue=Collections.max(tfidfList);
            tfidf=(tfidf-minValue)/(maxValue-minValue);  //Normalization here
            mat[c][count]=Double.toString(tfidf);
            count++;   
        }     
        c++;
    }

This is the output I got:

NaN 1.0  0.0  0.021
0.0 1.0 0.0 0.365 ... and others

Only the first number is NaN. Also, this value is a number that appears many times elsewhere in the matrix, and in those other places it is not NaN.

Please give me some ideas to fix this issue.

Thanks


Solution

  • You're dividing by zero. This happens when the first value added to tfidfList is 0.0: at that point minValue and maxValue are both 0.0, so the normalization computes (0.0 - 0.0) / (0.0 - 0.0), and in Java 0.0 / 0.0 evaluates to NaN (a short demonstration follows the code below).

    In order to perform a proper normalization, you'll have to compute all the values first, then find their minimum and maximum, and afterwards normalize every value based on that min/max. Roughly:

    // First collect all values and compute min/max on the fly
    double minValue=Double.MAX_VALUE;
    double maxValue=-Double.MAX_VALUE;
    double[][] values = new double[termsDocsArray.size()][allTerms.size()];
    int c=0; //for files
    for (String[] docTermsArray : termsDocsArray) {
        int count = 0;//for words
        for (String terms : allTerms) {
            double tf = new TfIdf().tfCalculator(docTermsArray, terms);
            double idf = new TfIdf().idfCalculator(termsDocsArray, terms);
            double tfidf = tf * idf;           
            tfidf = Math.round(tfidf*10000)/10000.0d;
            minValue = Math.min(minValue, tfidf);
            maxValue = Math.max(maxValue, tfidf);
            values[c][count]=tfidf;
            count++;   
        }     
        c++;
    }
    
    // Then, create the matrix containing the strings of the normalized 
    // values (although using strings here seems like a bad idea)
    c=0; //for files
    for (String[] docTermsArray : termsDocsArray) {
        int count = 0;//for words
        for (String terms : allTerms) {
            double tfidf = values[c][count];
            tfidf=(tfidf-minValue)/(maxValue-minValue);  //Normalization here
            mat[c][count]=Double.toString(tfidf);
            count++;   
        }     
        c++;
    }
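
    As a side note, the reason you see NaN rather than an exception is that Java's floating-point division defines 0.0 / 0.0 as Double.NaN. The snippet below is a minimal, self-contained sketch of that behavior, together with a guarded min-max normalization (the normalize helper and the class name are just for illustration) that also covers the degenerate case where every value is identical, i.e. maxValue == minValue:

    public class NaNDemo {

        // Min-max normalization with a guard for the degenerate case
        // where all values are identical (max == min).
        static double normalize(double value, double min, double max) {
            double range = max - min;
            if (range == 0.0) {
                return 0.0; // avoid 0.0 / 0.0, which would be NaN
            }
            return (value - min) / range;
        }

        public static void main(String[] args) {
            // Floating-point 0.0 / 0.0 does not throw; it yields NaN.
            System.out.println(0.0 / 0.0);                 // NaN
            System.out.println((0.0 - 0.0) / (0.0 - 0.0)); // NaN

            // With the guard, the degenerate case is handled explicitly.
            System.out.println(normalize(0.0, 0.0, 0.0));  // 0.0
            System.out.println(normalize(0.5, 0.0, 1.0));  // 0.5
        }
    }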