java data-mining information-retrieval text-mining

Information Gain Calculation for a text file?

I'm working on "text categorization using information gain, PCA and a genetic algorithm". After preprocessing the documents (stemming, stopword removal, TF-IDF), I'm confused about how to proceed with the information gain part.

My output file contains the words and their TF-IDF values, one pair per line, in the form WORD - TFIDF VALUE:

together(word) - 0.235(tfidf value)

come(word) - 0.2548(tfidf value)

When I use Weka for information gain (InfoGainAttributeEval.java), it requires an .arff file as input.

Is there any way to convert a text file into .arff format, or any way to compute information gain other than Weka?

Is there any other open-source tool for calculating information gain for documents?

Solution

I found my answer. We have to generate an .arff file.

In the .arff file:

The header starts with a @RELATION line, followed by one @ATTRIBUTE line for every word present in the whole document collection after preprocessing. Each word is of type real because a TF-IDF value is a real number.

The @data section contains the TF-IDF values calculated during preprocessing, one row per document. For example, the first row contains the TF-IDF value of every word for the first document (0.0 for a word that does not occur in it), and the last column holds the document category.

@RELATION filename
@ATTRIBUTE word1 real
@ATTRIBUTE word2 real
@ATTRIBUTE word3 real
% ... and so on, one @ATTRIBUTE line per word
@ATTRIBUTE class {cacm,cisi,cran,med}

@data
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.55454479562,0.1619617,0.579562,0.5542,cisi
0.5545479562,0.27,0.554544479562,0.4479562,cisi
0.0,0.2396113617,0.44479562,0.2,cran
0.5545479562,0.27,0.554544479562,0.4479562,cran
0.5545177444479562,0.26196113617,0.0,0.0,med
0.5545479562,0.27,0.554544479562,0.4479562,med
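
A file in this layout could be generated with a few lines of plain Java. This is only a minimal sketch, not part of Weka: the class name ArffSketch, the method toArff and the in-memory representation (one word-to-TF-IDF map per document, plus a parallel list of category labels) are assumptions made for illustration. It also assumes the stemmed words contain no spaces, commas or quotes, and that no word is literally "class", since that name is used for the category attribute.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not a Weka class): builds ARFF text from TF-IDF maps.
class ArffSketch {

    // docs.get(i) maps word -> TF-IDF for document i; labels.get(i) is its category.
    static String toArff(String relation, List<Map<String, Double>> docs, List<String> labels) {
        // Vocabulary: every word seen in any document, in sorted (stable) order.
        Set<String> vocab = new TreeSet<>();
        for (Map<String, Double> doc : docs) {
            vocab.addAll(doc.keySet());
        }
        // Category values in order of first appearance.
        Set<String> classes = new LinkedHashSet<>(labels);

        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION ").append(relation).append("\n");
        for (String word : vocab) {
            sb.append("@ATTRIBUTE ").append(word).append(" real\n");
        }
        sb.append("@ATTRIBUTE class {").append(String.join(",", classes)).append("}\n");
        sb.append("\n@data\n");

        // One row per document: a value for every vocabulary word, then the category.
        for (int i = 0; i < docs.size(); i++) {
            for (String word : vocab) {
                sb.append(docs.get(i).getOrDefault(word, 0.0)).append(",");
            }
            sb.append(labels.get(i)).append("\n");
        }
        return sb.toString();
    }
}
```

The returned string can then be written to disk, for example with java.nio.file.Files.write, and saved with an .arff extension.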

After you generate this file, you can give it as input to InfoGainAttributeEval.java. This worked for me.
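
As for the "other than Weka" part of the question, information gain for a single word can also be computed directly: IG = H(class) - H(class | word). The sketch below is an illustration under stated assumptions, not Weka's method. It treats each word as a binary feature (present if its TF-IDF value is greater than 0, absent otherwise), which is a common simplification in text categorization. As far as I know, Weka discretizes numeric attributes before computing the gain, so its scores will not necessarily match these. The names InfoGainSketch, entropy and infoGain are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Weka class): information gain of one word,
// using word presence (TF-IDF > 0) as a binary feature.
class InfoGainSketch {

    // Shannon entropy in bits of a class-count distribution.
    static double entropy(Map<String, Integer> counts, int total) {
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // values[i] is the word's TF-IDF in document i; labels[i] is that document's category.
    static double infoGain(double[] values, String[] labels) {
        int n = values.length;
        if (n == 0) {
            return 0.0;
        }
        Map<String, Integer> all = new HashMap<>();
        Map<String, Integer> present = new HashMap<>();
        Map<String, Integer> absent = new HashMap<>();
        int nPresent = 0;
        for (int i = 0; i < n; i++) {
            all.merge(labels[i], 1, Integer::sum);
            if (values[i] > 0.0) {
                present.merge(labels[i], 1, Integer::sum);
                nPresent++;
            } else {
                absent.merge(labels[i], 1, Integer::sum);
            }
        }
        int nAbsent = n - nPresent;
        // H(class | word): entropy of each split, weighted by the split's size.
        double conditional = ((double) nPresent / n) * entropy(present, nPresent)
                + ((double) nAbsent / n) * entropy(absent, nAbsent);
        return entropy(all, n) - conditional;
    }
}
```

Calling infoGain once per word column of the @data section and sorting the words by the result gives a ranking that can be used to keep only the most informative words.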