I'm trying to read a text file and create a term-document matrix using the textmining package. I can build the matrix when I add each line one at a time, by index. The problem is that I want to add the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.
import textmining

def term_document_matrix_roy_1():
    # Read the file; readlines() returns a list with one string per line
    with open("data_set.txt") as f:
        reading_file_line = f.readlines()
    print(reading_file_line)

    # Strip the trailing newline from every line
    reading_file_info = [item.rstrip('\n') for item in reading_file_line]
    print(reading_file_info)
    print(reading_file_info[1])  # second line
    print(reading_file_info[2])  # third line

    tdm = textmining.TermDocumentMatrix()
    # tdm.add_doc(reading_file_info)  # gives an error: this is a list, not a string
    tdm.add_doc(reading_file_info[0])
    tdm.add_doc(reading_file_info[1])
    tdm.add_doc(reading_file_info[2])

    # Print the header row followed by one row of counts per document
    for row in tdm.rows(cutoff=1):
        print(row)

term_document_matrix_roy_1()
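One way to avoid indexing each line by hand is to loop over the lines and add them one by one, since `add_doc` appears to expect a single string rather than the list that `readlines()` returns. The following is a minimal sketch under that assumption. `add_lines_as_docs` is a hypothetical helper, not part of textmining; it only requires that the object passed in has an `add_doc` method.

```python
def add_lines_as_docs(tdm, lines):
    """Add each non-empty line to `tdm` as its own document.

    Assumption: `tdm` is any object with an add_doc(str) method,
    for example a textmining.TermDocumentMatrix instance.
    Returns the number of documents added.
    """
    added = 0
    for line in lines:
        text = line.strip()  # drop the trailing newline and outer whitespace
        if text:             # skip blank lines
            tdm.add_doc(text)
            added += 1
    return added

# Intended usage (requires the textmining package):
#
#   import textmining
#   tdm = textmining.TermDocumentMatrix()
#   with open("data_set.txt") as f:
#       add_lines_as_docs(tdm, f)
#   for row in tdm.rows(cutoff=1):
#       print(row)
```

A file object can be passed directly because iterating over it yields one line at a time, so the number of lines does not need to be known in advance.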
Sample text file: "data_set.txt" contains the following lines:
Lets write some python code
Thus far, this book has mainly discussed the process of ad hoc retrieval.
Along the way we will study some important machine learning techniques.
The output should be a term-document matrix, i.e. how many times each word appears in each document. Output image: http://postimg.org/image/eidddlkld/
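To make the expected output concrete, here is a rough standard-library sketch of what such a matrix contains, using the three sample lines as documents. This is not how textmining builds it internally; `build_term_document_matrix` is a hypothetical name, and the tokenization (lowercase, letters only) is an assumption that may differ from the package's own tokenizer.

```python
import re
from collections import Counter

def build_term_document_matrix(docs):
    """Return (terms, rows): the sorted vocabulary and one count row per document."""
    # Assumed tokenization: lowercase the text and keep runs of letters only
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    # The vocabulary is every distinct term seen in any document
    terms = sorted(set().union(*counts))
    # Each row holds the count of every vocabulary term in one document
    rows = [[c[term] for term in terms] for c in counts]
    return terms, rows

docs = [
    "Lets write some python code",
    "Thus far, this book has mainly discussed the process of ad hoc retrieval.",
    "Along the way we will study some important machine learning techniques.",
]
terms, rows = build_term_document_matrix(docs)
print(terms)      # header row: the vocabulary
for row in rows:  # one row of counts per input line
    print(row)
```

Each input line yields one row, and each column corresponds to one word of the shared vocabulary.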
@Fred Thanks for the reply. I want the output to look like the image I linked. I can actually produce the same result with the following code, but I want each line as a separate matrix, not one matrix.
import textmining

with open("txt_files/input_data_set.txt") as f:
    reading_file_info = f.read()  # whole file as a single string

tdm = textmining.TermDocumentMatrix()
tdm.add_doc(reading_file_info)  # the entire file becomes one document
tdm.write_csv('txt_files/input_data_set_result.txt', cutoff=1)
for row in tdm.rows(cutoff=1):
    print(row)
What I'm trying to do is read a text file and create a term-document matrix from it.