python, matrix, term-document-matrix

Creating a Term Document Matrix from Text File


I'm trying to read a text file and create a term-document matrix using the textmining package. I can build the matrix when I add the file to it one line at a time, but I want to pass in the whole file at once. What am I missing in the following code? Thanks in advance for any suggestions.

import textmining

def term_document_matrix_roy_1():
    # Read the whole file as a list of lines and strip the trailing newlines.
    with open("data_set.txt") as f:
        reading_file_line = f.readlines()  # entire content, returned as a list
    print(reading_file_line)
    reading_file_info = [item.rstrip('\n') for item in reading_file_line]
    print(reading_file_info)
    print(reading_file_info[1])  # second line
    print(reading_file_info[2])  # third line

    # Build the matrix, adding each line as its own document.
    tdm = textmining.TermDocumentMatrix()
    # tdm.add_doc(reading_file_info)  # gives an error because readlines() returns a list
    tdm.add_doc(reading_file_info[0])
    tdm.add_doc(reading_file_info[1])
    tdm.add_doc(reading_file_info[2])

    # Print every row, keeping words that appear in at least one document.
    for row in tdm.rows(cutoff=1):
        print(row)

term_document_matrix_roy_1()

Sample text file: "data_set.txt" contains the following:

Lets write some python code

Thus far, this book has mainly discussed the process of ad hoc retrieval.

Along the way we will study some important machine learning techniques.

The output should be a term-document matrix, that is, a count of how many times each word appears. Output image: http://postimg.org/image/eidddlkld/
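To make the expected output concrete, here is a minimal standard-library sketch of a term-document matrix for the three sample lines. It is not how textmining works internally: the `tokenize` and `term_document_rows` helpers are hypothetical names, and the tokenizer (lowercase, letters only) is an assumption, so the exact columns may differ from what the package produces.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


def term_document_rows(docs):
    # Count the words of each document separately.
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The header row is the sorted vocabulary across all documents.
    vocab = sorted(set(term for c in counts for term in c))
    rows = [vocab]
    # One row of counts per document, aligned with the header.
    for c in counts:
        rows.append([c[term] for term in vocab])
    return rows


docs = [
    "Lets write some python code",
    "Thus far, this book has mainly discussed the process of ad hoc retrieval.",
    "Along the way we will study some important machine learning techniques.",
]
rows = term_document_rows(docs)
for row in rows:
    print(row)
```

The first row lists the words and each following row holds the counts for one line of the file, so a word such as "some" gets a 1 in the rows for the first and third lines and a 0 in the second.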


Solution

  • @Fred Thanks for the reply. I want to show the result as it appears in the image file. I can actually produce the same result with the following code, but I want each line as a separate matrix, not one matrix.

    with open("txt_files/input_data_set.txt") as f:
        reading_file_info = f.read()  # whole file as a single string
        tdm = textmining.TermDocumentMatrix()
        tdm.add_doc(reading_file_info)  # the entire file is added as one document

        tdm.write_csv('txt_files/input_data_set_result.txt', cutoff=1)
        for row in tdm.rows(cutoff=1):
            print(row)
    

    What I'm trying to do is read a text file and create a term-document matrix.
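One way to get a row per line without hard-coding indices is to loop over the file and call `add_doc` once per line. The sketch below assumes that is the goal. `add_lines_as_docs` and `build_from_file` are hypothetical helper names, and skipping blank lines is an assumption about the input file. Only `TermDocumentMatrix()`, `add_doc`, and `rows(cutoff=...)` come from the code in the question.

```python
def add_lines_as_docs(tdm, lines):
    """Add each non-blank line to tdm as its own document; return how many were added."""
    added = 0
    for line in lines:
        text = line.strip()
        if text:  # assumption: blank lines are not documents
            tdm.add_doc(text)
            added += 1
    return added


def build_from_file(path):
    # textmining is a third-party package; it is imported here so the helper
    # above can be used and tested without it.
    import textmining

    tdm = textmining.TermDocumentMatrix()
    with open(path) as f:
        add_lines_as_docs(tdm, f)  # a file object iterates line by line
    return tdm
```

With the sample file, `for row in build_from_file("data_set.txt").rows(cutoff=1): print(row)` should then print the header followed by one row per non-blank line.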