Tags: python, pandas, scikit-learn, nlp, corpus

How to create a corpus from a set of text files in Python?


I have a set of document IDs (keys.csv) that I am using to get a set of text documents from a document source. I would like to collect all these text documents into a corpus for further analysis (like cosine similarity).

I am using the code below to append each text document to the corpus, but I'm not sure whether it will work. Is there a better way to create a corpus from these text documents?

keys = pandas.read_csv(keys.csv)
for i in keys:
    ID = i
    doc = function_to_get_document(ID)
    corpus = corpus.append(doc)

Solution

  • If the CSV has a column IDcol containing unique IDs, use a list comprehension; the result is a list:

    corpus = [function_to_get_document(ID) for ID in pd.read_csv('keys.csv')['IDcol']]
    

    Sample:

    import pandas as pd

    print(pd.read_csv('keys.csv'))
    #    IDcol
    # 0      1
    # 1      2
    # 2      3

    # Toy stand-in for the real lookup, for demonstration only
    def function_to_get_document(x):
        return x + 1

    corpus = [function_to_get_document(ID) for ID in pd.read_csv('keys.csv')['IDcol']]
    print(corpus)
    # [2, 3, 4]
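An explicit loop gives the same result as the list comprehension, provided two details are handled: iterating over a DataFrame directly yields its column labels rather than its rows, and `list.append` modifies the list in place and returns `None`. The following is a minimal sketch under stated assumptions: the in-memory CSV stands in for keys.csv, the column name `IDcol` follows the sample above, and `function_to_get_document` is a hypothetical placeholder that returns a string instead of querying a real document source.

```python
import io

import pandas as pd

# Stand-in for keys.csv so the sketch runs on its own (hypothetical data).
keys_csv = io.StringIO("IDcol\n1\n2\n3\n")


def function_to_get_document(doc_id):
    # Hypothetical placeholder: the real function would fetch the text
    # for doc_id from the document source.
    return f"text of document {doc_id}"


keys = pd.read_csv(keys_csv)

# Iterating over the DataFrame itself yields column labels, not IDs.
column_labels = [label for label in keys]

# Iterate over the ID column instead, and start from an empty list.
corpus = []
for doc_id in keys["IDcol"]:
    # append() mutates the list and returns None, so do not reassign its result.
    corpus.append(function_to_get_document(doc_id))

print(column_labels)
print(corpus)
```

When the real function returns text, `corpus` is a plain list of strings, which is the input shape that text vectorizers such as scikit-learn's `TfidfVectorizer` accept in `fit_transform`.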