python scikit-learn tf-idf

Extracting n-grams from txt only returns the first lines


I'm a total newbie in ML and everything in it.

I have a ~15k log, and my goal is to extract 3- to 8-grams from it. The code I'm using is partially adapted from this question.


    import pandas as pd
    import sklearn.feature_extraction.text

    df = pd.read_fwf(r'C:\path\to\my\LOG.txt')
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(3,8))
    vect.fit(df)
    for w in vect.get_feature_names_out():
        print(w)

The code runs, but it doesn't "iterate" over the whole txt. The output contains only the n-grams extracted from the first 2-3 lines of the log. How can I read the entire document and extract all of its n-grams?

EXTRA QUESTION: Since the final goal is to extract the n-grams and build a tf-idf model on them, is it a problem that my log is a TXT instead of a CSV? I have variable-length lines, so I guess CSV is not feasible.


Solution

  • Use a for loop on a file object to read it line-by-line. Use with open(...) to let a context manager ensure that the file is closed after reading:

    with open("log.txt") as infile:
        for line in infile:
            print(line)