Tags: python, pandas, dataframe, nlp

Count the number of occurrences of each word in a file and load into pandas


How do I count the number of occurrences of each word in a .txt file and load the result into a pandas DataFrame with columns name and count, sorted by the count column?


Solution

  • Use nltk:

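    # pip install nltk pandas
    from nltk.tokenize import RegexpTokenizer
    from nltk import FreqDist
    import pandas as pd

    # Sample text; for a real file, read its contents into `text` instead
    text = """How do I count the number of occurrences of each word in a .txt file and also load it into the pandas dataframe with columns name and count, also sort the dataframe on column count?"""

    # Split on runs of word characters, which drops punctuation (case is preserved)
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(text)

    # FreqDist is a Counter subclass, so it converts directly to a Series
    sr = pd.Series(FreqDist(words))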
    

    Output:

    >>> sr
    How            1
    do             1
    I              1
    count          3
    the            3
    number         1
    of             2
    occurrences    1
    each           1
    word           1
    in             1
    a              1
    txt            1
    file           1
    and            2
    also           2
    load           1
    it             1
    into           1
    pandas         1
    dataframe      2
    with           1
    columns        1
    name           1
    sort           1
    on             1
    column         1
    dtype: int64
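
The Series above holds the counts, but the question also asks for a DataFrame with columns name and count, sorted by count, built from a .txt file. A minimal sketch of those remaining steps follows. It swaps nltk for the standard library (`re.findall` plus `collections.Counter`), which should tokenize the same way for the `\w+` pattern. The helper names `word_counts` and `word_counts_from_file` are hypothetical and not part of the original answer, and the UTF-8 default encoding is an assumption.

```python
import re
from collections import Counter

import pandas as pd


def word_counts(text: str) -> pd.DataFrame:
    """Return a DataFrame with columns name/count, highest count first."""
    # Runs of word characters; punctuation is dropped, case is preserved
    words = re.findall(r"\w+", text)
    counts = Counter(words)
    df = pd.DataFrame(list(counts.items()), columns=["name", "count"])
    # mergesort is stable, so tied words keep their first-seen order
    df = df.sort_values("count", ascending=False, kind="mergesort")
    return df.reset_index(drop=True)


def word_counts_from_file(path: str, encoding: str = "utf-8") -> pd.DataFrame:
    """Read a text file (encoding is an assumption) and count its words."""
    with open(path, encoding=encoding) as fh:
        return word_counts(fh.read())
```

To stay with the nltk result instead, the existing Series can be reshaped directly with `sr.rename_axis("name").reset_index(name="count").sort_values("count", ascending=False)`. Lowercasing the text before tokenizing would merge variants such as "How" and "how" if case-insensitive counts are wanted.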