Tags: python, dataframe, nlp, nltk, frequency-distribution

Count total number of words in a corpus using NLTK's Conditional Frequency Distribution in Python (newbie)


I need to count the number of words (word appearances) in a corpus using the NLTK package.

Here is my corpus:

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus = PlaintextCorpusReader(r'C:\DeCorpus', '.*')

Here is how I try to get the total number of words for each document:

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

(I split the strings into words manually; somehow it works better than using corpus.words(), but the problem remains the same, so that part is irrelevant.) Generally, this does the same (wrong) job:

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.words(fileids=textname)])
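
To illustrate what this actually counts (a minimal sketch with made-up tokens, not from my corpus): the second element of each pair is a word length, so the frequency distribution maps lengths to how many words have that length.

import nltk

# Made-up example tokens, just to show the shape of the result
toy_words = ["ein", "kurzer", "Beispielsatz"]
cfd_toy = nltk.ConditionalFreqDist(("toy.txt", len(w)) for w in toy_words)
print(dict(cfd_toy["toy.txt"]))   # {3: 1, 6: 1, 12: 1} -- counts per word length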

This is what I get by typing cfd_appr.tabulate():

                        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  
2022.12.06_Bild 2.txt   3  36 109  40  47  43  29  29  33  23  24  12   8   6   4   2   2   0   0   0   0   
2022.12.06_Bild 3.txt   2  42 129  59  57  46  46  35  22  24  17  21  13   5   6   6   2   2   2   0   0   
2022.12.06_Bild 4.txt   3  36 106  48  43  32  38  30  19  39  15  14  16   6   5   8   3   2   3   1   0   
2022.12.06_Bild 5.txt   1  55 162  83  68  72  46  24  34  38  27  16  12   8   8   5   9   3   1   5   1   
2022.12.06_Bild 6.txt   7  69 216  76 113  83  73  52  49  42  37  20  19   9   7   5   3   6   3   0   1   
2022.12.06_Bild 8.txt   0   2   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

But these are the numbers of words of each length. What I need is just this (one value per text: its total number of words):

2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0
dtype: float64

I.e. the sum over all word lengths (the sum of the columns, which I composed using DataFrame(cfd_appr).transpose().sum(axis=1)). (By the way, if there were some way to set a name for that column, that would also be a solution, but .rename({None: 'W. appear.'}, axis='columns') is not working, and such a solution would generally not be clear enough.)
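
For reference, that summing step looks roughly like this (a sketch assuming cfd_appr from above; giving the Series a name via rename is one possible way to label the result):

import pandas as pd

# Sum the per-length counts for each file; missing lengths are NaN,
# which is why the result comes out as float64
totals = pd.DataFrame(cfd_appr).transpose().sum(axis=1)

# Naming the Series is one way to label the eventual column
totals = totals.rename('W. appear.')
print(totals)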

So, what I need is:

                             1    
2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0

Would be grateful for help!


Solution

  • Well, here is what was actually needed:

    First, get the numbers of words of different length (just as I did before):

    cfd_appr = nltk.ConditionalFreqDist(
        (textname, num_appr)
        for textname in corpus.fileids()
        for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])
    

    Then import pandas as pd and append .to_frame(1) to the dtype: float64 Series that I got by summing the columns:

    import pandas as pd

    pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)
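
    As a side note, to_frame() also accepts a column name, so (if I'm not mistaken) the labelling question from above can be handled in the same call:

    pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame('W. appear.')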
    

    That's it. However, if somebody knows how to sum them up in the definition of cfd_appr, that would be a more elegant solution; one candidate is sketched below.
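
    A possible shortcut (a sketch, untested against the corpus above): each condition's FreqDist already knows its total number of samples via N(), so the totals can be read straight from cfd_appr, or computed without any frequency distribution at all:

    # Total words per file, read straight from the conditional frequency distribution
    totals = {fileid: cfd_appr[fileid].N() for fileid in corpus.fileids()}

    # Or without building cfd_appr at all (str.split() already splits on "\r" and "\n")
    totals = {fileid: len(corpus.raw(fileids=fileid).split())
              for fileid in corpus.fileids()}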