I need to count the number of words (word occurrences) in a corpus using the NLTK package.
Here is my corpus:
import nltk
from nltk.corpus import PlaintextCorpusReader

corpus = PlaintextCorpusReader(r'C:\DeCorpus', '.*')  # raw string so the backslash is not treated as an escape
Here is how I try to get the total number of words for each document:
cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])
(I split the strings into words manually; somehow it works better than using corpus.words(), but the problem remains the same, so it's irrelevant.) Generally, this does the same (wrong) job:
cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.words(fileids=textname)])
This is what I get by typing cfd_appr.tabulate():
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
2022.12.06_Bild 2.txt 3 36 109 40 47 43 29 29 33 23 24 12 8 6 4 2 2 0 0 0 0
2022.12.06_Bild 3.txt 2 42 129 59 57 46 46 35 22 24 17 21 13 5 6 6 2 2 2 0 0
2022.12.06_Bild 4.txt 3 36 106 48 43 32 38 30 19 39 15 14 16 6 5 8 3 2 3 1 0
2022.12.06_Bild 5.txt 1 55 162 83 68 72 46 24 34 38 27 16 12 8 8 5 9 3 1 5 1
2022.12.06_Bild 6.txt 7 69 216 76 113 83 73 52 49 42 37 20 19 9 7 5 3 6 3 0 1
2022.12.06_Bild 8.txt 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
But these are counts of words of different lengths. What I need is just this (each text counted once, by its total number of words):
2022.12.06_Bild 2.txt 451.0
2022.12.06_Bild 3.txt 538.0
2022.12.06_Bild 4.txt 471.0
2022.12.06_Bild 5.txt 679.0
2022.12.06_Bild 6.txt 890.0
2022.12.06_Bild 8.txt 3.0
dtype: float64
I.e. the sum over all word lengths (the sum of the columns, which I composed using DataFrame(cfd_appr).transpose().sum(axis=1)). (By the way, if there is some way to set a name for this column, that would also be a solution, but .rename({None: 'W. appear.'}, axis='columns') is not working, and such a solution would generally not be clear enough.)
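For what it's worth, a possible way to name that column (a small sketch, assuming pandas is available; the label 'W. appear.' is only illustrative) is to pass the name to to_frame():

import pandas as pd

# sum over all word lengths, then turn the Series into a one-column DataFrame with an explicit label
totals = pd.DataFrame(cfd_appr).transpose().sum(axis=1)
print(totals.to_frame('W. appear.'))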
So, what I need is:
1
2022.12.06_Bild 2.txt 451.0
2022.12.06_Bild 3.txt 538.0
2022.12.06_Bild 4.txt 471.0
2022.12.06_Bild 5.txt 679.0
2022.12.06_Bild 6.txt 890.0
2022.12.06_Bild 8.txt 3.0
Would be grateful for help!
Well, here is what was actually needed:
First, get the numbers of words of different length (just as I did before):
cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])
Then add import pandas as pd and append .to_frame(1) to the dtype: float64 Series that I got by summing the columns:
pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)
That's it. However, if somebody knows how to sum them up directly in the definition of cfd_appr, that would be a more elegant solution.
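As an aside, a minimal sketch of such an alternative (assuming the same corpus reader and the same whitespace splitting as above; the Series name 'W. appear.' is only an illustrative label) could bypass the length distribution and count the words per file directly:

import pandas as pd

# total number of words per document, counted directly from the raw text
word_totals = pd.Series(
    {textname: len(corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split())
     for textname in corpus.fileids()},
    name='W. appear.')

print(word_totals.to_frame())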