I'm trying to extract all the nouns from a tokenized document and select the top 3. It isn't working, and I suspect it's because I'm not using strcmp correctly. This is my code:
sT2 = tokenizedDocument([
"a strongly worded collection of words and letters"
"another collection of words"]);
tD = tokenizedDocument(sT2);
tD = addPartOfSpeechDetails(tD);
tdetails = tokenDetails(tD);
td7 = table2cell(tdetails(:,7)); % PARTS OF SPEECH
siztd7 = size(td7);
cc = 1;
for ii = 1:siztd7
    if strcmp(td7(ii,1), 'noun') == 1
        tDNoun(cc) = tdetails(1,:);
        cc = cc + 1;
    end
end
bag = bagOfWords(tDNoun);
tb100 = topkwords(bag,3)
The variable tdetails is a MATLAB table, and you can extract the nouns directly from it using table indexing, like this:
nouns = tdetails{tdetails.PartOfSpeech == "noun", "Token"}
The first subscript matches the table variable PartOfSpeech against "noun", and the second subscript extracts only the table variable "Token". Brace indexing, i.e. {}, extracts the underlying data, in this case a string array of the words.
This can then be used directly with bagOfWords, although we must transpose the array nouns to get a row vector, as required by that function:
bag = bagOfWords(nouns')
topkwords(bag, 3)
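For reference, here is the whole thing as one script, with the table indexing split into two steps so the logical mask is visible. Treat it as a sketch rather than a definitive version: it assumes the Text Analytics Toolbox is installed, and the names str, documents, isNoun and topWords are my own, not from the question. As far as I can tell, the strcmp test in the original loop never matches because each cell of td7 holds a categorical value rather than a character vector, so comparing the PartOfSpeech column with == avoids the problem entirely.

```matlab
% Sketch only: requires Text Analytics Toolbox.
% Variable names (str, documents, isNoun, topWords) are illustrative.
str = [
    "a strongly worded collection of words and letters"
    "another collection of words"];

documents = tokenizedDocument(str);
documents = addPartOfSpeechDetails(documents);
tdetails  = tokenDetails(documents);

% Step 1: logical mask, true for rows whose part-of-speech tag is "noun".
% PartOfSpeech is categorical, so == against a string compares by category.
isNoun = tdetails.PartOfSpeech == "noun";

% Step 2: keep only the Token values for those rows (column string array).
nouns = tdetails.Token(isNoun);

% bagOfWords treats a string row vector as a single document of words,
% hence the transpose.
bag = bagOfWords(nouns');
topWords = topkwords(bag, 3)
```

Note that this pools the nouns from both documents into a single bag, which is what you want for an overall top 3. If you need counts per document, you would have to group on tdetails.DocumentNumber first.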