Tags: string, matlab, sorting, cell-array

Cell array of strings not fully sorted?


I have the following Matlab function I'm working on:

function [data] = ReadAndCountWords(fileName)
fid = fopen(fileName);
data = textscan(fid, '%s');
data = sort(data{1});

for i = 1:length(data)
    str = data{i};
    str = lower(str(isstrprop(str, 'alpha')));
    disp(str);
end
fclose(fid);
end

Right now I am passing in a text document containing The Gettysburg Address, and I want to print out the words contained in that file in order of how many times the word occurs. To get the word count I figured I would sort the cell array and then do a string comparison within my loop since it seemed simple enough. So I tried sorting my cell array with both sortrows() and sort(), but the results are the same:

but
four
god
it
it
it
now
the
the
we
we
a
a
a
a
a
a
a
above
add
advanced
ago
all
altogether
and
and
and
and
and
and
any
are
are
are
as
battlefield
be
be
before
birth
brave
brought
but
by
can
can
cannot
cannot
cannot
cause
civil
come
conceived
conceived
consecrate
consecrated
continent
created
dead
dead
dead
dedicate
dedicate
dedicated
dedicated
dedicated
dedicated
detract
devotion
devotion
did
died
do
earth
endure
engaged
equal
far
far
fathers
field
final
fitting
for
for
for
for
for
forget
forth
fought
freedom
from
from
full
gave
gave
government
great
great
great
ground
hallow
have
have
have
have
have
here
here
here
here
here
here
here
here
highly
honored
in
in
in
in
increased
is
is
is
it
it
larger
last
liberty
little
live
lives
living
living
long
long
measure
men
men
met
might
nation
nation
nation
nation
nation
never
new
new
nobly
nor
not
not
note
of
of
of
of
of
on
on
or
or
our
our
people
people
people
perish
place
poor
portion
power
proper
proposition
rather
rather
remaining
remember
resolve
resting
say
score
sense
seven
shall
shall
shall
should
so
so
so
struggled
take
task
testing
that
that
that
that
that
that
that
that
that
that
that
that
that
the
the
the
the
the
the
the
the
the
their
these
these
they
they
they
this
this
this
this
those
thus
to
to
to
to
to
to
to
to
under
unfinished
us
us
us
vain
war
war
we
we
we
we
we
we
we
we
what
what
whether
which
which
who
who
who
will
work
world
years

Why are those first 11 words out of order? I did some research on it and couldn't find anyone having the same problem, and the Matlab documentation seems to be doing it the same way I am. Any suggestions?


Solution

  • To count the number of occurrences of each word, you can also use unique as follows:

    data = {'and' 'And, ' 'cut' 'be.' 'dear' 'be' 'eggs' 'egg'}; %// example data
    data = regexprep(lower(data), '[^a-z]', ''); %// make lower and remove special chars
    [words, ~, labels] = unique(data);
    count = histc(labels, 1:max(labels));
    

    Result:

    words = 
        'and'    'be'    'cut'    'dear'    'egg'    'eggs'
    
    count =
         2     2     1     1     1     1
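
The stray words at the top of the question's output are consistent with sorting before cleaning: sort orders strings by character code, so capitalized tokens such as 'But,' or 'The' land ahead of every lowercase word, and they only look out of place once the loop lowercases them for display. The sketch below normalizes first, then counts and ranks by frequency. The sample tokens and the variable names raw, clean and order are made up for illustration, and accumarray is used here in place of histc as one possible way to tally the labels.

```matlab
% Hypothetical sample tokens, shaped like the output of textscan(fid, '%s').
raw = {'we'; 'But,'; 'a'; 'The'; 'the'; 'a'; 'We'; 'nation.'};

% Sorting the raw tokens compares character codes, so the capitalized
% tokens ('But,', 'The', 'We') come before all lowercase ones.
rawSorted = sort(raw);

% Normalize first: lowercase, then drop everything except a-z.
clean = regexprep(lower(raw), '[^a-z]', '');
clean = clean(~cellfun(@isempty, clean));   % discard tokens that became empty

% unique returns the distinct words in alphabetical order;
% labels maps each token to its index in words.
[words, ~, labels] = unique(clean);
counts = accumarray(labels(:), 1);          % occurrences per distinct word

% Rank by count, most frequent first.
[counts, order] = sort(counts, 'descend');
words = words(order);

for k = 1:numel(words)
    fprintf('%-10s %d\n', words{k}, counts(k));
end
```

To use this in the function from the question, raw would be data{1} as returned by textscan, with the call to sort on the raw tokens dropped.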