I'm trying to extract all words from a string into an array, but I am having some problems with spaces (&nbsp;).
This is what I do:
//Clean data to text only
$data = strip_tags($data);
$data = htmlentities($data, ENT_QUOTES, 'UTF-8');
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
$data = htmlspecialchars_decode($data);
$data = mb_strtolower($data, 'UTF-8');
//Clean up text from special chrs I don't want as words
$data = str_replace(',', '', $data);
$data = str_replace('.', '', $data);
$data = str_replace(':', '', $data);
$data = str_replace(';', '', $data);
$data = str_replace('*', '', $data);
$data = str_replace('?', '', $data);
$data = str_replace('!', '', $data);
$data = str_replace('-', ' ', $data);
$data = str_replace("\n", ' ', $data);
$data = str_replace("\r", ' ', $data);
$data = str_replace("\t", ' ', $data);
$data = str_replace("\0", ' ', $data);
$data = str_replace("\x0B", ' ', $data);
$data = str_replace("&nbsp;", ' ', $data);
//Clean up duplicated spaces
do {
    $data = str_replace('  ', ' ', $data);
} while(strpos($data, '  ') !== false);
//Make array
$clean_data = explode(' ', $data);
echo "<pre>";
var_dump($clean_data);
echo "</pre>";
This outputs:
array(58) {
[0]=>
string(5) " "
[1]=>
string(5) " "
[2]=>
string(11) "anläggning"
[3]=>
string(3) "med"
[4]=>
string(3) "den"
[5]=>
string(10) "erfarenhet"
[6]=>
string(3) "som"
}
If I check the source of the output, I see that the first 2 array values are &nbsp;.
No matter how I try, I can't remove this from the string. Any ideas?
UPDATE:
After some tweaking of the code, I managed to get the following output:
array(56) {
[0]=>
string(1) "�" //Notice the change: instead of string length 5 it now says 1, but it's still garbage.
[1]=>
string(1) "�"
[2]=>
string(11) "anläggning"
[3]=>
string(3) "med"
[4]=>
string(3) "den"
[5]=>
string(10) "erfarenhet"
[6]=>
string(3) "som"
[7]=>
string(5) "finns"
[8]=>
string(4) "inom"
Thanks!
ANSWER (for lazy people):
Even though this is a slightly different approach to the problem, and it never really answers why I had the problems above (like leftover &nbsp; and other weird extra spaces), I like it, and it is a lot better than my original code.
Thanks to all who contributed to this!
//Clean data to text only
$data = strip_tags($data);
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
$data = htmlspecialchars_decode($data);
$data = mb_strtolower($data, 'UTF-8');
//Clean up text from special chrs
$data = str_replace(array("-"), ' ', $data);
$clean_data = str_word_count($data, 1, 'äöå');
echo "<pre>";
var_dump($clean_data);
echo "</pre>";
OK, the only thing you have to do is replace &nbsp; with a space, as you already do (only if the string really still contains &nbsp;; check @Andy E's answer to make sure that your data does not contain any HTML entities):
$data = str_replace("&nbsp;", ' ', $data);
Then you can use str_word_count to get the words:
$words = str_word_count($data, 1, 'äöåÄÖÅ');
P.S.: What is the point of calling htmlentities first and then reverting it with html_entity_decode anyway?
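One possible reason the entity is hard to get rid of: once html_entity_decode has run with UTF-8 as the target charset, &nbsp; is no longer the six-character entity but the character U+00A0 (bytes 0xC2 0xA0), so neither a replace on "&nbsp;" nor a replace on an ordinary space matches it. The following is only a sketch under that assumption; the sample string is made up for illustration and the script must be saved as UTF-8.

```php
<?php
// Assumption: the input still contains literal &nbsp; entities.
$raw = '&nbsp;&nbsp;anläggning med den';

// html_entity_decode() turns each &nbsp; into U+00A0, which is the
// two-byte sequence 0xC2 0xA0 in UTF-8, not an ASCII space.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// Replace the non-breaking space with a regular space, then trim.
$normalized = trim(str_replace("\xC2\xA0", ' ', $decoded));

// Same call as in the answer above.
$words = str_word_count($normalized, 1, 'äöåÄÖÅ');
print_r($words);
```

If the data was decoded into a single-byte charset instead, the non-breaking space would be the single byte 0xA0, which might explain the one-character garbage values in the question's update, but that is a guess.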
Update: Example:
$str = ' anläggning med den erfahrenhet som åååÅ ÅÅ';
print_r(str_word_count($str, 1, 'äöåÄÖÅ'));
prints
Array
(
[0] => anläggning
[1] => med
[2] => den
[3] => erfahrenhet
[4] => som
[5] => åååÅ
[6] => ÅÅ
)
Reading documentation helps :)
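As an alternative to listing every extra letter for str_word_count, a Unicode-aware split may be worth considering. This is a different technique from the one above, shown only as a sketch: extract_words is a hypothetical helper name, the sample input is invented, and it assumes the input is valid UTF-8 and that the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of the original question or answer.
function extract_words(string $text): array
{
    // Strip tags and decode entities (&nbsp; becomes U+00A0).
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    $text = mb_strtolower($text, 'UTF-8');

    // Split on every run of characters that is neither a letter nor a digit.
    // The /u modifier makes the pattern operate on UTF-8 code points, so
    // U+00A0, punctuation and ordinary whitespace all act as separators.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $words === false ? [] : $words;
}

$result = extract_words('<p>&nbsp;&nbsp;Anläggning med-den erfarenhet, som finns!</p>');
print_r($result);
```

With this approach no separate cleanup of commas, dashes or duplicated spaces is needed, at the cost of also splitting on apostrophes, which str_word_count keeps inside words.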