Count total and unique words from thousands of files


I have a large collection of text files (over 5,000) containing more than 200,000 words. When I try to combine the whole collection into a single array in order to find the unique words, no output is shown (I believe this is due to the very large size of the array). The following code works fine for a small collection, e.g. 30 files, but cannot handle the very large collection. Please help me fix this problem. Thanks.

<?php
ini_set('memory_limit', '1024M');
$directory = "archive/";
$dir = opendir($directory);
$file_array = array(); 
while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {
    $contents = file_get_contents($filename);
    $text = preg_replace('/\s+/', ' ',  $contents);
    $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
    $text = explode(" ", $text);
    $text = array_map('strtolower', $text);
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
    $text = (array_diff($text,$stopwords));
    $file_array = array_merge($file_array,  $text);
  }
}
closedir($dir); 
$total_word_count = count($file_array);
$unique_array = array_unique($file_array);
$unique_word_count = count($unique_array);
echo "Total Words: " . $total_word_count."<br>";
echo "Unique Words: " . $unique_word_count;
?> 

Dataset of text files can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip


Solution

  • Instead of juggling multiple arrays, build just one, keyed by word, and increment each word's count as you insert it. This will be faster, and you also get the count per word.

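    By the way, explode() returns an empty string wherever two spaces are adjacent (for example after a punctuation character between spaces has been stripped), so you also need to add the empty string to the list of stopwords, or adjust your logic so that it is never counted.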

    <?php
    $directory = "archive/";
    $dir = opendir($directory);
    $wordcounter = array();
    while (($file = readdir($dir)) !== false) {
      if (filetype($directory . $file) == 'file') {
        $contents = file_get_contents($directory . $file);
        $text = preg_replace('/\s+/', ' ',  $contents);
        $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
        $text = explode(" ", $text);
        $text = array_map('strtolower', $text);
        // Count each word as it is seen; the array keys are the unique words.
        foreach ($text as $word) {
            if (!isset($wordcounter[$word])) {
                $wordcounter[$word] = 1;
            } else {
                $wordcounter[$word]++;
            }
        }
      }
    }
    closedir($dir); 
    
    $stopwords = array("", "a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
    foreach($stopwords as $stopword)
        unset($wordcounter[$stopword]);
    
    $total_word_count = array_sum($wordcounter);
    $unique_word_count = count($wordcounter);
    echo "Total Words: " . $total_word_count."<br>";
    echo "Unique Words: " . $unique_word_count."<br>";
    
    // bonus:
    $max = max($wordcounter);
    echo "Most used word is used $max times: " . implode(", ", array_keys($wordcounter, $max))."<br>";
    ?>
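
If memory is still tight, the same counting idea can be applied one line at a time instead of one file at a time. The following is only a minimal sketch under some assumptions, not part of the original answer: the helper name count_words() is hypothetical, every regular file directly inside archive/ is assumed to be a text file, and stopwords are skipped while counting rather than removed afterwards. It uses preg_split() with PREG_SPLIT_NO_EMPTY so that empty strings never reach the counter.

```php
<?php
// Hypothetical helper (not from the original answer): tally the words of one
// chunk of text into the running $counts array, skipping stopwords.
function count_words(string $text, array &$counts, array $stopwords): void
{
    // Same normalisation as above: lowercase, keep letters, digits, hyphens.
    $clean = preg_replace('/[^a-z0-9\-\s]/', '', strtolower($text));
    // PREG_SPLIT_NO_EMPTY drops empty tokens, so "" is never counted.
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (isset($stopwords[$word])) {
            continue; // stopword: ignore it
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
}

// Stopwords stored as array keys so that isset() is a cheap hash lookup.
$stopwords = array_flip(["a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to"]);
$counts = [];

// Assumption: the files live directly inside archive/, as in the question.
foreach (glob("archive/*") ?: [] as $path) {
    if (!is_file($path)) {
        continue; // skip sub-directories
    }
    $handle = fopen($path, 'r');
    if ($handle === false) {
        continue; // unreadable file: skip it
    }
    // Read line by line so only one line of text is held in memory at a time.
    while (($line = fgets($handle)) !== false) {
        count_words($line, $counts, $stopwords);
    }
    fclose($handle);
}

echo "Total Words: " . array_sum($counts) . "<br>";
echo "Unique Words: " . count($counts) . "<br>";
```

With this layout the only structure that grows is $counts, which holds one entry per unique word rather than one entry per word occurrence.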