
Parsing a string into parts, only consecutive words, not a power set


I'm trying to write a search query to find articles in a database. I would like to take the search string the user enters and derive a specific set of possible search terms from it. If the user entered the search string "listing of average salaries in germany for 2011", I would like to generate a list of terms to hunt for. I figured I would look for the whole string and for partial strings of consecutive words. That is, I want to search for "listing of average salaries" and "germany for 2011" but not "listing germany 2011".

So far I have this bit of code to generate my search terms:

  $searchString = "listing of average salaries in germany for 2011";
  $searchTokens = explode(" ", $searchString);
  $searchTerms = array($searchString);

  $tokenCount = count($searchTokens);
  for($max=$tokenCount - 1; $max>0; $max--) {
      $termA = "";
      $termB = "";
      for ($i=0; $i < $max; $i++) {
          $termA .= $searchTokens[$i] . " ";
          $termB .= $searchTokens[($tokenCount-$max) + $i] . " ";
      }
      array_push($searchTerms, $termA);
      array_push($searchTerms, $termB);
  }

  print_r($searchTerms);

and it's giving me this list of terms:

  • listing of average salaries in germany for 2011
  • listing of average salaries in germany for
  • of average salaries in germany for 2011
  • listing of average salaries in germany
  • average salaries in germany for 2011
  • listing of average salaries in
  • salaries in germany for 2011
  • listing of average salaries
  • in germany for 2011
  • listing of average
  • germany for 2011
  • listing of
  • for 2011
  • listing
  • 2011

What I'm not sure how to get are the missing terms:

  • of average salaries in germany for
  • of average salaries in germany
  • average salaries in germany for
  • of average salaries in
  • average salaries in germany
  • salaries in germany for
  • etc...

Update

I'm not looking for a "power set", so answers like this or this aren't valid. For example, I do not want these in my list of terms:

  • average germany
  • listing salaries 2011
  • of germany for

I'm looking for consecutive words only.


Solution

  • You want every contiguous run of words in the exploded string, not just the runs anchored at the start or end. For each offset, starting at 0, slice the array with every length from 1 up to count - offset:

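    $search_string = 'listing of average salaries in germany for 2011';
    $search_array = explode(' ', $search_string);
    $count = count($search_array);

    // Collects every run of consecutive words.
    $search_matches = array();
    // Shortest run to generate, in words.
    $min_length = 1;

    // $offset is the index of the first word in the run.
    for ($offset = 0; $offset < $count; $offset++) {
        // $length grows until the run reaches the last word.
        for ($length = $min_length; $length <= $count - $offset; $length++) {
            $match = array_slice($search_array, $offset, $length);
            $search_matches[] = join(' ', $match);
        }
    }

    print_r($search_array);
    print_r($search_matches);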
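
The question lists its terms longest first, while the nested offset/length loops produce them grouped by starting word. As a rough sketch of one way to get the longest-first order, the same idea can be wrapped in a function that iterates over lengths in the outer loop. The function name `consecutiveTerms` and its `$minLength` parameter are made up for this illustration and are not part of the original answer; splitting on whitespace with `preg_split` instead of `explode` is also an assumption, made so that repeated spaces do not yield empty tokens.

```php
<?php
// Hypothetical helper (not from the original answer): returns every run of
// consecutive words in $searchString, longest runs first.
function consecutiveTerms(string $searchString, int $minLength = 1): array
{
    // Split on runs of whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/', trim($searchString), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($tokens);
    $minLength = max(1, $minLength); // a run needs at least one word
    $terms = array();

    // Outer loop: run length, from the full string down to $minLength words.
    for ($length = $count; $length >= $minLength; $length--) {
        // Inner loop: slide a window of $length words from left to right.
        for ($offset = 0; $offset + $length <= $count; $offset++) {
            $terms[] = implode(' ', array_slice($tokens, $offset, $length));
        }
    }

    return $terms;
}

$terms = consecutiveTerms('listing of average salaries in germany for 2011');
print_r($terms);
```

For a string of n words this yields n(n+1)/2 terms, so the eight-word example produces 36. Raising `$minLength` is one way to drop one-word terms such as "of" or "in" if they make the search too broad.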