Search code examples
phpgoogle-cloud-platformgoogle-cloud-nl

How to generate a copy of the original text including the highlighted entities with PHP using results from Google Natural Language API


I am processing some text with the Google Natural Language API and the PHP client libraries, and I would like to generate a copy of the original text that replicates the format that you can see in the screen cap from the Google Natural Language try out page where the entities are highlighted and have an index that relates them to their parent term:

Screen cap from sample output

From the results, I have the entity name, the mentions, and the beginOffset. For instance:

array (
    'name' => 'Google Cloud Natural Language API',
    'type' => 'OTHER',
    'metadata' => 
    array (
      'mid' => '/g/11bc5pm43l',
      'wikipedia_url' => 'https://pl.wikipedia.org/wiki/NAPI_(API)',
    ),
    'salience' => 0.045935749999999997417177155512035824358463287353515625,
    'mentions' => 
    array (
      0 => 
      array (
        'text' => 
        array (
          'content' => 'Google Cloud Natural Language API',
          'beginOffset' => 90,
        ),
        'type' => 'PROPER',
        'sentiment' => 
        array (
          'magnitude' => 0.90000000000000002220446049250313080847263336181640625,
          'score' => 0.90000000000000002220446049250313080847263336181640625,
        ),
      ), //and so on

I am extracting the main variables from the results to amend the original text to include the relevant information from the results:

  array (
    0 => 'Google Cloud Natural Language API', //The parent term
    1 => 'OTHER',
    2 => 0.045935749999999997417177155512035824358463287353515625,
    3 => 1.600000000000000088817841970012523233890533447265625,
    4 => 0,
    5 => 16, //This term has 16 associated mentions
    6 => 
    array ( //Array containing all of the associated mentions
      0 => 'Google Cloud Natural Language API',
      1 => 'Natural Language API',
      2 => 'Natural Language API',
      3 => 'Natural Language API',
      4 => 'Natural Language API',
      5 => 'REST API',
      6 => 'Natural Language API',
      7 => 'Natural Language API',
      8 => 'Natural Language API',
      9 => 'Natural Language API',
      10 => 'Natural Language API',
      11 => 'Natural Language API',
      12 => 'Natural Language API',
      13 => 'Natural Language API',
      14 => 'Natural Language API',
      15 => 'Natural Language API',
    ),
    7 => 
    array ( //Array containing the beginOffset of each associated mention
      0 => 90,
      1 => 196,
      2 => 321,
      3 => 463,
      4 => 2421,
      5 => 2447,
      6 => 2946,
      7 => 6167,
      8 => 6414,
      9 => 8958,
      10 => 12039,
      11 => 12168,
      12 => 12256,
      13 => 13179,
      14 => 13294,
      15 => 13802,
    ),
  ),

Till now, I have tried this:

<?php

# Open file
$myfile = fopen("sampleText.txt", "r") or die("Unable to open file!");
$data = fread($myfile, filesize("sampleText.txt"));
fclose($myfile);

echo 'Original Text: <br>';
echo $data;
echo '<br>';
echo '<br>';

# Entities occurrence List
$entitiesList = array (
      0 => 'Google Cloud Natural Language API',
      1 => 'Natural Language API',
      2 => 'Natural Language API',
      3 => 'Natural Language API',
      4 => 'Natural Language API',
      5 => 'REST API',
      6 => 'Natural Language API',
      7 => 'Natural Language API',
      8 => 'Natural Language API',
      9 => 'Natural Language API',
      10 => 'Natural Language API',
      11 => 'Natural Language API',
      12 => 'Natural Language API',
      13 => 'Natural Language API',
      14 => 'Natural Language API',
      15 => 'Natural Language API',
    );

# Samples of ofsetts
$ofsettList = array (
      0 => 90,
      1 => 196,
      2 => 321,
      3 => 463,
      4 => 2421,
      5 => 2447,
      6 => 2946,
      7 => 6167,
      8 => 6414,
      9 => 8958,
      10 => 12039,
      11 => 12168,
      12 => 12256,
      13 => 13179,
      14 => 13294,
      15 => 13802,
    );

# Size of ofsetts List
$ofsettListLenght = sizeof($ofsettList);

# Index of the entity in the returned results
$index = 1;

# Temporal values array with new formatted string
$tempAmendedEntity = [];

for($i = 0; $i < $ofsettListLenght; $i++) {
    $tempAmendedEntity[] = '('. $entitiesList[$i] . ')' . $index;
}

echo 'List of new amended  Strings';
echo '<pre>', var_export($tempAmendedEntity, true), '</pre>', "\n";
echo '<br>';
echo '<br>';

// Method 1
for($i = 0; $i < $ofsettListLenght; $i++) {
    $temp1 = str_replace(substr($data, $ofsettList[$i], strlen($entitiesList[$i])), $tempAmendedEntity[$i] , $data);    
}

echo 'Text after method 1: <br>';
echo '<pre>', var_export($temp1, true), '</pre>', "\n";
echo '<br>';
echo '<br>';


// Method 2
$keyPairArray = [];

for($i = 0; $i < $ofsettListLenght; $i++) {

    $keyPairArray[$entitiesList[$i]] = $tempAmendedEntity[$i];
}

echo 'List of key => value strings';
echo '<pre>', var_export($keyPairArray, true), '</pre>', "\n";
echo '<br>';
echo '<br>';

$temp2 = strtr($data, $keyPairArray);

echo 'Text after method 2: <br>';
echo '<pre>', var_export($temp2, true), '</pre>', "\n";

With method 1: Not all the new string values are replaced, for example (Google Cloud Natural Language API)1 and (REST API)1 are not present in the result.

With method 2:Replaces all the matched to the original string with the new one, but that includes the occurrences of the matching string which are not related with the parent term.

It would be great to be able to replace only the string which starts at a specific 'beginOffset' for the new modified string.

The text I am using for testing can be downloaded from here: sampleText.txt


Solution

  • I found this useful for I project I was working on because when you receive the answer from Google Natural Language, the JSON response contains all the mentions of an entity and the offset within the original text, but it is quite difficult to visualize where in the text is that specific mention is. So I got all the mentions returned and sorted in a key => value pair array and then used this two functions to reconstruct the original text including the offset of each word as a unique identifier, to easily find where is that word within the text.

    $originalString is the entity value from the key => value entity array, and $originalOfsett is the key. Then I formed a new string including an HTML span tag for styling:

    function stringReplacementInfo($originalString, $originalOfsett) {
    
        $modifiedString = " <span id=\"entity\">(" . $originalString . ")</span><span id=\"index\">" . $originalOfsett . "</span> ";
        replaceString($originalString, $originalOfsett, $modifiedString);
    }
    

    then, I replace the modified string in the original text one by one:

    function replaceString($originalString, $originalOfsett, $modifiedString) {
    
        global $data;
    
        $originalStringLenght = mb_strlen($originalString, '8bit');
    
        $preChunk = '';
        $postChunk = '';
        $chunk = '';
    
        $preChunk = substr($data, 0, $originalOfsett);
        $chunk = substr($data, $originalOfsett, $originalStringLenght);
        $postChunk = substr($data, $originalOfsett + $originalStringLenght);
        $data = $preChunk . $modifiedString . $postChunk;
    }
    

    and then I used a bit of CSS to make it look a bit more like the results from the Natural Language API web test version:

    #index{
        font-size: 0.7em;
        color: #3d1766;
    }
    

    making it easier to find every single returned mention of an entity by its offset value, and the final result looks like this:

    enter image description here

    Hope this helps someone else.