I am trying to sanitize a string and have ended up with the following:
Characterisation of the arsenic resistance genes in lt i gt Bacillus lt i gt sp UWC isolated from maturing fly ash acid mine drainage neutralised solids
I am trying to remove the lt, i and gt fragments, which are the remains of HTML entities that are not being stripped. What would be the best way to approach this, or is there another solution I could look at?
Here is my current solution for now:
/**
 * @return string
 */
public function getFormattedTitle()
{
    $string = preg_replace('/[^A-Za-z0-9\-]/', ' ', filter_var($this->getTitle(), FILTER_SANITIZE_STRING));
    return $string;
}
And here is an example input string:
Assessing <i>Clivia</i> taxonomy using the core DNA barcode regions, <i>matK</i> and <i>rbcLa</i>
Thanks!
The telltale lt and gt in your output tell me that the string you have is actually more like:
"Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy using the core DNA barcode regions, &lt;i&gt;matK&lt;/i&gt; and &lt;i&gt;rbcLa&lt;/i&gt;"
when viewed as plain text.
The string you show above is what a browser would display, because it interprets '&lt;' as '<' and '&gt;' as '>'. (These are usually called "HTML entities" and offer a way to encode a character that would otherwise be interpreted as HTML.)
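To see where the stray lt and gt come from, here is a minimal sketch that applies the question's regex to an entity-encoded string. It assumes the stored title really contains '&lt;' and '&gt;' (a guess based on the output shown), leaves out the filter_var call for brevity, and adds a whitespace-collapsing step that is not in the original code so the result is easier to read.

```php
<?php
// Assumed raw value: the markup is already entity-encoded in the stored title.
$raw = 'Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy';

// The question's whitelist replaces '&', ';' and '/' with spaces,
// so only the entity names "lt" and "gt" (and the tag name "i") survive.
$cleaned = preg_replace('/[^A-Za-z0-9\-]/', ' ', $raw);

// Collapse runs of spaces (added here only for readability).
$cleaned = trim(preg_replace('/\s+/', ' ', $cleaned));

echo $cleaned, PHP_EOL; // Assessing lt i gt Clivia lt i gt taxonomy
```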
One option is to process like this:
$s = "Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy …";
$s = html_entity_decode($s); // $s is now "Assessing <i>Clivia</i> taxonomy …"
$s = strip_tags($s); // $s is now "Assessing Clivia taxonomy …"
But do be aware that strip_tags is an exceedingly naïve function. For example, it would turn '1<5 and 6>2' into '12'! So for this to work perfectly, you need to be sure that all your input text is double-HTML-encoded, as the example is.
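Putting the two steps together with the whitelist from the question, a sketch of the whole cleanup might look like the following. The function name formatTitle is hypothetical, the flags passed to html_entity_decode are one reasonable choice rather than a requirement, and the final whitespace collapse is an addition. FILTER_SANITIZE_STRING is left out because it is deprecated as of PHP 8.1.

```php
<?php
/**
 * Hypothetical helper: decode entities, drop tags, then apply the
 * question's character whitelist and tidy the whitespace.
 */
function formatTitle(string $title): string
{
    // Turn "&lt;i&gt;" back into "<i>" so strip_tags() can see the markup.
    $decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Remove the now-visible tags.
    $text = strip_tags($decoded);

    // Same whitelist as in the question: anything else becomes a space.
    $text = preg_replace('/[^A-Za-z0-9\-]/', ' ', $text);

    // Collapse runs of spaces left behind by the replacement.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$input = 'Assessing &lt;i&gt;Clivia&lt;/i&gt; taxonomy using the core DNA '
       . 'barcode regions, &lt;i&gt;matK&lt;/i&gt; and &lt;i&gt;rbcLa&lt;/i&gt;';

echo formatTitle($input), PHP_EOL;
// Assessing Clivia taxonomy using the core DNA barcode regions matK and rbcLa
```

The same caveat about strip_tags applies: a title containing a literal '<' that is only encoded once will still lose text after decoding.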