php regex preg-match

preg_match exclude strings


From 10,000 lines of data, I need to get all the lines that don't contain words that START with "en", "it", "de", etc. These words are 2 to 5 characters long and consist of a-z, A-Z, "-" (minus sign) and ";".

I tried this, but it doesn't work:

 !preg_match("/\b(it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/", $value)

As I read it, this means: don't match any line that has a word starting with "it", "en", etc., that is 2 to 5 characters long, where those characters can also include "-" or ";".

This still returns lines containing "it;", which I need to exclude.
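
A minimal reproduction of the behaviour described above might look like the following. The two sample strings are hypothetical and shortened; they are not taken from the real data set.

```php
<?php
// The pattern from the question, unchanged.
$pattern = "/\b(it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/";

// Hypothetical sample values for illustration only.
$samples = [
    "compatible; it; MSIE 8.0",     // short code followed directly by ";"
    "compatible; it-IT; MSIE 8.0",  // code with a region suffix
];

$results = [];
foreach ($samples as $value) {
    // true means the line is kept, false means it is excluded.
    $results[] = !preg_match($pattern, $value);
}

var_dump($results);
```

With these inputs the first line is kept even though it contains "it;", while the second line is excluded as intended.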

EDIT: I need to match every word that starts with those 2 characters ("it", "en", "de", etc.), and the word can be anywhere in the line.

Example to match (it doesn't contain words that start with "en", "de", etc.)

GET; SITE; 15:03:03; ; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1; .NET4.0C); 

Example not to match (it does contain a word that starts with "en")

GET; SITE; 13:06:49; ; Mozilla/4.0 (compatible; **en;** MSIE 8.0; Windows NT 6.1; Trident/4.0; SIMBAR={E76F6580-EB92-49A3-A089-F6B8B9DEA9AA}; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; eSobiSubscriber 2.0.4.16; Media Center PC 5.0; SLCC1; .NET4.0C); ; 

Solution

  • As far as I can tell, your regex matches strings that start with one of the language codes and have a total length of 4 to 7 characters, not 2 to 5. So en; does not match because it is only three characters long. The {2,5} applies only to the expression to its immediate left, so your regex reads "A word that starts with it/en/de etc. and continues with between two and five letters/dashes/semicolons." Try \b(it|en|de|es|fr|ru)[a-zA-Z-;]{0,3}.

    You might also want to be explicit about the semicolon being the last character, and perhaps be more specific about the structure of the ISO language codes (which I assume these strings are): \b(it|en|de|es|fr|ru)(-[a-zA-Z]{2})?;?\b. This reads: "A word that starts with it/en/de etc., may continue with a dash and two letters, and (whether or not it had the dash and two letters) may be followed by a semicolon. Nothing else is allowed before the word ends."
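
As a rough sketch of how the stricter pattern could be applied to the whole data set, the snippet below filters an array of lines with preg_grep and the PREG_GREP_INVERT flag, which returns the elements that do not match. The $lines array is a hypothetical, shortened stand-in for the real 10,000 lines, and the variable names are assumptions for illustration.

```php
<?php
// Stricter pattern from the answer: code, optional "-XX" suffix, optional ";".
$pattern = '/\b(it|en|de|es|fr|ru)(-[a-zA-Z]{2})?;?\b/';

// Hypothetical, shortened lines modelled on the examples in the question.
$lines = [
    "GET; SITE; 15:03:03; ; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)",
    "GET; SITE; 13:06:49; ; Mozilla/4.0 (compatible; en; MSIE 8.0; Windows NT 6.1)",
    "GET; SITE; 13:07:12; ; Mozilla/4.0 (compatible; en-US; MSIE 8.0)",
    "GET; SITE; 13:08:40; ; Mozilla/4.0 (compatible; entry; MSIE 8.0)",
];

// PREG_GREP_INVERT keeps only the lines that do NOT match the pattern.
// The original array keys are preserved in the result.
$kept = preg_grep($pattern, $lines, PREG_GREP_INVERT);

print_r(array_keys($kept));
```

Here the lines containing "en;" and "en-US;" are dropped. The line containing "entry" is kept, because the trailing \b rejects longer words that merely begin with "en"; if such words should also be excluded, the looser pattern with {0,3} from the first paragraph is the closer fit.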