Find a substring based on specific format (7 digits then 1 uppercase letter)


In my OCR text I am getting an output like this...

"responses": [
{
  "textAnnotations": [
    {
      "locale": "fr",
      "description": "3160 6392682B\nrinlraction\nE叠\narlairs&Lei sot les infractions provinsiales, Cour de jnstice de Ontarie Regt.

I want to fetch the "6392682B" value (8 alphanumeric characters). The digits and the final letter will vary between images. The only constant is the format: 8 characters, of which the first 7 are digits and the last one is a letter.

I tried with:

preg_match_all("/(\d{7})/", $str, $ar);

This matches only the 7 digits; I need the 7 digits followed by the letter.


Solution

  • $description = "3160 6392682B\nrinlraction\nE叠\narlairs&Lei sot les infractions provinsiales, Cour de jnstice de Ontarie Regt.";
    

    Literally match 7 digits followed by 1 uppercase letter:

    echo preg_match('/\d{7}[A-Z]/',$description,$out)?$out[0]:'not found';
    

    If you know that your substring immediately follows the first string of numbers and a space:

    echo preg_match('/\d+ \K\d{7}[A-Z]/',$description,$out)?$out[0]:'not found';
    

    If you need to set some boundaries so that no word characters immediately precede or follow the substring:

    echo preg_match('/\b\d{7}[A-Z]\b/',$description,$out)?$out[0]:'not found';
    

    The leading \b ensures that no digit, letter, or underscore immediately precedes the 7 digits (so a run of 8 or more digits is rejected), and the trailing \b ensures that no alphanumeric character or underscore follows the uppercase letter of your desired substring.
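To illustrate what the word boundaries accept and reject, here is a minimal sketch. The helper name `findBoundedCode` and the altered sample strings are made up for this illustration and are not part of the original input; only the first sample reflects the text from the question.

```php
<?php
// Hypothetical helper (not from the question): returns the first token made of
// 7 digits + 1 uppercase letter that is delimited by word boundaries, or null.
function findBoundedCode(string $text): ?string
{
    return preg_match('/\b\d{7}[A-Z]\b/', $text, $m) ? $m[0] : null;
}

$samples = [
    "3160 6392682B\nrinlraction",  // standalone token, as in the question
    "3160 16392682B\nrinlraction", // assumed variation: 8 digits before the letter
    "3160 6392682BX\nrinlraction", // assumed variation: extra letter after the code
    "3160 6392682B_\nrinlraction", // assumed variation: underscore is a word character
];

foreach ($samples as $sample) {
    // Only the first sample should yield a match; the others give null.
    var_export(findBoundedCode($sample));
    echo "\n";
}
```

Without the two \b assertions, the same pattern would find a 7-digit-plus-letter substring inside each of the altered samples as well, which is why the boundaries matter when the surrounding text is unpredictable.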

    If you know the position of your substring, you can even match it based on the characters that precede and follow it (here: everything after the first space, up to the next newline):

    echo preg_match('/ \K[^\n]+/',$description,$out)?$out[0]:'not found';
    

    Some additional clarifications:

    \K resets the start of the fullstring match, so everything matched before it is dropped from the result and there is no need for a capture group.
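As a rough comparison, the sketch below runs the \K pattern from above next to an equivalent capture-group version. The shortened `$description` and the variable names `$withK` and `$withGroup` are chosen only for this illustration.

```php
<?php
// Shortened version of the sample text from the question.
$description = "3160 6392682B\nrinlraction";

// With \K: the leading digits and the space must match, but they are dropped
// from the result, so the wanted substring is the full match at index 0.
preg_match('/\d+ \K\d{7}[A-Z]/', $description, $withK);

// Without \K: the prefix stays in the full match, so a capture group is
// needed and the wanted substring moves to index 1.
preg_match('/\d+ (\d{7}[A-Z])/', $description, $withGroup);

var_export($withK);     // a single element: the wanted substring
echo "\n";
var_export($withGroup); // two elements: full match, then the captured group
echo "\n";
```

Both approaches find the same substring; \K simply keeps the output array smaller and avoids having to remember which group index holds the value.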

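    \b is a zero-width regex assertion called a "word boundary": it matches the position between a word character (letter, digit, or underscore) and a non-word character or the start/end of the string.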

    Anchoring the pattern to the start of the string with ^ is only beneficial if you KNOW that your desired substring follows the leading string of numbers and a space.

    The unicode flag (u) is unnecessary because your PATTERN doesn't contain any unicode characters.
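A quick way to check this is to run the first pattern with and without the u modifier on text that contains a multibyte character. This is only a sketch: it assumes the script file is saved as UTF-8, and the shortened `$description` and the variable names are illustrative.

```php
<?php
// Shortened sample text; it contains a multibyte character, but the pattern is pure ASCII.
// Assumption: this file is saved as UTF-8, otherwise the u modifier would reject the subject.
$description = "3160 6392682B\nrinlraction\nE叠\narlairs";

$withoutU = preg_match('/\d{7}[A-Z]/', $description, $a) ? $a[0] : 'not found';
$withU    = preg_match('/\d{7}[A-Z]/u', $description, $b) ? $b[0] : 'not found';

// Both variants are expected to find the same substring.
var_dump($withoutU === $withU);
```

The modifier would start to matter if the pattern itself had to match non-ASCII characters, for example a character class that contains them.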

    You may test my patterns at regex101.com. Selecting the most accurate and efficient pattern is only possible after fully understanding the variability of your input string ($description). I will not make any assumptions about the substring's position in the string.

    The only thing that I can be absolutely sure of is the matching pattern based on your provided details: 7-digits then 1 uppercase letter. That is exactly what my first, second, and third patterns do.

    Francesco's first pattern will match: AAAAAAAAAAAAAAAAA, 11111111111111111111, 1A2S3D4F5G6H7J8K9L0

    Francesco's second pattern will match: ZZZZZZZZ, 99999999, A1B2C3D4

    This makes his pattern inaccurate / bad / misleading, and likely to teach future SO readers poor regex practices ...not to mention potentially foul up your project!