What does ($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) mean in the Moses Tokenizer?

The Moses tokenizer is a tokenizer widely used in machine translation and natural language processing experiments.

There is a line of regex that checks for:

if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || 
   ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || 
   ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))

Please correct me if I'm wrong, but I believe the 2nd and 3rd conditions check:

whether the prefix is in a list of nonbreaking prefixes
whether the word is not the last token and there is still a lowercased token as the next word.

The question is on the first condition where it checks for:

($pre =~ /\./ && $pre =~ /\p{IsAlpha}/)

Is the $pre =~ /\./ checking whether the prefix is a single fullstop?
And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?
One related question is whether the fullstop is already inside the perluniprop alphabet? If so, wouldn't this condition never be true?

Solution

Please correct me if I'm wrong [about $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1 checking] whether the prefix is in a list of nonbreaking prefixes

Can't tell without knowing what %NONBREAKING_PREFIX contains, but it's a fair guess.
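
To see what the second condition would do if that guess is right, here is a minimal sketch. The hash contents below are hypothetical (the real table is not shown in the question), and the meaning of the values 1 and 2 is an assumption made only for illustration.

```perl
use strict;
use warnings;

# Hypothetical contents; the real %NONBREAKING_PREFIX is built elsewhere.
my %NONBREAKING_PREFIX = (Mr => 1, Dr => 1, No => 2);

# Mirrors ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1).
# The first test short-circuits on a missing key, so == never sees undef.
sub is_category_one {
    my ($pre) = @_;
    my $ok = $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre} == 1;
    return $ok ? 1 : 0;
}

for my $pre (qw(Mr No Foo)) {
    printf "%-4s %s\n", $pre, is_category_one($pre) ? 'true' : 'false';
}
```

Under these assumptions only keys stored with the value 1 pass; a key stored with another value, or a key that is absent, does not.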

Please correct me if I'm wrong [about $i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/) checking] whether the word is not the last token and there is still a lowercased token as the next word

Assuming the code is iterating over @words, and $i is the index of the current word, then it checks if the current word is followed by a word that starts with a lowercase letter (as defined by Unicode).
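
A small sketch of that third condition, under the same assumption that $i indexes into @words. The word list and the helper name next_starts_lowercase are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical token list; in the tokenizer @words would come from the input.
my @words = ('approx.', 'five', 'items', 'left.', 'Done.');

# Mirrors ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)):
# true when a next word exists and it starts with a lowercase letter.
sub next_starts_lowercase {
    my ($words, $i) = @_;
    my $ok = $i < scalar(@$words) - 1 && $words->[$i + 1] =~ /^[\p{IsLower}]/;
    return $ok ? 1 : 0;
}

for my $i (0 .. $#words) {
    printf "%-8s %s\n", $words[$i], next_starts_lowercase(\@words, $i) ? 'yes' : 'no';
}
```

The last word always fails the index test, and 'left.' fails because 'Done.' starts with an uppercase letter.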

Is the $pre =~ /\./ checking whether the prefix is a single fullstop?

Not quite. It checks if any of the characters in the string in $pre is a FULL STOP.

$ perl -e'CORE::say "abc.def" =~ /\./ ? "match" : "no match"'
match

$ perl -e'CORE::say "abc!def" =~ /\./ ? "match" : "no match"'
no match

Perl first tries to find a match at position 0, then at position 1, and so on, until it either finds a match or runs out of positions to try.
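
For comparison, a check for "the prefix is a single full stop" would need an anchored pattern. The sketch below contrasts the two; the helper names are made up for illustration.

```perl
use strict;
use warnings;

# Unanchored: true if a full stop occurs anywhere in the string.
sub contains_full_stop  { my ($s) = @_; return $s =~ /\./     ? 1 : 0 }

# Anchored at both ends: true only if the string is exactly one full stop.
sub is_single_full_stop { my ($s) = @_; return $s =~ /\A\.\z/ ? 1 : 0 }

for my $s ('.', 'abc.def', 'abc') {
    printf "%-8s contains=%d single=%d\n",
        $s, contains_full_stop($s), is_single_full_stop($s);
}
```

Only "." satisfies both checks; "abc.def" satisfies the unanchored one alone.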

And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?

\p{IsAlpha} is indeed defined in perluniprops. [Note the correct spelling.] It defines

\p{Is_*}          ⇒   \p{*}
\p{Alpha}         ⇒   \p{XPosixAlpha}
\p{XPosixAlpha}   ⇒   \p{Alphabetic=Y}

\p{Alpha: *}      ⇒   \p{Alphabetic=*}
\p{Alphabetic}    ⇒   \p{Alphabetic=Y}

so \p{IsAlpha} is an alias for \p{Alphabetic=Y} [1]. Unicode defines which characters are Alphabetic [2]. There are quite a few:

$ unichars '\p{Alpha}' | wc -l
10391

So back to the question. $pre =~ /\p{IsAlpha}/ checks if any of the characters in the string in $pre is an alphabetic character.
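
A short sketch of that behaviour on a few sample strings. The samples and the helper name has_alpha are chosen only for illustration.

```perl
use strict;
use warnings;

# True if at least one character anywhere in the string is Alphabetic.
sub has_alpha { my ($s) = @_; return $s =~ /\p{IsAlpha}/ ? 1 : 0 }

my @samples = (
    [ 'abc',                        'abc' ],
    [ '12.5',                       '12.5' ],
    [ '1a.',                        '1a.' ],
    [ 'U+03B1 GREEK SMALL LETTER ALPHA', chr(0x3B1) ],
    [ '...',                        '...' ],
);

for my $sample (@samples) {
    my ($label, $string) = @$sample;
    printf "%-32s %s\n", $label, has_alpha($string) ? 'match' : 'no match';
}
```

Digits and punctuation alone do not match, while a single alphabetic character in any position, ASCII or not, is enough.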

One related question is whether the fullstop is already inside the perluniprop alphabet?

No.

$ perl -e'CORE::say "." =~ /\p{IsAlpha}/ ? "match" : "no match"'
no match

$ uniprops .
U+002E <.> \N{FULL STOP}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Case_Ignorable CI Common Zyyy Po P
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation Pat_Syn Pattern_Syntax
       PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print X_POSIX_Print Punctuation STerm Term
       Terminal_Punctuation Unicode X_POSIX_Punct

In contrast,

$ uniprops a
U+0061 <a> \N{LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
       ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
       IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
       POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
       X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS

If so, wouldn't this condition never be true?

No. The two matches are independent of each other, so the condition is true whenever the string contains at least one full stop and at least one alphabetic character, in any positions.

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a
no match

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' .
no match

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a.
match
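
The same check written as a small script, in case a table of results is easier to read. The sample prefixes are hypothetical, and the sketch assumes $pre holds a token's text without its final period, which the excerpt in the question does not show.

```perl
use strict;
use warnings;

# Mirrors ($pre =~ /\./ && $pre =~ /\p{IsAlpha}/): the string must contain
# a full stop somewhere and an alphabetic character somewhere.
sub has_dot_and_alpha {
    my ($pre) = @_;
    my $ok = $pre =~ /\./ && $pre =~ /\p{IsAlpha}/;
    return $ok ? 1 : 0;
}

for my $pre ('U.S', 'e.g', '3.14', 'Mr', '.') {
    printf "%-5s %s\n", $pre, has_dot_and_alpha($pre) ? 'match' : 'no match';
}
```

Here 'U.S' and 'e.g' match, while '3.14' (no alphabetic character), 'Mr' (no full stop) and '.' (no alphabetic character) do not.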

[1] Underscores and spaces are ignored, so \p{IsAlpha}, \p{Is_Alpha} and \p{I s_A l p_h_a} are all equivalent.

[2] The list of alphabetic characters is slightly different from the list of letter characters.
```
$ unichars '\p{Letter}' | wc -l
9540

$ unichars '\p{Alpha}' | wc -l
10391
```
All letters are alphabetic, but so are some alphabetic marks, roman numerals, etc.
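
As a concrete instance of that difference, the sketch below tests U+2160 ROMAN NUMERAL ONE, which has the general category Letter_Number rather than Letter but is still Alphabetic. The helper names are made up for illustration.

```perl
use strict;
use warnings;

sub is_letter     { my ($c) = @_; return $c =~ /\p{Letter}/ ? 1 : 0 }
sub is_alphabetic { my ($c) = @_; return $c =~ /\p{Alpha}/  ? 1 : 0 }

my @samples = (
    [ 'LATIN SMALL LETTER A', 'a' ],
    [ 'ROMAN NUMERAL ONE',    chr(0x2160) ],
    [ 'DIGIT ONE',            '1' ],
);

# Print the character names rather than the characters themselves, so no
# output encoding layer is needed.
for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-22s letter=%d alphabetic=%d\n",
        $name, is_letter($char), is_alphabetic($char);
}
```

The letter "a" has both properties, the Roman numeral is alphabetic without being a letter, and the digit has neither.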