The Moses tokenizer is widely used in machine translation and natural language processing experiments.
It contains a condition built from three checks, two of which use regexes:
if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) ||
($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) ||
($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
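Please correct me if I'm wrong: the 2nd and 3rd conditions check, respectively, whether the prefix is in a list of nonbreaking prefixes, and whether the word is not the last token and there is still a lowercased token as the next word.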
The question is on the first condition where it checks for:
($pre =~ /\./ && $pre =~ /\p{IsAlpha}/)
Is the $pre =~ /\./ checking whether the prefix is a single fullstop?
And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?
One related question is whether the fullstop is already inside the perluniprop alphabet? If so, wouldn't this condition never be true?
Please correct me if I'm wrong [about $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1 checking] whether the prefix is in a list of nonbreaking prefixes
Can't tell without knowing what %NONBREAKING_PREFIX contains, but it's a fair guess.
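To make that guess concrete, here is a minimal sketch of the second condition. It assumes %NONBREAKING_PREFIX is a hash mapping known prefixes to a numeric flag. The entries and the helper name is_plain_nonbreaking are invented for illustration and are not taken from the tokenizer's own data.

```perl
use strict;
use warnings;

# Hypothetical contents; the real hash is populated elsewhere in the tokenizer.
my %NONBREAKING_PREFIX = (
    'Mr' => 1,
    'Dr' => 1,
    'No' => 2,    # a value other than 1, to show that the ==1 test matters
);

# True only if the prefix is a key with a true value AND that value is 1.
sub is_plain_nonbreaking {
    my ($pre) = @_;
    return ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre} == 1) ? 1 : 0;
}

printf "%s: %d\n", $_, is_plain_nonbreaking($_) for qw(Mr No Hello);
```

Under these assumptions, only a prefix that is present in the hash with the value 1 passes; a missing key or a key with any other value does not.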
Please correct me if I'm wrong [about $i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/) checking] whether the word is not the last token and there is still a lowercased token as the next word
Assuming the code is iterating over @words, and $i is the index of the current word, then it checks if the current word is followed by a word that starts with a lowercase letter (as defined by Unicode).
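A small sketch of that third condition in isolation, under the same assumption that @words holds the tokens and $i is the current index. The helper name next_word_is_lowercase and the sample words are made up for the example.

```perl
use strict;
use warnings;

# True if $words[$i] is not the last word and the next word
# starts with a lowercase letter (per Unicode's definition).
sub next_word_is_lowercase {
    my ($i, @words) = @_;
    return ($i < scalar(@words) - 1 && $words[$i + 1] =~ /^[\p{IsLower}]/) ? 1 : 0;
}

my @words = ('etc.', 'and', 'More.');
printf "%d: %d\n", $_, next_word_is_lowercase($_, @words) for 0 .. $#words;
```

Index 0 passes because "and" starts with a lowercase letter, index 1 fails because "More." starts with an uppercase letter, and index 2 fails because it is the last word.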
Is the $pre =~ /\./ checking whether the prefix is a single fullstop?
Not quite. It checks if any of the characters in the string in $pre is a FULL STOP.
$ perl -e'CORE::say "abc.def" =~ /\./ ? "match" : "no match"'
match
$ perl -e'CORE::say "abc!def" =~ /\./ ? "match" : "no match"'
no match
Because the pattern is not anchored, Perl first tries to find a match at position 0, then at position 1, etc., until it finds a match or runs out of positions.
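For contrast, a check for "the prefix is exactly one full stop" would need anchors at both ends. The following sketch compares the two; the helper names are invented for the example and do not appear in the tokenizer.

```perl
use strict;
use warnings;

# Unanchored: true if a FULL STOP occurs anywhere in the string.
sub contains_full_stop { return $_[0] =~ /\./ ? 1 : 0 }

# Anchored at both ends: true only if the whole string is one FULL STOP.
sub is_single_full_stop { return $_[0] =~ /^\.\z/ ? 1 : 0 }

for my $s ('abc.def', '.', 'abc') {
    printf "%-7s contains=%d single=%d\n",
        $s, contains_full_stop($s), is_single_full_stop($s);
}
```

Only the string "." satisfies both checks; "abc.def" satisfies the unanchored one alone, which is the check the condition in question actually performs.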
And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?
\p{IsAlpha} is indeed defined in perluniprops. [Note the correct spelling.] It defines
\p{Is_*} ⇒ \p{*}
\p{Alpha} ⇒ \p{XPosixAlpha}
\p{XPosixAlpha} ⇒ \p{Alphabetic=Y}
\p{Alpha: *} ⇒ \p{Alphabetic=*}
\p{Alphabetic} ⇒ \p{Alphabetic=Y}
so \p{IsAlpha} is an alias for \p{Alphabetic=Y}[1]. Unicode defines what characters are Alphabetic[2]. There are quite a few:
$ unichars '\p{Alpha}' | wc -l
10391
So back to the question. $pre =~ /\p{IsAlpha}/ checks if any of the characters in the string in $pre is an alphabetic character.
One related question is whether the fullstop is already inside the perluniprop alphabet?
No.
$ perl -e'CORE::say "." =~ /\p{IsAlpha}/ ? "match" : "no match"'
no match
$ uniprops .
U+002E <.> \N{FULL STOP}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Case_Ignorable CI Common Zyyy Po P
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation Pat_Syn Pattern_Syntax
PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print X_POSIX_Print Punctuation STerm Term
Terminal_Punctuation Unicode X_POSIX_Punct
In contrast,
$ uniprops a
U+0061 <a> \N{LATIN SMALL LETTER A}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
If so, wouldn't this condition never be true?
$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a
no match
$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' .
no match
$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a.
match
[1] Underscores and spaces are ignored, so \p{IsAlpha}, \p{Is_Alpha} and \p{I s_A l p_h_a} are all equivalent.
[2] The list of alphabetic characters is slightly different from the list of letter characters.
$ unichars '\p{Letter}' | wc -l
9540
$ unichars '\p{Alpha}' | wc -l
10391
All letters are alphabetic, but so are some alphabetic marks, Roman numerals, etc.
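As one concrete instance of that difference, the sketch below compares a plain Latin letter with U+2160 ROMAN NUMERAL ONE, whose general category is Nl (letter number) rather than a letter category. The helper name classify is made up for the example.

```perl
use strict;
use warnings;

# U+2160 ROMAN NUMERAL ONE is Alphabetic but is not a Letter.
my $roman_one = "\x{2160}";
my $latin_a   = "a";

# Report whether a character matches \p{Alpha} and whether it matches \p{Letter}.
sub classify {
    my ($ch) = @_;
    return join ',',
        ($ch =~ /\p{Alpha}/  ? 'alpha'  : 'not-alpha'),
        ($ch =~ /\p{Letter}/ ? 'letter' : 'not-letter');
}

printf "U+%04X: %s\n", ord($_), classify($_) for $latin_a, $roman_one;
```

The letter "a" matches both properties, while the Roman numeral matches \p{Alpha} only, so a prefix such as "Ⅰ." would still satisfy the first condition.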