I got strange results using NSDataDetector and I am looking for insight into how it works.
Is it matching against an internal database, or is it using some separation algorithm to detect the individual fields in the string?
Currently, I am using the following code to detect the fields of an address:
NSDataDetector *address = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeAddress error:nil];
NSArray *matcheslinkaa = [address matchesInString:inputString options:0 range:NSMakeRange(0, [inputString length])];
if ([matcheslinkaa count] > 0)
{
    for (NSTextCheckingResult *match in matcheslinkaa)
    {
        if ([match resultType] == NSTextCheckingTypeAddress)
        {
            // Dictionary of the detected address fields (Street, City, ZIP, ...)
            NSDictionary *components = [match addressComponents];
            NSLog(@"addressComponents %@", components);
        }
    }
}
Following is a sample set of input strings and their respective outputs, using the above code:
inputString = @"100 Main Street\n"
"Anytown, NY 12345\n"
"USA";
// prints:
// addressComponents {
// City = Anytown;
// Country = USA;
// State = NY;
// Street = "100 Main Street";
// ZIP = 12345;
// }
inputString = @"A-205 Natasha Golf View\n"
"2 Inner Ring Road\n"
"Bangalore\n"
"560071\n"
"Karnataka";
// prints:
// addressComponents {
// City = Bangalore;
// Street = "2 Inner Ring Road";
// ZIP = 560071;
// }
inputString = @"A-205 Natasha Golf View\n"
"2 Inner Ring Road\n"
"Domlur\n"
"Bangalore\n"
"560071\n"
"India";
// prints:
// addressComponents {
// City = Bangalore;
// Street = "2 Inner Ring Road";
// ZIP = 560071;
// }
inputString = @"Dak Bhavan\n"
"Parliament Street\n"
"NEW DELHI 110001\n"
"INDIA";
// => `addressComponents` is empty!
As you can see, NSDataDetector has no problem extracting US addresses. Why does it fare so much worse with Indian addresses that it doesn't even find the country name?
I cannot tell you how it works. The fact that NSDataDetector inherits from NSRegularExpression may suggest that it uses a set of regular expressions, but I honestly doubt that (e.g. the detector for date types uses information that is sprinkled throughout longer blocks of text, so it appears more likely that there is some natural language clustering and processing going on under the hood).
The main reason why it works better with American addresses, I suppose, is as simple as it is boring:
Apple is a US-based company and (with the exception of Jonathan Ive, who is British) every one of its top-level executives is North American. It is therefore hardly surprising that their approach is "US/North America first" [1].
It's the reason why the design of the power-brick is so elegant when using the compact US connector (where the prongs fold in) — and looks so clumsy with almost any other...
The other reason is that Apple — like anyone else — ships as soon as they can:
If they have something working for their US customers but not for the rest, why not ship it for them and add support for other locales via software updates later?
With regard to your problem, what may or may not help with the detection of addresses (read: "I didn't bother testing") is having the user set the locale of their device appropriately.
If, and only if, you find out that this has a positive impact on your results, you could then check whether the country part of [[NSLocale currentLocale] localeIdentifier] equals IN and, in case it doesn't, prompt the user to change their region in the "Settings" app.
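A minimal sketch of such a check, assuming the locale turns out to matter at all (which, as said, is untested). The helper name `RegionIsIndia` is hypothetical, and it reads the region through `NSLocaleCountryCode` rather than parsing the identifier string by hand:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of any Apple API: returns YES when the
// given locale's region is India.
static BOOL RegionIsIndia(NSLocale *locale)
{
    // NSLocaleCountryCode yields the region code (e.g. @"IN" or @"US"),
    // or nil if the locale carries no region.
    NSString *countryCode = [locale objectForKey:NSLocaleCountryCode];

    // Messaging nil returns NO, so a locale without a region is "not India".
    return [countryCode isEqualToString:@"IN"];
}
```

You would call it as `RegionIsIndia([NSLocale currentLocale])` and show your prompt only when it returns NO.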
If that doesn't prove useful, you'll have to Roll-Your-Own™...
[1] The major notable exception to this rule was the choice of the baseband technology for the original iPhone, where favoring GSM over CDMA may have been a disadvantage locally but the key to success globally.