I need help with speeding up a function to compute the proportion repeating digits (ignoring any non-digit). The function helps identify fake entries from a user, before running any check-digit verification (if such is even available). Think fake phone number, fake student number, fake checking account number, fake credit card number, fake any identifier, and so on.
The function is a generalization from this post.
Here is what it does. For the specified number of most frequently appearing digits, it computes the proportion of top digits to all digits in a string, ignoring all non-digits. If there are no digits in a string, it returns 1.0. All calculations are done on a vector of strings.
library(microbenchmark)
V = c('(12) 1221-12121,one-twoooooooooo', 'twos:22-222222222', '34-11111111, ext.123',
'01012', '123-456-789 valid', 'no digits', '', NaN, NA)
Fake_Similarity = function(V, TopNDigits) {
vapply(V, function(v) {
freq = sort(tabulate(as.integer(charToRaw(v)))[48:57], decreasing = T);
ratio = sum(freq[1:TopNDigits], na.rm = T) / sum(freq, na.rm = T)
if (is.nan(ratio)) ratio = 1
ratio
},
double(1))
}
t(rbind(Top1Digit = Fake_Similarity(v, 1), Top2Digits = Fake_Similarity(v, 2), Top3Digits = Fake_Similarity(v, 3)))
microbenchmark(Fake_Similarity(v, 2))
with the output. The labels are not important, but the order ratios must match the original order of corresponding strings.
Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545 1.0000000 1.0000000
twos:22-222222222 1.0000000 1.0000000 1.0000000
34-11111111, ext.123 0.6923077 0.8461538 0.9230769
01012 0.4000000 0.8000000 1.0000000
123-456-789 valid 0.1111111 0.2222222 0.3333333
no digits 1.0000000 1.0000000 1.0000000
1.0000000 1.0000000 1.0000000
NaN 1.0000000 1.0000000 1.0000000
<NA> 1.0000000 1.0000000 1.0000000
Unit: milliseconds
expr min lq mean median uq max neval
Fake_Similarity(v, 2) 1.225418 1.283113 1.305139 1.292755 1.304262 1.769703 100
For example, twos:22-222222222
has 11 digits and all of them are the same. So, for the Top1Digit
we have 11/11=1, for the Top2Digits
we have (11+0)/11=1 again, and so on. In other words, this is a fake number by any measure. It is highly unlikely for, let's say, a person's phone number to have identical digits, including the area code.
You can use this Rcpp function:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double prop_top_digit(const RawVector& x, int top_n_digits) {
// counts occurence of each character
IntegerVector counts(256);
RawVector::const_iterator it;
for(it = x.begin(); it != x.end(); ++it) counts[*it]--;
// partially sort first top_n_digits (negative -> decreasing)
IntegerVector::iterator it2 = counts.begin() + 48, it3;
std::partial_sort(it2, it2 + top_n_digits, it2 + 10);
// sum the first digits
int top = 0;
for(it3 = it2; it3 != (it2 + top_n_digits); ++it3) top += *it3;
// add the rest -> sum all
int div = top;
for(; it3 != (it2 + 10); ++it3) div += *it3;
// return the proportion
return div == 0 ? 1 : top / (double)div;
}
Verification:
Fake_Similarity2 <- function(V, TopNDigits) {
vapply(V, function(v) prop_top_digit(charToRaw(v), TopNDigits), 1)
}
t(rbind(Top1Digit = Fake_Similarity2(v, 1),
Top2Digits = Fake_Similarity2(v, 2),
Top3Digits = Fake_Similarity2(v, 3)))
Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545 1.0000000 1.0000000
twos:22-222222222 1.0000000 1.0000000 1.0000000
34-11111111, ext.123 0.6923077 0.8461538 0.9230769
01012 0.4000000 0.8000000 1.0000000
123-456-789 valid 0.1111111 0.2222222 0.3333333
no digits 1.0000000 1.0000000 1.0000000
1.0000000 1.0000000 1.0000000
NaN 1.0000000 1.0000000 1.0000000
<NA> 1.0000000 1.0000000 1.0000000
Benchmark:
microbenchmark(Fake_Similarity(v, 2), Fake_Similarity2(v, 2))
Unit: microseconds
expr min lq mean median uq max neval cld
Fake_Similarity(v, 2) 298.972 306.0905 328.69384 312.5465 328.108 600.924 100 b
Fake_Similarity2(v, 2) 25.163 27.1495 30.18863 29.1350 30.460 52.975 100 a