Speed up the calculation of proportion of most frequent digits in a string, in R

I need help with speeding up a function to compute the proportion repeating digits (ignoring any non-digit). The function helps identify fake entries from a user, before running any check-digit verification (if such is even available). Think fake phone number, fake student number, fake checking account number, fake credit card number, fake any identifier, and so on.

The function is a generalization from this post.

Here is what it does. For the specified number of most frequently appearing digits, it computes the proportion of top digits to all digits in a string, ignoring all non-digits. If there are no digits in a string, it returns 1.0. All calculations are done on a vector of strings.

library(microbenchmark)
V = c('(12) 1221-12121,one-twoooooooooo', 'twos:22-222222222', '34-11111111, ext.123', 
        '01012', '123-456-789 valid', 'no digits', '', NaN, NA)

Fake_Similarity = function(V, TopNDigits) {
    vapply(V, function(v) {
        freq = sort(tabulate(as.integer(charToRaw(v)))[48:57], decreasing = T);
        ratio = sum(freq[1:TopNDigits], na.rm = T) / sum(freq, na.rm = T)
        if (is.nan(ratio)) ratio = 1
        ratio
    },
    double(1))
}

t(rbind(Top1Digit = Fake_Similarity(v, 1), Top2Digits = Fake_Similarity(v, 2), Top3Digits = Fake_Similarity(v, 3)))

microbenchmark(Fake_Similarity(v, 2))

with the output. The labels are not important, but the order ratios must match the original order of corresponding strings.

                                 Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545  1.0000000  1.0000000
twos:22-222222222                1.0000000  1.0000000  1.0000000
34-11111111, ext.123             0.6923077  0.8461538  0.9230769
01012                            0.4000000  0.8000000  1.0000000
123-456-789 valid                0.1111111  0.2222222  0.3333333
no digits                        1.0000000  1.0000000  1.0000000
                                 1.0000000  1.0000000  1.0000000
NaN                              1.0000000  1.0000000  1.0000000
<NA>                             1.0000000  1.0000000  1.0000000
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 Fake_Similarity(v, 2) 1.225418 1.283113 1.305139 1.292755 1.304262 1.769703   100

For example, twos:22-222222222 has 11 digits and all of them are the same. So, for the Top1Digit we have 11/11=1, for the Top2Digits we have (11+0)/11=1 again, and so on. In other words, this is a fake number by any measure. It is highly unlikely for, let's say, a person's phone number to have identical digits, including the area code.

Solution

You can use this Rcpp function:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double prop_top_digit(const RawVector& x, int top_n_digits) {

  // counts occurence of each character
  IntegerVector counts(256);
  RawVector::const_iterator it;
  for(it = x.begin(); it != x.end(); ++it) counts[*it]--;

  // partially sort first top_n_digits (negative -> decreasing)
  IntegerVector::iterator it2 = counts.begin() + 48, it3;
  std::partial_sort(it2, it2 + top_n_digits, it2 + 10);

  // sum the first digits
  int top = 0;
  for(it3 = it2; it3 != (it2 + top_n_digits); ++it3) top += *it3;

  // add the rest -> sum all
  int div = top;
  for(; it3 != (it2 + 10); ++it3) div += *it3;

  // return the proportion
  return div == 0 ? 1 : top / (double)div;
}

Verification:

Fake_Similarity2 <- function(V, TopNDigits) {
  vapply(V, function(v) prop_top_digit(charToRaw(v), TopNDigits), 1)
    }
t(rbind(Top1Digit = Fake_Similarity2(v, 1), 
        Top2Digits = Fake_Similarity2(v, 2), 
        Top3Digits = Fake_Similarity2(v, 3)))
                                 Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545  1.0000000  1.0000000
twos:22-222222222                1.0000000  1.0000000  1.0000000
34-11111111, ext.123             0.6923077  0.8461538  0.9230769
01012                            0.4000000  0.8000000  1.0000000
123-456-789 valid                0.1111111  0.2222222  0.3333333
no digits                        1.0000000  1.0000000  1.0000000
                                 1.0000000  1.0000000  1.0000000
NaN                              1.0000000  1.0000000  1.0000000
<NA>                             1.0000000  1.0000000  1.0000000

Benchmark:

microbenchmark(Fake_Similarity(v, 2), Fake_Similarity2(v, 2))
Unit: microseconds
                   expr     min       lq      mean   median      uq     max neval cld
  Fake_Similarity(v, 2) 298.972 306.0905 328.69384 312.5465 328.108 600.924   100   b
 Fake_Similarity2(v, 2)  25.163  27.1495  30.18863  29.1350  30.460  52.975   100  a