Search code examples
rstringperformancemaxdigits

Speed up the calculation of proportion of most frequent digits in a string, in R


I need help with speeding up a function to compute the proportion repeating digits (ignoring any non-digit). The function helps identify fake entries from a user, before running any check-digit verification (if such is even available). Think fake phone number, fake student number, fake checking account number, fake credit card number, fake any identifier, and so on.

The function is a generalization from this post.

Here is what it does. For the specified number of most frequently appearing digits, it computes the proportion of top digits to all digits in a string, ignoring all non-digits. If there are no digits in a string, it returns 1.0. All calculations are done on a vector of strings.

library(microbenchmark)
V = c('(12) 1221-12121,one-twoooooooooo', 'twos:22-222222222', '34-11111111, ext.123', 
        '01012', '123-456-789 valid', 'no digits', '', NaN, NA)

Fake_Similarity = function(V, TopNDigits) {
    vapply(V, function(v) {
        freq = sort(tabulate(as.integer(charToRaw(v)))[48:57], decreasing = T);
        ratio = sum(freq[1:TopNDigits], na.rm = T) / sum(freq, na.rm = T)
        if (is.nan(ratio)) ratio = 1
        ratio
    },
    double(1))
}

t(rbind(Top1Digit = Fake_Similarity(v, 1), Top2Digits = Fake_Similarity(v, 2), Top3Digits = Fake_Similarity(v, 3)))

microbenchmark(Fake_Similarity(v, 2))

with the output. The labels are not important, but the order ratios must match the original order of corresponding strings.

                                 Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545  1.0000000  1.0000000
twos:22-222222222                1.0000000  1.0000000  1.0000000
34-11111111, ext.123             0.6923077  0.8461538  0.9230769
01012                            0.4000000  0.8000000  1.0000000
123-456-789 valid                0.1111111  0.2222222  0.3333333
no digits                        1.0000000  1.0000000  1.0000000
                                 1.0000000  1.0000000  1.0000000
NaN                              1.0000000  1.0000000  1.0000000
<NA>                             1.0000000  1.0000000  1.0000000
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 Fake_Similarity(v, 2) 1.225418 1.283113 1.305139 1.292755 1.304262 1.769703   100

For example, twos:22-222222222 has 11 digits and all of them are the same. So, for the Top1Digit we have 11/11=1, for the Top2Digits we have (11+0)/11=1 again, and so on. In other words, this is a fake number by any measure. It is highly unlikely for, let's say, a person's phone number to have identical digits, including the area code.


Solution

  • You can use this Rcpp function:

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    double prop_top_digit(const RawVector& x, int top_n_digits) {
    
      // counts occurence of each character
      IntegerVector counts(256);
      RawVector::const_iterator it;
      for(it = x.begin(); it != x.end(); ++it) counts[*it]--;
    
      // partially sort first top_n_digits (negative -> decreasing)
      IntegerVector::iterator it2 = counts.begin() + 48, it3;
      std::partial_sort(it2, it2 + top_n_digits, it2 + 10);
    
      // sum the first digits
      int top = 0;
      for(it3 = it2; it3 != (it2 + top_n_digits); ++it3) top += *it3;
    
      // add the rest -> sum all
      int div = top;
      for(; it3 != (it2 + 10); ++it3) div += *it3;
    
      // return the proportion
      return div == 0 ? 1 : top / (double)div;
    }
    

    Verification:

    Fake_Similarity2 <- function(V, TopNDigits) {
      vapply(V, function(v) prop_top_digit(charToRaw(v), TopNDigits), 1)
        }
    t(rbind(Top1Digit = Fake_Similarity2(v, 1), 
            Top2Digits = Fake_Similarity2(v, 2), 
            Top3Digits = Fake_Similarity2(v, 3)))
                                     Top1Digit Top2Digits Top3Digits
    (12) 1221-12121,one-twoooooooooo 0.5454545  1.0000000  1.0000000
    twos:22-222222222                1.0000000  1.0000000  1.0000000
    34-11111111, ext.123             0.6923077  0.8461538  0.9230769
    01012                            0.4000000  0.8000000  1.0000000
    123-456-789 valid                0.1111111  0.2222222  0.3333333
    no digits                        1.0000000  1.0000000  1.0000000
                                     1.0000000  1.0000000  1.0000000
    NaN                              1.0000000  1.0000000  1.0000000
    <NA>                             1.0000000  1.0000000  1.0000000
    

    Benchmark:

    microbenchmark(Fake_Similarity(v, 2), Fake_Similarity2(v, 2))
    Unit: microseconds
                       expr     min       lq      mean   median      uq     max neval cld
      Fake_Similarity(v, 2) 298.972 306.0905 328.69384 312.5465 328.108 600.924   100   b
     Fake_Similarity2(v, 2)  25.163  27.1495  30.18863  29.1350  30.460  52.975   100  a