Search code examples
rdata-manipulation

R: Encoding and Decoding Data?


I am working with the R programming language.

Consider the following puzzle:

enter image description here

I always wondered if the following was possible - suppose I have this data:

id = c(1,2,3)
names = c("john", "tim", "alex")
country = c("Canada", "England", "Mexico")
my_data = data.frame(id, names, country)

  id names country
1  1  john  Canada
2  2   tim England
3  3  alex  Mexico

I always wondered if it is possible to make a "puzzle" out of this data. For example, is it possible to replace the names and countries with random letters/numbers, and then record what the "conversion key" for this replacement? For example:

# output 1: sample output 
  id names country
1  1  h37ubL  998ggg6ppggg6h8ggg6

# output 2: sample conversion key

   original coded
1         j    h3
2         o     7
3         h    ub
4         n     L
5         C   998
6         a  ggg6
7         n    pp
8         a  ggg6
9         d    h8
10        a  ggg6

Is something like this possible in R - to create a "key" for all possible characters (e.g. letters (capital, small), numbers, accents, dashes, commas, special symbols, etc.) with the exception of spaces? (i.e. a space in the original should be preserved)

Thanks!


Solution

  • It's not difficult to program a substitution cypher in R. Shuffle the alphabet, in the example below the printable ASCII characters, and keep this as the key. Don't forget to not shuffle the space character, ASCII character nr 32.

    Then the same function can encrypt and decrypt by matching the characters in the message with the alphabet and use this match to index the cypher key. Decryption just swaps key and alphabet.

    The code below uses two functions posted to R-bloggers to get the ASCII characters.

    1. The alphabet

    asc <- function(x) { strtoi(charToRaw(x), 16L) }
    chr <- function(n) { rawToChar(as.raw(n)) }
    
    ascii_printable <- 32:126
    ascii_chars <- sapply(ascii_printable, chr)
    data.frame(number = 32:126, char = ascii_chars)
    #>    number char
    #> 1      32     
    #> 2      33    !
    #> 3      34    "
    #> 4      35    #
    #> 5      36    $
    #> 6      37    %
    #> 7      38    &
    #> 8      39    '
    #> 9      40    (
    #> 10     41    )
    #> 11     42    *
    #> 12     43    +
    #> 13     44    ,
    #> 14     45    -
    #> 15     46    .
    #> 16     47    /
    #> 17     48    0
    #> 18     49    1
    #> 19     50    2
    #> 20     51    3
    #> 21     52    4
    #> 22     53    5
    #> 23     54    6
    #> 24     55    7
    #> 25     56    8
    #> 26     57    9
    #> 27     58    :
    #> 28     59    ;
    #> 29     60    <
    #> 30     61    =
    #> 31     62    >
    #> 32     63    ?
    #> 33     64    @
    #> 34     65    A
    #> 35     66    B
    #> 36     67    C
    #> 37     68    D
    #> 38     69    E
    #> 39     70    F
    #> 40     71    G
    #> 41     72    H
    #> 42     73    I
    #> 43     74    J
    #> 44     75    K
    #> 45     76    L
    #> 46     77    M
    #> 47     78    N
    #> 48     79    O
    #> 49     80    P
    #> 50     81    Q
    #> 51     82    R
    #> 52     83    S
    #> 53     84    T
    #> 54     85    U
    #> 55     86    V
    #> 56     87    W
    #> 57     88    X
    #> 58     89    Y
    #> 59     90    Z
    #> 60     91    [
    #> 61     92   \\
    #> 62     93    ]
    #> 63     94    ^
    #> 64     95    _
    #> 65     96    `
    #> 66     97    a
    #> 67     98    b
    #> 68     99    c
    #> 69    100    d
    #> 70    101    e
    #> 71    102    f
    #> 72    103    g
    #> 73    104    h
    #> 74    105    i
    #> 75    106    j
    #> 76    107    k
    #> 77    108    l
    #> 78    109    m
    #> 79    110    n
    #> 80    111    o
    #> 81    112    p
    #> 82    113    q
    #> 83    114    r
    #> 84    115    s
    #> 85    116    t
    #> 86    117    u
    #> 87    118    v
    #> 88    119    w
    #> 89    120    x
    #> 90    121    y
    #> 91    122    z
    #> 92    123    {
    #> 93    124    |
    #> 94    125    }
    #> 95    126    ~
    

    Created on 2022-11-13 with reprex v2.0.2

    2. The cypher function and table

    Note that the space is kept as the first character in the table below, it is not shuffled with sample.

    encrypt <- function(x, key, alphabet = ascii_chars) {
      s <- strsplit(x, "")
      sapply(s, \(y) {
        i <- match(y, alphabet)
        paste(key[i], collapse = "")
      })
    }
    
    set.seed(2022)
    coded <- c(" ", sample(ascii_chars[-1]))
    data.frame(number = 32:126, char = ascii_chars, coded)
    #>    number char coded
    #> 1      32           
    #> 2      33    !     S
    #> 3      34    "     n
    #> 4      35    #     W
    #> 5      36    $     k
    #> 6      37    %     d
    #> 7      38    &     &
    #> 8      39    '     _
    #> 9      40    (     }
    #> 10     41    )     .
    #> 11     42    *     '
    #> 12     43    +     |
    #> 13     44    ,     e
    #> 14     45    -     !
    #> 15     46    .     ~
    #> 16     47    /     @
    #> 17     48    0     a
    #> 18     49    1     =
    #> 19     50    2     ,
    #> 20     51    3    \\
    #> 21     52    4     L
    #> 22     53    5     f
    #> 23     54    6     x
    #> 24     55    7     6
    #> 25     56    8     s
    #> 26     57    9     +
    #> 27     58    :     4
    #> 28     59    ;     ^
    #> 29     60    <     #
    #> 30     61    =     w
    #> 31     62    >     B
    #> 32     63    ?     (
    #> 33     64    @     M
    #> 34     65    A     J
    #> 35     66    B     G
    #> 36     67    C     A
    #> 37     68    D     ?
    #> 38     69    E     3
    #> 39     70    F     X
    #> 40     71    G     r
    #> 41     72    H     Z
    #> 42     73    I     U
    #> 43     74    J     R
    #> 44     75    K     I
    #> 45     76    L     T
    #> 46     77    M     c
    #> 47     78    N     p
    #> 48     79    O     *
    #> 49     80    P     l
    #> 50     81    Q     C
    #> 51     82    R     :
    #> 52     83    S     g
    #> 53     84    T     i
    #> 54     85    U     0
    #> 55     86    V     7
    #> 56     87    W     [
    #> 57     88    X     %
    #> 58     89    Y     z
    #> 59     90    Z     "
    #> 60     91    [     8
    #> 61     92   \\     D
    #> 62     93    ]     1
    #> 63     94    ^     m
    #> 64     95    _     /
    #> 65     96    `     2
    #> 66     97    a     K
    #> 67     98    b     ]
    #> 68     99    c     -
    #> 69    100    d     v
    #> 70    101    e     y
    #> 71    102    f     Y
    #> 72    103    g     P
    #> 73    104    h     q
    #> 74    105    i     `
    #> 75    106    j     h
    #> 76    107    k     Q
    #> 77    108    l     9
    #> 78    109    m     O
    #> 79    110    n     ;
    #> 80    111    o     u
    #> 81    112    p     )
    #> 82    113    q     j
    #> 83    114    r     {
    #> 84    115    s     V
    #> 85    116    t     H
    #> 86    117    u     $
    #> 87    118    v     o
    #> 88    119    w     5
    #> 89    120    x     F
    #> 90    121    y     >
    #> 91    122    z     N
    #> 92    123    {     b
    #> 93    124    |     E
    #> 94    125    }     t
    #> 95    126    ~     <
    

    Created on 2022-11-13 with reprex v2.0.2

    3. Coding the data

    test_with_spaces <- "hello world!"
    (e <- encrypt(test_with_spaces, coded, ascii_chars))
    #> [1] "qy99u 5u{9vS"
    encrypt(e, ascii_chars, coded)
    #> [1] "hello world!"
    
    my_data[-1] <- lapply(my_data[-1], encrypt, key = coded, alphabet = ascii_chars)
    my_data
    #>   id names country
    #> 1  1  huq;  AK;KvK
    #> 2  2   H`O 3;P9K;v
    #> 3  3  K9yF  cyF`-u
    
    my_data[-1] <- lapply(my_data[-1], encrypt, key = ascii_chars, alphabet = coded)
    my_data
    #>   id names country
    #> 1  1  john  Canada
    #> 2  2   tim England
    #> 3  3  alex  Mexico
    

    Created on 2022-11-13 with reprex v2.0.2

    4. An alternative key

    An alternative to enumerating the entire key table is to have a number (a key) set the seed of the pseudo-RNG. After set.seed(key) the function would then generate the vector coded from alphabet. To adapt encrypt to use this variable is straightforward.