r, tidyverse, tesseract

Convert an image into a data frame in R


I want to convert an image into a data frame. I am able to read the image and turn it into a character object.

library(tesseract)
eng1 <- tesseract("eng")
text1 <- tesseract::ocr("image.png", engine = eng1)
cat(text1)

class(text1) #character

head(text1)
[1] "* teamiD ~ yearlD ~~ NumsP_ TGS oplayerID Gx GS Wx > Lx ERAX\n\n1 ANA 1997 11162 finlecnot 12 23 ca 8 452\n2 ANA 1997 11162 langsma0t 162 2 ae 8 452\n3 ANA 1997 11162 perismant 162 8 ae 8 452\n4 ANA 1997 11162 cieksja01 162 32 ae 8 452\n5 ANA 1997 11162 hasegshot 162 7 ae 8 452\n6 ANA 1997 11162 sprindeot 162 8 ae 8 452\n7 ANA 1997 11162, mayda02 162 2 ae 8 452\n8 ANA 1997 1162 ilkeot 12 2 ae 8 452\n9 ANA 1997 11162 grosskeot 162 3 ae 8 452\n10 ANA 1997 11162 gubiemao1 162 2 ae 8 452\n11 ANA 1997 11162 watsoalot 12 Be ae 8 452\n12 ANA 1998 2162 sparksto 162 2 8 7 449\n13, ANA 1998 9 162 cicksja0t 162 18 8 7 449\n14. ANA 1998 9162 alivaomot 12 26 85 7 449\n15. ANA 1998 9162 finlecnot 12 Be 85 7 449\n16 ANA 1998 9162 washbja0t 162 1\" 8 7 449\n47 ANA 1998 9162 watsoalot 12 4 85 7 449\n18 ANA 1998 9162 medowja0t 162 14 8 7 449\n"

[Image: data frame]

I need the data shown in this image as a data frame.


Solution

  • Your OCR output is not a faithful representation of your image, so you risk garbage in, garbage out. Still, text1 can be cleaned step by step until it fits into a data.frame:

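    # Strip all punctuation (stray commas, dots, quotes, '~', '>').
    text1_a <- gsub('[[:punct:]]', '', text1)
    # Insert a space after fused NumSP digits, e.g. "11162" -> "11 162".
    text1_b <- gsub('(11)', '\\1 ', text1_a)
    text1_c <- gsub('(91)', '\\1 ', text1_b)
    text1_d <- gsub('(21)', '\\1 ', text1_c)
    # The header line has one field fewer than the data rows, so read.table
    # treats it as a header and uses the first column as row names.
    read.table(text = text1_d, col.names = c('teamID', 'yearID', 'NumSP', 'TGS', 'playerID', 'G.x', 'GS', 'W.x', 'L.x', 'ERA.x'))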
       teamID yearID NumSP TGS  playerID G.x GS W.x L.x ERA.x
    1     ANA   1997    11 162 finlecnot  12 23  ca   8   452
    2     ANA   1997    11 162 langsma0t 162  2  ae   8   452
    3     ANA   1997    11 162 perismant 162  8  ae   8   452
    4     ANA   1997    11 162 cieksja01 162 32  ae   8   452
    5     ANA   1997    11 162 hasegshot 162  7  ae   8   452
    6     ANA   1997    11 162 sprindeot 162  8  ae   8   452
    7     ANA   1997    11 162   mayda02 162  2  ae   8   452
    8     ANA   1997    11  62    ilkeot  12  2  ae   8   452
    9     ANA   1997    11 162 grosskeot 162  3  ae   8   452
    10    ANA   1997    11 162 gubiemao1 162  2  ae   8   452
    11    ANA   1997    11 162 watsoalot  12 Be  ae   8   452
    12    ANA   1998    21  62  sparksto 162  2   8   7   449
    13    ANA   1998     9 162 cicksja0t 162 18   8   7   449
    14    ANA   1998    91  62 alivaomot  12 26  85   7   449
    15    ANA   1998    91  62 finlecnot  12 Be  85   7   449
    16    ANA   1998    91  62 washbja0t 162  1   8   7   449
    47    ANA   1998    91  62 watsoalot  12  4  85   7   449
    18    ANA   1998    91  62 medowja0t 162 14   8   7   449
    

    The main challenge here is to arrive at a consistent number of columns per row. Much of the OCR garbage remains, with a little more added in by the regex splitting (for example "11 62" and "91 62"), but the result is a data.frame.
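
    To see which rows break the column count before calling read.table, something like the following may help. It is only a sketch: the three sample lines are copied from the OCR output in the question with punctuation already stripped, and the names sample_text, ocr_lines, n_fields, expected and bad_rows are made up for illustration.

    ```r
    # Three rows taken from the question's OCR output (punctuation removed).
    sample_text <- paste0(
      "1 ANA 1997 11162 finlecnot 12 23 ca 8 452\n",
      "12 ANA 1998 2162 sparksto 162 2 8 7 449\n",
      "13 ANA 1998 9 162 cicksja0t 162 18 8 7 449\n"
    )

    # Split the single OCR string into one element per line.
    ocr_lines <- strsplit(sample_text, "\n", fixed = TRUE)[[1]]

    # count.fields() returns the number of whitespace-separated fields per line.
    # Quoting and comment handling are switched off so stray OCR characters
    # such as '"' or '#' are counted as ordinary text.
    n_fields <- count.fields(textConnection(ocr_lines), quote = "", comment.char = "")

    # Assumption: a good row has a row number plus the 10 data columns.
    expected <- 11
    bad_rows <- which(n_fields != expected)

    print(n_fields)
    print(bad_rows)
    ```

    Rows flagged in bad_rows are the ones where NumSP and TGS are still fused, which is what the gsub calls above are trying to split.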

    Using just the settings described in weird data, I get the following OCR improvements after cleaning the image with magick, here working from a download of your embedded image above:

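    library(magick)  # image_read() and friends; magick also re-exports %>%

    team_img <- image_read("team_data_WrCDj.png")

    # Upscale, convert to grayscale, trim the border, then hand the PNG to tesseract.
    team_mgk <- team_img %>%
      image_resize('2000x') %>%
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr()
    cat(team_mgk)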
    ~ teamID yearlD NumSP TGS playerID G.x GS W.x Lx ERA.x
    
    1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52
    2 ANA 1997 11 162 langsma01 162 9 fond 78 4.52
    3 ANA 1997 11 162 perisma0t 162 8 & 78 4.52
    4 ANA 1997 11 162 dicksja01 162 32 & 78 4.52
    5 ANA 1997 1 162 hasegsh01 162 7 & 78 4.52
    6 ANA 1997 11 162 sprinded1 162 28 & 78 4.52
    7 ANA 1997 11 162 maydad2 162 2 & 78 4.52
    8 ANA 1997 1 162 hillkeO1 162 12 & 78 4.52
    9 ANA 1997 11 162 grosske01 162 3 & 78 4.52
    10 ANA 1997 11 162 gubicmad1 162 2 4 78 4.52
    11. ANA 1997 11 162 watsoal01 162 is & 78 4.52
    12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49
    13, ANA 1998 9 162 dicksja01 162 18 65 7 4.49
    14 ANA 1998 9 162 olivaom01 162 26 65 7 4.49
    15 ANA 1998 9 162 finlechO1 162 34 65 7 4.49
    16 ANA 1998 9 162 washbja01 162 11 65 7 4.49
    17 ANA 1998 9 162 watsoal01 162 14 65 7 4.49
    18 ANA 1998 9 162 mcdowja01 162 14 65 77 4.49
    

    So there are still some problems in the GS, W.x and L.x columns, but things are improving. I think you'll become much more of an expert at magick through this process. And with a better image, none of the regex cleanup above is necessary.
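
    Once every row has the same number of fields, the remaining bad cells can be flagged rather than fixed by hand. A rough sketch, assuming four rows copied from the improved output above; the names improved, cols, num_cols and needs_review are hypothetical, and which columns count as numeric is my reading of the table:

    ```r
    # Four rows copied from the improved OCR output above.
    improved <- c(
      "1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52",
      "3 ANA 1997 11 162 perisma0t 162 8 & 78 4.52",
      "10 ANA 1997 11 162 gubicmad1 162 2 4 78 4.52",
      "12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49"
    )

    cols <- c("row", "teamID", "yearID", "NumSP", "TGS", "playerID",
              "G.x", "GS", "W.x", "L.x", "ERA.x")

    # Read everything as character first so OCR junk cannot derail type guessing.
    df <- read.table(text = improved, col.names = cols, colClasses = "character",
                     quote = "", comment.char = "")

    # Assumption: every column except teamID and playerID should be numeric.
    num_cols <- setdiff(cols, c("teamID", "playerID"))
    df[num_cols] <- lapply(df[num_cols], function(x) suppressWarnings(as.numeric(x)))

    # Cells that failed to parse become NA; list those rows for manual review.
    needs_review <- df[!complete.cases(df[num_cols]), c("row", "playerID")]
    print(needs_review)
    ```

    This only catches cells that are not numbers at all. A misread that still parses as a number, such as the W.x value of 4 in row 10, passes the check and has to be compared against the image.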