Convert an image into a data frame in R

I want to convert an image into a data frame. I am able to read the image and get its text as a character object.

library(tesseract)
eng1 <- tesseract("eng")
text1 <- tesseract::ocr("image.png", engine = eng1)
cat(text1)

class(text1) #character

head(text1)
[1] "* teamiD ~ yearlD ~~ NumsP_ TGS oplayerID Gx GS Wx > Lx ERAX\n\n1 ANA 1997 11162 finlecnot 12 23 ca 8 452\n2 ANA 1997 11162 langsma0t 162 2 ae 8 452\n3 ANA 1997 11162 perismant 162 8 ae 8 452\n4 ANA 1997 11162 cieksja01 162 32 ae 8 452\n5 ANA 1997 11162 hasegshot 162 7 ae 8 452\n6 ANA 1997 11162 sprindeot 162 8 ae 8 452\n7 ANA 1997 11162, mayda02 162 2 ae 8 452\n8 ANA 1997 1162 ilkeot 12 2 ae 8 452\n9 ANA 1997 11162 grosskeot 162 3 ae 8 452\n10 ANA 1997 11162 gubiemao1 162 2 ae 8 452\n11 ANA 1997 11162 watsoalot 12 Be ae 8 452\n12 ANA 1998 2162 sparksto 162 2 8 7 449\n13, ANA 1998 9 162 cicksja0t 162 18 8 7 449\n14. ANA 1998 9162 alivaomot 12 26 85 7 449\n15. ANA 1998 9162 finlecnot 12 Be 85 7 449\n16 ANA 1998 9162 washbja0t 162 1\" 8 7 449\n47 ANA 1998 9162 watsoalot 12 4 85 7 449\n18 ANA 1998 9162 medowja0t 162 14 8 7 449\n"

I need the data shown in this image as a data frame.

Solution

Your OCR output is not a faithful representation of your image, so you risk garbage in, garbage out. Still, text1 can be cleaned step by step until it fits into a data.frame:

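# Strip the stray punctuation that the OCR scattered through the text
text1_a <- gsub('[[:punct:]]', '', text1)
# Insert a space after every "11", "91" and "21" to split fused NumSP/TGS
# values such as "11162"; this also hits any other occurrence of those digits
text1_b <- gsub('(11)', '\\1 ', text1_a)
text1_c <- gsub('(91)', '\\1 ', text1_b)
text1_d <- gsub('(21)', '\\1 ', text1_c)
read.table(text=text1_d, col.names=c('teamID', 'yearID', 'NumSP', 'TGS','playerID', 'G.x', 'GS', 'W.x', 'L.x', 'ERA.x'))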
   teamID yearID NumSP TGS  playerID G.x GS W.x L.x ERA.x
1     ANA   1997    11 162 finlecnot  12 23  ca   8   452
2     ANA   1997    11 162 langsma0t 162  2  ae   8   452
3     ANA   1997    11 162 perismant 162  8  ae   8   452
4     ANA   1997    11 162 cieksja01 162 32  ae   8   452
5     ANA   1997    11 162 hasegshot 162  7  ae   8   452
6     ANA   1997    11 162 sprindeot 162  8  ae   8   452
7     ANA   1997    11 162   mayda02 162  2  ae   8   452
8     ANA   1997    11  62    ilkeot  12  2  ae   8   452
9     ANA   1997    11 162 grosskeot 162  3  ae   8   452
10    ANA   1997    11 162 gubiemao1 162  2  ae   8   452
11    ANA   1997    11 162 watsoalot  12 Be  ae   8   452
12    ANA   1998    21  62  sparksto 162  2   8   7   449
13    ANA   1998     9 162 cicksja0t 162 18   8   7   449
14    ANA   1998    91  62 alivaomot  12 26  85   7   449
15    ANA   1998    91  62 finlecnot  12 Be  85   7   449
16    ANA   1998    91  62 washbja0t 162  1   8   7   449
47    ANA   1998    91  62 watsoalot  12  4  85   7   449
18    ANA   1998    91  62 medowja0t 162 14   8   7   449

The main challenge here is getting a consistent number of columns in every row. Much of the OCR garbage remains, and the regex adds a little more (for example, NumSP of 91 with TGS of 62 in several 1998 rows), but the result is a data.frame.
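One way to see which rows still break the column count before calling read.table is to count the whitespace-separated fields on each line. This is only a minimal sketch: the sample lines are adapted from the cleaned text above (the last one is deliberately left with NumSP and TGS fused), and the names ocr_lines, n_fields, expected and bad_rows are made up for illustration.

```r
# Illustrative lines adapted from the cleaned OCR text
ocr_lines <- c(
  "1 ANA 1997 11 162 finlecnot 12 23 ca 8 452",
  "12 ANA 1998 21 62 sparksto 162 2 8 7 449",
  "13 ANA 1998 9 162 cicksja0t 162 18 8 7 449",
  "16 ANA 1998 9162 washbja0t 162 1 8 7 449"  # NumSP and TGS still fused
)

# Split each line on runs of whitespace and count the pieces
n_fields <- lengths(strsplit(trimws(ocr_lines), "\\s+"))

# Assumed layout: one row number plus the 10 named columns
expected <- 11
bad_rows <- which(n_fields != expected)

# Show the lines that still need fixing
ocr_lines[bad_rows]
```

Any line reported here would either make read.table fail or shift values into the wrong columns, so it is worth fixing those lines first.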

Using just the settings described in weird data, and cleaning the image with magick first, I get the following improvement in the OCR. The image here is simply your embedded image above, downloaded:

library(magick)

team_img <- image_read("team_data_WrCDj.png")

team_mgk <- team_img %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()
cat(team_mgk)
~ teamID yearlD NumSP TGS playerID G.x GS W.x Lx ERA.x

1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52
2 ANA 1997 11 162 langsma01 162 9 fond 78 4.52
3 ANA 1997 11 162 perisma0t 162 8 & 78 4.52
4 ANA 1997 11 162 dicksja01 162 32 & 78 4.52
5 ANA 1997 1 162 hasegsh01 162 7 & 78 4.52
6 ANA 1997 11 162 sprinded1 162 28 & 78 4.52
7 ANA 1997 11 162 maydad2 162 2 & 78 4.52
8 ANA 1997 1 162 hillkeO1 162 12 & 78 4.52
9 ANA 1997 11 162 grosske01 162 3 & 78 4.52
10 ANA 1997 11 162 gubicmad1 162 2 4 78 4.52
11. ANA 1997 11 162 watsoal01 162 is & 78 4.52
12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49
13, ANA 1998 9 162 dicksja01 162 18 65 7 4.49
14 ANA 1998 9 162 olivaom01 162 26 65 7 4.49
15 ANA 1998 9 162 finlechO1 162 34 65 7 4.49
16 ANA 1998 9 162 washbja01 162 11 65 7 4.49
17 ANA 1998 9 162 watsoal01 162 14 65 7 4.49
18 ANA 1998 9 162 mcdowja01 162 14 65 77 4.49

So there are still some problems in GS, W.x and L.x, but things are improving. I think you'll become much more of an expert at magick through this process. And with a better image, none of the regex approach above is necessary.
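To find the cells that are still wrong, one option is to read every column as text, coerce the columns that should be numeric, and flag the rows where coercion fails. The following is a rough sketch under a few assumptions: the sample lines are copied from the improved OCR output with the stray comma after the row number removed, and the names ocr_rows, cols, num_cols and suspect are only illustrative.

```r
# A few lines from the improved OCR output (row-number punctuation removed)
ocr_rows <- c(
  "1 ANA 1997 11 162 finlechO1 162 25 eB 78 4.52",
  "3 ANA 1997 11 162 perisma0t 162 8 & 78 4.52",
  "12 ANA 1998 9 162 sparkst01 162 20 65 77 4.49",
  "13 ANA 1998 9 162 dicksja01 162 18 65 7 4.49"
)

cols <- c("row", "teamID", "yearID", "NumSP", "TGS", "playerID",
          "G.x", "GS", "W.x", "L.x", "ERA.x")

# Read everything as character so OCR junk does not change column types
df <- read.table(text = ocr_rows, col.names = cols,
                 colClasses = "character")

# Columns that should hold numbers; junk such as "eB" or "&" becomes NA
num_cols <- c("yearID", "NumSP", "TGS", "G.x", "GS", "W.x", "L.x", "ERA.x")
df[num_cols] <- lapply(df[num_cols],
                       function(x) suppressWarnings(as.numeric(x)))

# Rows with at least one value that did not parse as a number
suspect <- df[!complete.cases(df[num_cols]), ]
suspect
```

This only catches values that fail to parse. A misread that is still a valid number, such as 7 where the neighbouring rows show 77 in L.x, passes unnoticed and has to be checked against the image or with range checks.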