I have this figure at this URL: Who Is Granted Asylum in the United States?
My goal is to extract the data (as a dataframe) from this figure.
I am not sure how to do that. Here is my attempt:
library(rvest)
html_url<-read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
html_url %>% html_elements(xpath = "//*[contains(@class, 'image')]")
but I am getting nowhere.
As @Wimpel already said, extracting solid data from an image, or even just the text in it, is very difficult. In addition, how should the code know which kind of chart the figure represents? Sure, there are digitizing tools for scatter or point-based charts, like digitize. But in general, it's better to mine the underlying data directly.
Still, I built this code for your specific example.
library(tesseract)
library(rvest)
library(dplyr)
library(magick)
# Read the webpage and list candidate image URLs
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
image_url <- html_url %>% html_elements("img") %>% html_attr("src")
graphics <- image_url[grepl("Infographic", image_url)]
# Download the image (a JPEG; the URL is hardcoded here, compare it with `graphics`)
download.file("https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg", destfile = "chart_image.jpeg", mode = "wb")
# Load and preprocess image
img <- image_read("chart_image.jpeg") %>%
  image_resize("800x800") %>%
  image_convert(colorspace = "gray")
# Save processed image as PNG and apply OCR
image_write(img, "processed_image.png", format = "png")
text <- tesseract::ocr("processed_image.png")
text_to_asylum_df <- function(text) {
  # Split text into lines
  lines <- strsplit(text, "\n")[[1]]
  # Keep only lines that contain at least one digit
  data_lines <- lines[grepl("[0-9]", lines)]
  # Extract country and number using regex
  asylum_data <- lapply(data_lines, function(line) {
    # Extract country (letters and spaces at start of line)
    country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
    country <- trimws(country)
    # Extract number (digits, possibly with comma or period as separator)
    number <- gsub("[^0-9,.]", "", line)
    number <- gsub(",", "", number)
    number <- gsub("\\.", "", number)
    number <- as.numeric(number)
    # c() coerces both values to character; granted is converted back below
    return(c(country = country, granted = number))
  })
  # Convert to dataframe
  df <- as.data.frame(do.call(rbind, asylum_data))
  # Convert granted column to numeric
  df$granted <- as.numeric(as.character(df$granted))
  # Add year as attribute
  attr(df, "year") <- 2022
  return(df)
}
# Create the dataframe
asylum_df <- text_to_asylum_df(text)
# View the result
print(asylum_df)
As you can see, China and Venezuela are not even recognized by tesseract, and some rows contain OCR noise (for example "El Salvador S TS").
Output:
> print(asylum_df)
           country granted
1  asylum in the U    2022
2 El Salvador S TS    2639
3        Guatemala    2329
4            india   22203
5         Honduras    1829
6      Afghanistan    1493
7           turkey    1228
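Part of the noise above comes from the header line being treated as data and from taking everything up to the first non-letter as the country name. As a minimal sketch, assuming each useful OCR line consists of a name followed by a number at the very end, one could match the whole line against a stricter pattern and drop anything that does not fit. The helper name `parse_ocr_lines` and the sample lines are hypothetical, not actual tesseract output.

```r
parse_ocr_lines <- function(lines) {
  # A name made of letters and spaces, then whitespace, then a number with
  # optional thousands separators, and nothing else on the line
  pattern <- "^\\s*([A-Za-z][A-Za-z ]*?)\\s+([0-9][0-9.,]*)\\s*$"
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  # Keep only full matches (whole match + two capture groups)
  hits <- hits[lengths(hits) == 3]
  data.frame(
    country = vapply(hits, function(h) trimws(h[2]), character(1)),
    granted = vapply(hits, function(h) as.numeric(gsub("[.,]", "", h[3])), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical OCR lines, for illustration only
sample_lines <- c("Who is granted asylum in the U.S.? (2022)",
                  "El Salvador 2,639",
                  "Guatemala 2.329",
                  "")
parsed <- parse_ocr_lines(sample_lines)
print(parsed)
```

With the real OCR text this would be called as `parse_ocr_lines(strsplit(text, "\n")[[1]])`. It cannot recover rows that tesseract never recognized, so it only tidies what is there.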
Or, for a more solid approach, we can use Google's Gemini via its API in R. Please follow these steps:
Click the "Create API Key" button. Copy and save your API key for future reference.
Please note that the Gemini API is currently available for free. In the future, there may be a cost involved in using it; check the pricing page for details.
We will then hand the image over to gemini-1.5-flash-latest and ask it to analyze the chart and give us only comma-separated data. If no API key is set, there will be a prompt asking for it; please paste it into the console. We then read the output with read.csv(textConnection(image_content_csv)) and voilà, there is our dataframe!
To install the required libraries, you can use the following code in R:
install.packages("httr")
install.packages("jsonlite")
install.packages("base64enc")
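Reading the reply with read.csv() assumes the model returns bare CSV. Models sometimes wrap tabular answers in a Markdown code fence, which would then be parsed as data. Whether Gemini does so for this prompt is an assumption; the sketch below shows one defensive way to handle it in base R, and `parse_csv_reply` is a hypothetical helper name.

```r
parse_csv_reply <- function(reply) {
  # Collapse multiple text parts into one string, then split into lines
  lines <- unlist(strsplit(paste(reply, collapse = "\n"), "\r?\n"))
  # Drop fence lines (three backticks, optionally followed by a language tag)
  is_fence <- grepl("^\\s*`{3}", lines)
  # Drop blank lines as well
  is_blank <- !nzchar(trimws(lines))
  lines <- lines[!is_fence & !is_blank]
  read.csv(text = paste(lines, collapse = "\n"),
           strip.white = TRUE, stringsAsFactors = FALSE)
}

# Hypothetical reply wrapped in a code fence, for illustration only
fence <- strrep("`", 3)
example_reply <- paste0(fence, "csv\nNationality,Count\nChina,4589\nVenezuela,3691\n", fence)
print(parse_csv_reply(example_reply))
```

In the workflow described here, this would replace the read.csv(textConnection(...)) call, for example `df_ai_response <- parse_csv_reply(image_content_csv)`.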
Then use the following code to analyze your chart image:
# Load necessary libraries
library(httr)
library(base64enc)
library(jsonlite)
# Read the webpage and find your image as before
figure_url <- "https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg"
# Function: send a prompt and an image to the Gemini API, return the text reply
gemini_vision <- function(prompt,
                          image,
                          temperature = 0.5,
                          max_output_tokens = 4096,
                          api_key = Sys.getenv("GEMINI_API_KEY"),
                          model = "gemini-1.5-flash-latest") {
  # Ask for the API key interactively if it is not set in the environment
  if (nchar(api_key) < 1) {
    api_key <- readline("Paste your API key here: ")
    Sys.setenv(GEMINI_API_KEY = api_key)
  }
  model_query <- paste0(model, ":generateContent")
  response <- POST(
    url = paste0("https://generativelanguage.googleapis.com/v1beta/models/", model_query),
    query = list(key = api_key),
    content_type_json(),
    encode = "json",
    body = list(
      contents = list(
        parts = list(
          list(
            text = prompt
          ),
          list(
            inlineData = list(
              mimeType = "image/jpeg",  # the chart image is a JPEG
              data = base64encode(image)
            )
          )
        )
      ),
      generationConfig = list(
        temperature = temperature,
        maxOutputTokens = max_output_tokens
      )
    )
  )
  # Stop with the API's error message on any non-200 status
  if (response$status_code > 200) {
    stop(paste("Error - ", content(response)$error$message))
  }
  # Collect the text parts of all returned candidates
  candidates <- content(response)$candidates
  outputs <- unlist(lapply(candidates, function(candidate) candidate$content$parts))
  return(outputs)
}
image_content_csv <- gemini_vision(prompt = "Can you analyze this chart and print out only a comma separated table of the data with headers, nothing else. Thanks!",
                                   image = figure_url)
df_ai_response <- read.csv(textConnection(image_content_csv))
Which finally gives us:
> df_ai_response
  Nationality Count
1       China  4589
2   Venezuela  3691
3 El Salvador  2639
4   Guatemala  2329
5       India  2203
6    Honduras  1829
7 Afghanistan  1493
8      Turkey  1228
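Since a language model can misread values just as OCR can, it may be worth cross-checking the two results against each other (and against the chart itself). The sketch below does not call tesseract or the API. It re-enters the two outputs printed above by hand and joins them on a lower-cased country name; `compare_counts` is a hypothetical helper.

```r
# OCR result, copied from the tesseract output above
ocr <- data.frame(
  country = c("asylum in the U", "El Salvador S TS", "Guatemala", "india",
              "Honduras", "Afghanistan", "turkey"),
  granted = c(2022, 2639, 2329, 22203, 1829, 1493, 1228),
  stringsAsFactors = FALSE
)
# Gemini result, copied from the output above
ai <- data.frame(
  Nationality = c("China", "Venezuela", "El Salvador", "Guatemala", "India",
                  "Honduras", "Afghanistan", "Turkey"),
  Count = c(4589, 3691, 2639, 2329, 2203, 1829, 1493, 1228),
  stringsAsFactors = FALSE
)

compare_counts <- function(ocr, ai) {
  # Normalise names so that "india" and "India" end up with the same key
  ocr$key <- tolower(trimws(ocr$country))
  ai$key <- tolower(trimws(ai$Nationality))
  # Left join: keep every Gemini row, attach the OCR value where a name matches
  merged <- merge(ai, ocr[, c("key", "granted")], by = "key", all.x = TRUE)
  # A row agrees only if OCR found the country and read the same number
  merged$agrees <- !is.na(merged$granted) & merged$granted == merged$Count
  merged[order(-merged$Count), c("Nationality", "Count", "granted", "agrees")]
}

print(compare_counts(ocr, ai))
```

With these rows, only Guatemala, Honduras, Afghanistan and Turkey agree. The remaining rows (missing from the OCR result, noisy names, or differing numbers) are the ones to check against the chart by eye.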