Extracting data from a chart on a web page


I have a figure at this URL: "Who Is Granted Asylum in the United States?"

My goal is to extract the data (as a dataframe) from this figure.

I am not sure how to do that. Here is my attempt:

  library(rvest)
  html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
  html_url %>% html_elements(xpath = "//*[contains(@class, 'image')]")

but I am getting nowhere.


Solution

  • Extracting reliable data or text from an image is difficult, as @Wimpel already said. In addition, how should the code know which kind of chart the figure represents? There are digitization tools for scatter or point-based charts, such as digitize, but in general it is better to mine the underlying data directly. Still, I built this code for your specific example, using OCR via tesseract.

    library(rvest)      # read_html(), html_elements(), html_attr(), %>%
    library(magick)     # image_read(), image_resize(), image_convert()
    library(tesseract)  # ocr()

    # Read the webpage
    html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
    
    # Collect the src attribute of every <img> element
    image_url <- html_url %>% html_elements("img") %>% html_attr("src")
    
    # Keep only the infographic images; inspect `graphics` to find the chart URL
    graphics <- image_url[grepl("Infographic", image_url)]

    # Download the image (it is a JPEG, so keep the .jpeg extension)
    download.file("https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg", destfile = "chart_image.jpeg", mode = "wb")
    
    # Load and preprocess image
    img <- image_read("chart_image.jpeg") %>%
      image_resize("800x800") %>%
      image_convert(colorspace = "gray")
    
    # Save processed image and apply OCR
    image_write(img, "processed_image.png")
    text <- tesseract::ocr("processed_image.png")
    
    text_to_asylum_df <- function(text) {
      # Split text into lines
      lines <- strsplit(text, "\n")[[1]]
      
      # Filter out empty lines and header/footer
      data_lines <- lines[grepl("[0-9]", lines)]
      
      # Extract country and number using regex
      asylum_data <- lapply(data_lines, function(line) {
        # Extract country (word characters at start of line)
        country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
        country <- trimws(country)
        
        # Extract number (digits, possibly with comma or period)
        number <- gsub("[^0-9,.]", "", line)
        number <- gsub(",", "", number)
        number <- gsub("\\.", "", number)
        number <- as.numeric(number)
        
        return(c(country = country, granted = number))
      })
      
      # Convert to dataframe
      df <- as.data.frame(do.call(rbind, asylum_data))
      
      # Convert granted column to numeric
      df$granted <- as.numeric(as.character(df$granted))
      
      # Add year as attribute
      attr(df, "year") <- 2022
      
      return(df)
    }
    
    # Create the dataframe
    asylum_df <- text_to_asylum_df(text)
    
    # View the result
    print(asylum_df)
    

    As the output shows, tesseract does not recognize China and Venezuela at all, a heading line is parsed as a data row, and India's value is misread as 22203 instead of 2203.

    Output:

    > print(asylum_df)
               country granted
    1  asylum in the U    2022
    2 El Salvador S TS    2639
    3        Guatemala    2329
    4            india   22203
    5         Honduras    1829
    6      Afghanistan    1493
    7           turkey    1228
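
Part of the noise in this output comes from the parsing step rather than from tesseract: any line containing a digit is treated as a data row, which is probably how a heading line ends up in row 1. A minimal sketch of a stricter parser follows, assuming each data row has the form "country, whitespace, number". The function name parse_ocr_lines and the sample text are hypothetical, and the sketch cannot recover labels that OCR misses or misreads.

```r
# Hypothetical helper: keep only OCR lines of the form "<country> <number>".
# Assumes the country contains letters and spaces only and the number ends the line.
parse_ocr_lines <- function(text) {
  lines <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])
  # Group 1: country name; group 2: digits with optional thousands separators
  pattern <- "^([A-Za-z][A-Za-z ]*?)\\s+([0-9][0-9,.]*)$"
  matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  # Lines that do not match yield character(0) and are dropped here
  matches <- Filter(function(m) length(m) == 3, matches)
  country <- vapply(matches, function(m) m[2], character(1))
  number  <- vapply(matches, function(m) m[3], character(1))
  data.frame(
    country = country,
    granted = as.numeric(gsub("[,.]", "", number)),
    stringsAsFactors = FALSE
  )
}

# Made-up sample text for illustration, not actual tesseract output
sample_text <- "Asylum grants in the U.S. in 2022*\nGuatemala 2,329\nEl Salvador 2,639\n\nSource: example"
parse_ocr_lines(sample_text)
```

A heading that happens to consist of letters followed by a number (for example a title ending in a year) would still pass this filter, so the result still needs a manual check.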
    

    For a more robust approach, we can use Google's Gemini through its API from R. Follow these steps:

    1. Get an API key. You can access the Gemini API through Google AI Studio. Once you have access, create an API key by clicking the Create API Key button. Copy and save the key for future reference.
    2. Install the required libraries. Before using the Gemini model in R, install httr, jsonlite and base64enc. httr posts the request to the Gemini API and fetches the response, jsonlite converts R objects to JSON, and base64enc encodes the image as base64 for the request body.

    Please note that the Gemini API was available for free at the time of writing. In the future, there may be a cost involved in using it; check the pricing page.

    You can install these libraries with the code below. When you run the analysis code and the GEMINI_API_KEY environment variable is not set, a prompt asks for your API key; paste it into the console. The image is then handed to gemini-1.5-flash-latest, which is asked to analyze the chart and return only comma-separated data. Reading that output with read.csv(textConnection(image_content_csv)) gives us our data frame.

    install.packages("httr")
    install.packages("jsonlite")
    install.packages("base64enc")
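
The console prompt for the key only helps in an interactive session; in a non-interactive script readline() returns an empty string. A small sketch of an alternative, assuming you store the key in the GEMINI_API_KEY environment variable (for example in ~/.Renviron) before running the analysis. The helper name get_gemini_key is hypothetical.

```r
# Hypothetical helper: read the API key from an environment variable and
# fail early with a clear message instead of prompting for it.
get_gemini_key <- function(var = "GEMINI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var,
         " is not set; add it to ~/.Renviron or call Sys.setenv() first.")
  }
  key
}

# Usage with a placeholder value (not a real key)
Sys.setenv(GEMINI_API_KEY = "placeholder-key")
get_gemini_key()
```

The returned value could be passed as the api_key argument of the function defined in the next code block.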
    

    Then use the following code to analyze your chart image:

    # Load necessary libraries
    library(httr)
    library(base64enc)
    library(jsonlite)
    
    # Read the webpage and find your image as before
    
    figure_url <- "https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg"
    
    # Send a prompt and an image (file path or URL) to the Gemini API
    # and return the text parts of the reply
    gemini_vision <- function(prompt, 
                              image,
                              temperature=0.5,
                              max_output_tokens=4096,
                              api_key=Sys.getenv("GEMINI_API_KEY"),
                              model = "gemini-1.5-flash-latest") {
      
      if(nchar(api_key)<1) {
        api_key <- readline("Paste your API key here: ")
        Sys.setenv(GEMINI_API_KEY = api_key)
      }
      
      model_query <- paste0(model, ":generateContent")
      
      response <- POST(
        url = paste0("https://generativelanguage.googleapis.com/v1beta/models/", model_query),
        query = list(key = api_key),
        content_type_json(),
        encode = "json",
        body = list(
          contents = list(
            parts = list(
              list(
                text = prompt
              ),
              list(
                inlineData = list(
                  mimeType = "image/jpeg",  # the chart image is a JPEG
                  data = base64encode(image)
                )
              )
            )
          ),
          generationConfig = list(
            temperature = temperature,
            maxOutputTokens = max_output_tokens
          )
        )
      )
      
      if(response$status_code>200) {
        stop(paste("Error - ", content(response)$error$message))
      }
      
      candidates <- content(response)$candidates
      outputs <- unlist(lapply(candidates, function(candidate) candidate$content$parts))
      
      return(outputs)
      
    }
    
    image_content_csv <- gemini_vision(prompt = "Can you analyze this chart and print out only a comma separated table of the data with headers, nothing else. Thanks!", 
                  image = figure_url)
    
    df_ai_response <- read.csv(textConnection(image_content_csv))
    

    Which finally gives us:

    > df_ai_response
      Nationality Count
    1       China  4589
    2   Venezuela  3691
    3 El Salvador  2639
    4   Guatemala  2329
    5       India  2203
    6    Honduras  1829
    7 Afghanistan  1493
    8      Turkey  1228
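
The reply above was clean CSV, but a language model may wrap its answer in Markdown code fences or add blank lines, which would break read.csv. The following is a defensive sketch under that assumption, not part of the original answer; the helper name csv_reply_to_df and the sample reply are hypothetical.

```r
# Hypothetical helper: convert a model reply into a data frame, ignoring
# Markdown code fences and blank lines around the CSV.
csv_reply_to_df <- function(reply) {
  fence <- strrep("`", 3)  # three backticks, the Markdown fence marker
  # The reply may arrive as several text parts; join and split into lines
  lines <- unlist(strsplit(paste(reply, collapse = "\n"), "\n", fixed = TRUE))
  lines <- trimws(lines)
  lines <- lines[nzchar(lines) & !startsWith(lines, fence)]
  read.csv(text = paste(lines, collapse = "\n"), strip.white = TRUE,
           stringsAsFactors = FALSE)
}

# Made-up reply for illustration: CSV wrapped in a fenced block
fence <- strrep("`", 3)
reply <- paste0(fence, "csv\nNationality,Count\nChina,4589\nVenezuela,3691\n", fence)
csv_reply_to_df(reply)
```

Whatever the model returns, the numbers should still be compared against the chart by eye, since nothing here verifies that they were read correctly.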