
R: Color scatterplot using a variable from a different data frame


I have two dataframes:

farm_production <- data.frame (
  year = c(seq(1980,2000)),
  "n11" = c(seq(80,200,length.out=21)),
  "n26" = c(seq(110,180,length.out=21)),
  "n31" = c(seq(150,56,length.out=21)),
  "n48" = c(seq(200,160,length.out=21)),
  "n59" = c(seq(198,170,length.out=21)))

farm_info <- data.frame (
  ID = c("n11", "n26", "n31", "n48", "n59"),
  type = c("wheat", "wheat", "cereal", "hay", "hay"),
  country = c("Spain", "Greece", "Italy", "Spain", "Portugal"))

The two dataframes are linked by farm ID: the column names of farm_production (n11, n26, n31, n48, n59) match the values in farm_info$ID.

I plotted the production of these 5 farms over the years:

plot(farm_production$year, farm_production$n11, xlab = "Year", ylab = "Forage production (tons)", ylim = c(0, 200))
points(farm_production$year, farm_production$n26)
points(farm_production$year, farm_production$n31)
points(farm_production$year, farm_production$n48)
points(farm_production$year, farm_production$n59)

However, I want to color these points by "type" (3 levels: wheat, cereal, hay), and this info is in the "farm_info" dataframe. How can I relate the info of one dataframe to another?

I am aware that I can probably do this manually, but this is just a small sample of a much larger dataframe with more than 100 rows and columns. I am therefore interested in a way to "automate" the process by relating the info in dataframe 1 (farm_production) to dataframe 2 (farm_info) to color the points by "type".

Any suggestions on how I can do this? Any help is greatly appreciated.


Solution

  • Having the data in this "wide" format will make plotting difficult.

    I would start by transforming your farm_production dataframe to a tidy format and then join your farm_info data to create a single dataframe from which to plot.

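    During the data preparation, I would convert your type variable to a factor. When a factor is passed as col, plot() uses its integer codes (1, 2, 3, ...) as indices into the current palette(), so each level gets its own color without any manual mapping.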

    Optionally, you might consider adding a legend.

    farm_production <- data.frame (
      year = c(seq(1980,2000)),
      "n11" = c(seq(80,200,length.out=21)),
      "n26" = c(seq(110,180,length.out=21)),
      "n31" = c(seq(150,56,length.out=21)),
      "n48" = c(seq(200,160,length.out=21)),
      "n59" = c(seq(198,170,length.out=21)))
    
    farm_info <- data.frame (
      ID = c("n11", "n26", "n31", "n48", "n59"),
      type = c("wheat", "wheat", "cereal", "hay", "hay"),
      country = c("Spain", "Greece", "Italy", "Spain", "Portugal"))
    
    data <- merge(
      reshape(
        farm_production,
        varying = names(farm_production)[-1],
        v.names = "production",
        timevar = "farm",
        times = names(farm_production)[-1],
        direction = "long",
        sep = ""
      ),
      farm_info,
      by.x = "farm", 
      by.y = "ID"
    )
    data$id <- NULL
    data$type <- factor(data$type)
    
    plot(
      data$year, 
      data$production, 
      xlab = "Year", 
      ylab = "Forage production (tons)", 
      ylim = c(0, 200),
      col = data$type # R will automatically choose colors for factors
    )
    
    legend(
      x ="topleft",
      legend = levels(data$type), # labels for factor levels
      col = seq_along(levels(data$type)), # integer codes of the factor levels, same colors as in the plot
      pch = 19,  # point symbol (filled circle); the plot above uses the default open circle (pch = 1)
      cex = .7   # optionally change overall size of legend
    )
    

    Created on 2024-02-27 with reprex v2.1.0
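
If reshaping is not wanted, the lookup between the two dataframes can also be done directly on the wide data. The following is a minimal sketch, not part of the original answer: it uses match() to find the type for each farm column and matplot() to draw all columns at once. The names farm_cols, farm_type, type_colors and point_cols, as well as the specific color choices, are assumptions made for illustration.

```r
# Same sample data as in the question.
farm_production <- data.frame(
  year = 1980:2000,
  n11 = seq(80, 200, length.out = 21),
  n26 = seq(110, 180, length.out = 21),
  n31 = seq(150, 56, length.out = 21),
  n48 = seq(200, 160, length.out = 21),
  n59 = seq(198, 170, length.out = 21)
)

farm_info <- data.frame(
  ID = c("n11", "n26", "n31", "n48", "n59"),
  type = c("wheat", "wheat", "cereal", "hay", "hay"),
  country = c("Spain", "Greece", "Italy", "Spain", "Portugal"),
  stringsAsFactors = FALSE
)

# Every column except year is a farm.
farm_cols <- names(farm_production)[-1]

# Look up each farm's type, in the same order as the columns.
farm_type <- farm_info$type[match(farm_cols, farm_info$ID)]

# Illustrative color choices (assumed); any valid R colors would work.
type_colors <- c(wheat = "goldenrod", cereal = "steelblue", hay = "darkgreen")

# One color per farm column, chosen by that farm's type.
point_cols <- unname(type_colors[farm_type])

# matplot() plots each column of the matrix against year,
# taking one entry of col per column.
matplot(
  farm_production$year,
  as.matrix(farm_production[farm_cols]),
  type = "p",
  pch = 19,
  col = point_cols,
  xlab = "Year",
  ylab = "Forage production (tons)",
  ylim = c(0, 200)
)

legend(
  "bottomleft",
  legend = names(type_colors),
  col = type_colors,
  pch = 19,
  cex = 0.7
)
```

A named color vector keeps the type-to-color mapping explicit, so the legend and the points cannot drift apart. A farm column with no matching ID in farm_info would get NA as its color, which is worth checking with anyNA(point_cols) on a larger dataset.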