Tags: r, plot, range, correlation

Find correlation between two columns that contain range data


data

Beginner level question.

I have data like the image above. I want to find the correlation between Height and Longevity.

Smaller breeds of dogs tend to live longer than larger breeds. Is there a way to establish this correlation and show it in a plot (preferably with the dog breed names as well) in R?

The cor function gives an error because the height and longevity values are stored as ranges. I am not sure exactly how this can be done. Please help.

Thank you.

Code below to reproduce:

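# Opening call reconstructed (it was missing from the post); the name data matches the answer below
data <- as.data.frame(structure(
  list(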
    Breed = c(
      "Labrador Retriever",
      "German Shepherd",
      "Bulldog",
      "Poodle",
      "Beagle",
      "Chihuahua",
      "Boxer",
      "Golden Retriever",
      "Pug",
      "Rottweiler"
    ),
    Country.of.Origin = c(
      "Canada",
      "Germany",
      "England",
      "France",
      "England",
      "Mexico",
      "Germany",
      "Scotland",
      "China",
      "Germany"
    ),
    Fur.Color = c(
      "Yellow, Black, Chocolate",
      "Black, Tan",
      "White, Red",
      "White, Black, Brown, Apricot",
      "White, Tan, Red, Lemon",
      "Black, Brown, Tan, White",
      "Fawn, Brindle",
      "Golden",
      "Fawn, Black",
      "Black, Tan"
    ),
    Height..in. = c(
      "21-24",
      "22-26",
      "12-16",
      "10-15",
      "13-15",
      "6-9",
      "21-25",
      "21-24",
      "10-14",
      "22-27"
    ),
    Color.of.Eyes = c(
      "Brown",
      "Brown",
      "Brown",
      "Brown, Blue",
      "Brown",
      "Brown, Blue",
      "Brown",
      "Brown",
      "Brown",
      "Brown"
    ),
    Longevity..yrs. = c(
      "10-12",
      "7-10",
      "8-10",
      "12-15",
      "12-15",
      "12-20",
      "10-12",
      "10-12",
      "12-15",
      "8-10"
    ),
    Character.Traits = c(
      "Loyal, friendly, intelligent, energetic, good-natured",
      "Loyal, intelligent, protective, confident, trainable",
      "Loyal, calm, gentle, brave",
      "Intelligent, active, affectionate, hypoallergenic",
      "Curious, friendly, energetic, good-natured",
      "Loyal, energetic, confident, sensitive",
      "Loyal, energetic, intelligent, playful, protective",
      "Intelligent, friendly, kind, loyal, good-natured",
      "Loyal, playful, affectionate, social, charming",
      "Loyal, protective, confident, strong"
    ),
    common_problem1 = c(
      "hip dysplasia",
      "hip dysplasia",
      "skin allergies",
      "hip dysplasia",
      "ear infections",
      "dental problems",
      "hip dysplasia",
      "hip dysplasia",
      "eye problems",
      "hip dysplasia"
    ),
    common_problem2 = c(
      "obesity",
      "elbow dysplasia",
      "respiratory issues",
      "epilepsy",
      "hip dysplasia",
      "eye issues",
      "cancer",
      "cancer",
      "respiratory issues",
      "cancer"
    ),
    common_problem3 = c(
      "ear infections",
      "pancreatitis",
      "obesity",
      "bladder stones",
      "epilepsy",
      "respiratory issues",
      "heart conditions",
      "skin allergies",
      "obesity",
      "elbow dysplasia"
    )
  ),
  row.names = c(NA, 10L),
  class = "data.frame"
))

I tried cor(Height..in., Longevity..yrs.), but it gives me an error. I am not sure whether this is the right way to do it.


Solution

  • Two options come to mind, though neither is ideal. A correlation coefficient can only be computed on numeric data, and your columns hold character strings such as "21-24". As far as I know, there is no way to run a correlation directly on range data.
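
    To see why the original call fails, here is a minimal sketch using three of the range strings from the question (the vector names height and longevity are made up for this illustration). Note also that the bare column names in cor(Height..in., Longevity..yrs.) are not visible outside the data frame, so they would need to be written as data$Height..in. and data$Longevity..yrs. first.

```r
# Three of the range strings from the question, stored as character vectors
height    <- c("21-24", "22-26", "12-16")
longevity <- c("10-12", "7-10", "8-10")

# The columns hold text, not numbers
is.character(height)

# cor() stops with an error on character input;
# capture the error message instead of aborting the script
msg <- tryCatch(
  cor(height, longevity),
  error = function(e) conditionMessage(e)
)
print(msg)
```

    The captured message should say that the input must be numeric, which is why both options below first turn the ranges into numbers.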

    Option 1: Rank correlation

    Spearman's rank correlation and Kendall's tau both estimate the relationship between ordinal variables from their rank numbers.
    The variable Height..in. has 9 unique values in your dataset, and some of the ranges overlap. The variable Longevity..yrs. has 5 unique values, again with partly overlapping ranges. Despite the overlap, the unique values can still be put in a sensible order.

    I created two ordered factors from these variables and added them to the data frame. Note that I stored the dataset in an object called data so that I can reference the variables with the $ operator. If your dataset has a different name, adjust the code accordingly.

    # Ordered factors: the position of each level is the rank of that range
    data$factor_Height..in. <- factor(data$Height..in., ordered = TRUE,
                                      levels = c("6-9", "10-14", "10-15", "12-16", "13-15", "21-24", "21-25", "22-26", "22-27"),
                                      labels = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
    data$factor_Longevity..yrs. <- factor(data$Longevity..yrs., ordered = TRUE,
                                          levels = c("7-10", "8-10", "10-12", "12-15", "12-20"),
                                          labels = c(1, 2, 3, 4, 5))
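
    If typing the levels by hand feels error-prone, the ordering can also be derived from the data. The following is only a sketch under one assumption: ranges are ordered by their lower bound, with ties broken by the upper bound. For the height column this rule reproduces the hand-written level order above. The names height, bounds, height_levels and height_rank are chosen for this example.

```r
# Height ranges from the question (strings of the form "low-high")
height <- c("21-24", "22-26", "12-16", "10-15", "13-15",
            "6-9", "21-25", "21-24", "10-14", "22-27")

# Split each range into numeric bounds; the result is a 2 x n matrix
bounds <- sapply(strsplit(height, "-", fixed = TRUE), as.numeric)
lower <- bounds[1, ]
upper <- bounds[2, ]

# Sort by lower bound, then by upper bound, and keep each range once
height_levels <- unique(height[order(lower, upper)])

# Ordered factor; as.integer() returns the position of each level, i.e. the rank
height_factor <- factor(height, levels = height_levels, ordered = TRUE)
height_rank <- as.integer(height_factor)

print(height_levels)
print(height_rank)
```

    Other ordering rules (for example by midpoint) are equally defensible, and they can give a slightly different ranking when ranges overlap.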
    

    These two factors can then be used to calculate Spearman's and Kendall's rank correlation coefficients.

    cor(as.numeric(data$factor_Height..in.), as.numeric(data$factor_Longevity..yrs.), method = "spearman")
    cor(as.numeric(data$factor_Height..in.), as.numeric(data$factor_Longevity..yrs.), method = "kendall")
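
    cor only returns the coefficient. If you also want a significance test, cor.test from the stats package accepts the same method argument. The sketch below is self-contained: the two rank vectors are the level positions that the factors above assign to the ten breeds, typed out by hand, and exact = FALSE is set because the ranks contain ties.

```r
# Ranks of the ten breeds, in the same row order as the question's data
height_rank    <- c(6, 8, 4, 3, 5, 1, 7, 6, 2, 9)
longevity_rank <- c(3, 1, 2, 4, 4, 5, 3, 3, 4, 2)

# Rank correlation tests; exact = FALSE requests the approximate p-value,
# since an exact one cannot be computed when there are ties
spearman_res <- cor.test(height_rank, longevity_rank,
                         method = "spearman", exact = FALSE)
kendall_res  <- cor.test(height_rank, longevity_rank,
                         method = "kendall", exact = FALSE)

print(spearman_res$estimate)  # Spearman's rho
print(spearman_res$p.value)
print(kendall_res$estimate)   # Kendall's tau
print(kendall_res$p.value)
```

    With only ten breeds the p-values should be read with caution.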
    

    Option 2: (Mean) values instead of ranges

    You could also replace each range with its midpoint (the mean of the lower and upper bound) and then calculate the (default) Pearson correlation coefficient on those values.

    mean_Height..in. <- sapply(strsplit(as.character(data$Height..in.) , "-", 
                                      fixed = TRUE), function(x) sum(as.numeric(x)))
    mean_Height..in. <- mean_Height..in. / 2
    mean_Longevity..yrs. <- sapply(strsplit(as.character(data$Longevity..yrs.) , "-", 
                                        fixed = TRUE), function(x) sum(as.numeric(x)))
    mean_Longevity..yrs. <- mean_Longevity..yrs. / 2
    cor(mean_Height..in., mean_Longevity..yrs.)
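
    The same midpoint calculation can be wrapped in a small helper so it is written only once. This is a sketch, not a replacement for the code above: range_midpoint is a made-up name, and it assumes every value has the form "low-high".

```r
# Convert strings such as "21-24" into the midpoint of the range (22.5)
range_midpoint <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

# Range columns from the question
height    <- c("21-24", "22-26", "12-16", "10-15", "13-15",
               "6-9", "21-25", "21-24", "10-14", "22-27")
longevity <- c("10-12", "7-10", "8-10", "12-15", "12-15",
               "12-20", "10-12", "10-12", "12-15", "8-10")

height_mid    <- range_midpoint(height)
longevity_mid <- range_midpoint(longevity)

# Pearson correlation of the midpoints
r <- cor(height_mid, longevity_mid)
print(r)
```

    With the data frame from the question, the call would be range_midpoint(data$Height..in.) and range_midpoint(data$Longevity..yrs.).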
    

    Both the Spearman and Kendall coefficients of the ranked values and the Pearson correlation of the averaged ranges are negative. So, as you expected, your data suggest that the taller the breed, the shorter the lifespan.

    Plot the data

    A simple scatter plot can be used to display the relationship. Again, the range strings cannot be plotted directly, so we use the ranks instead.

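    plot(as.numeric(data$factor_Longevity..yrs.), as.numeric(data$factor_Height..in.),
         xlab = "Longevity (rank)", ylab = "Height (rank)")
    # Label each point with its breed, using the same numeric coordinates as plot()
    text(as.numeric(data$factor_Longevity..yrs.), as.numeric(data$factor_Height..in.), data$Breed)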
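
    As an alternative to plotting ranks, the midpoints can be plotted in their original units, with the ranges themselves drawn as grey bars. This is a base-graphics sketch under the same "low-high" assumption as before; to_bounds and the other object names are invented for the example, and the dashed line is an ordinary least-squares fit on the midpoints, added only as a visual guide.

```r
breed <- c("Labrador Retriever", "German Shepherd", "Bulldog", "Poodle",
           "Beagle", "Chihuahua", "Boxer", "Golden Retriever", "Pug",
           "Rottweiler")
height    <- c("21-24", "22-26", "12-16", "10-15", "13-15",
               "6-9", "21-25", "21-24", "10-14", "22-27")
longevity <- c("10-12", "7-10", "8-10", "12-15", "12-15",
               "12-20", "10-12", "10-12", "12-15", "8-10")

# Turn "low-high" strings into an n x 2 matrix: column 1 = low, column 2 = high
to_bounds <- function(x) {
  t(sapply(strsplit(x, "-", fixed = TRUE), as.numeric))
}
h <- to_bounds(height)
l <- to_bounds(longevity)
h_mid <- rowMeans(h)
l_mid <- rowMeans(l)

# Midpoints as points, in inches and years
plot(h_mid, l_mid, pch = 19, xlim = range(h), ylim = range(l),
     xlab = "Height (in)", ylab = "Longevity (yrs)")
# Horizontal bars show the height range, vertical bars the longevity range
segments(h[, 1], l_mid, h[, 2], l_mid, col = "grey60")
segments(h_mid, l[, 1], h_mid, l[, 2], col = "grey60")
# Breed names above each point
text(h_mid, l_mid, labels = breed, pos = 3, cex = 0.7)
# Dashed least-squares line through the midpoints
abline(lm(l_mid ~ h_mid), lty = 2)
```

    Breeds with identical ranges (for example Labrador Retriever and Golden Retriever) land on the same point, so their labels overlap.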
    

    Hope that helps!