Search code examples
rggplot2ggalluvial

alluvial diagram in R, Error: Data not in recognizable format


I want to compare two vegetation maps (.shp) using an alluvial diagram. One vegetation map is from 2010, one from 2023. Mapping units in both 2010 and 2023 are the same. Mapping area's however where not the same, so i intersected both maps.

Now i have the data (after some thorough restructuring of the dataframe) in the following format (random subset of 15 rows, total is in original dataset is 9220):

      ID value     total  Jaar
   <int> <chr>     <dbl> <dbl>
 1  1927 H0000  2.33e- 8  2023
 2  1030 H4030  7.64e+ 5  2010
 3  2447 H7120  3.65e- 5  2023
 4   301 H0000  2.47e- 8  2023
 5   611 H0000  2.73e-17  2023
 6  4021 H0000  1.17e+ 5  2010
 7  1531 H0000  3.11e+ 4  2023
 8   759 H0000  2.84e- 4  2010
 9  1339 H6230  6.51e- 7  2010
10  2848 H9999  2.23e- 5  2010
11  1740 H4010A 3.17e- 7  2023
12   335 H4030  5.90e- 5  2023
13  4182 H7120  1.47e- 3  2023
14  2676 H0000  3.81e+ 4  2023
15  2828 H9999  2.89e+ 5  2010

ID = unique id to be able to couple the vegetation in a map-polygon in 2010 to the vegetation in that same polygon in 2023. Up to three different vegetation-types can occur per polygon and it is possible that in 2010 only one type occured in a polygon and in 2023 3 types occured in a polygon. This means that there will be for example one 611 ID in 2010, and three 611 ID's in 2023.

value = vegetation type (habitat type)

total = total surface area in square meters

Jaar = year of mapping

i use the following code to make the alluvial diagram:

 data_alluv_long %>%
   mutate(Jaar = factor(Jaar, levels = c("2010",
                                         "2023")),
          value = factor(value, levels = c("H9999",
                                           "H0000",
                                           "H91D0",
                                           "H7110B",
                                           "H4030",
                                           "H7120",
                                           "H4010A",
                                           "H3160",
                                           "H6230",
                                           "H7150",
                                           "H7110A",
                                           "H2320",
                                           "H0410A",
                                           "H0401A"
                                           ))) %>%
ggplot( 
       aes(x = Jaar, 
           stratum = value, 
           alluvium = ID, 
           y = total)) +
  geom_alluvium(aes(fill = value), 
                alpha = 0.7) +
  geom_stratum() +
  theme_minimal() +
  labs(
    title = "Veranderingen in Habitattypen",
    x = "Jaar",
    y = "Oppervlakte"
  )

I keep getting this error:

Error in `geom_alluvium()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_data()`:
! Data is not in a recognized alluvial form (see `help('alluvial-data')` for details).

As far as i can find out in the 'help'option, my df is in the right format. Some values are extremely small (an artefact from the intersection), so i tried omitting these by adding filter(total > 0.001) but same error occurs.

What i want from the alluvial: two bars, one for 2010 and one for 2023. The fill of the bars must be the vegetation types (habitat types). The flow must show how certain types stayed the same, or how certain times changed during the 13 years.

My question: Where does the error come from and how is my data not in the correct structure?

Link to complete dataset: https://www.dropbox.com/scl/fo/ov0tmhoujvbg4vohsj8td/AGcnP2Cd4Wt3jC4wuIeqYYA?rlkey=aq9lkm6rb0juod4jepkq03vqr&dl=0

Example with (artistic impression) picture: enter image description here


Solution

  • The is_lodes_form() check reveals that you have duplicated ID-axis pairings, which is what I suspected might happen given your earlier description that some polygons have different numbers of vegetation types between years.

    Solution:

    Instead of requiring each ID to appear exactly once in each year, we create a unique flow_id that combines the original ID with an index for each vegetation type within that ID. sol

    Code

    # First, let's load required libraries
    library(tidyverse)
    library(ggalluvial)
    setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory
    
    
    # Read in data from csv
    data_alluv_long <- read.csv("alluv_long.csv")
    
    
    # Ensure each ID appears in both years
    data_complete <- data_alluv_long %>%
      group_by(ID) %>%
      filter(n_distinct(Jaar) == 2) %>%
      ungroup()
    
    is_lodes_form(
      data_complete,
      Jaar,
      value,
      ID
    )
    # is wrong so there is something wrong --> Duplicated id-axis pairings. This is your error
    
    # Fix it
    
    # Function to prepare the data
    prepare_alluvial_data <- function(data) {
      # Step 1: Create a unique identifier for each ID-vegetation combination
      data_prepared <- data %>%
        group_by(ID, Jaar) %>%
        # Create a unique combination identifier
        mutate(veg_index = row_number(),
               # Create a unique flow identifier
               flow_id = paste(ID, veg_index, sep = "_")) %>%
        ungroup()
      
      # Step 2: Ensure the data is properly structured for the alluvial diagram
      data_prepared <- data_prepared %>%
        # Convert Jaar to factor
        mutate(Jaar = factor(Jaar),
               # Ensure value is a factor with specified levels if needed
               value = factor(value))
      
      return(data_prepared)
    }
    
    # Using your data:
    data_ready <- prepare_alluvial_data(data_complete)
    
    
    # Create the plot with the modified data
    ggplot(data_ready,
           aes(x = Jaar, 
               stratum = value, 
               alluvium = flow_id,  # Using the new flow_id instead of ID
               y = total,
               fill = value)) +
      geom_flow(alpha = 0.7) +
      geom_stratum(alpha = 0.8) +
      scale_x_discrete(expand = c(0.1, 0.1)) +
      theme_minimal() +
      labs(
        title = "Veranderingen in Habitattypen",
        x = "Jaar",
        y = "Oppervlakte (m²)"
      ) +
      scale_fill_discrete(name = "Habitattype") +
      theme(legend.position = "right")
    
    # Sanity Check
    
    sum_H0000 <- data_ready %>%
      filter(value == "H0000") %>%
      group_by(Jaar) %>%
      summarise(total_area = sum(total)) %>%
      arrange(Jaar)
    
    sum_H0000$total_area[2]/sum_H0000$total_area[1]
    
    # Check total area by year
    data_alluv_long %>%
      group_by(Jaar) %>%
      summarise(total_area = sum(total))
    

    Output

    out