data loss with read.table() in R studio


Struggling with data loss when using read.table in R

I downloaded the entire World Checklist of Vascular Plants (WCVP) database, version 9:

http://sftp.kew.org/pub/data-repositories/WCVP/

Unzip the file to get wcvp_v9_jun_2022.txt and use Ctrl+F to search for "Corymbia": you will find many rows where genus == "Corymbia". The same is true for genus == "Eucalyptus" and genus == "Angophora".

I imported it into RStudio with the following line:

WCVP <- read.table("wcvp_v9_jun_2022.txt",sep = "|", fill = T, header = T)

and checked for the data:

WCVP[WCVP$genus=="Corymbia",]

WCVP[WCVP$genus=="Eucalyptus",]

WCVP[WCVP$genus=="Angophora",]

For Corymbia I got this response:

 WCVP[WCVP$genus=="Corymbia",]
 [1] kew_id           family           genus            species         
 [5] infraspecies     taxon_name       authors          rank            
 [9] taxonomic_status accepted_kew_id  accepted_name    accepted_authors
[13] parent_kew_id    parent_name      parent_authors   reviewed        
[17] publication      original_name_id
<0 rows> (or 0-length row.names)

Meanwhile, the data for the other two genera are intact, and R prints the matching rows.

Why is the data for genus Corymbia missing after the .txt file is imported into RStudio? Is this a bug, and how do I troubleshoot it?

Many thanks


Solution

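  • The data contain embedded single quotes (apostrophes, not always paired), and these throw off the import. By default read.table() treats both " and ' as quote characters (quote = "\"'"), so an unpaired apostrophe can open a quoted string that swallows the lines after it. Set quote = "" to disable quoting and you should see all the data.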

    # Default settings: both " and ' are treated as quote characters
    WCVP <- read.table("wcvp_v9_jun_2022.txt",
                       sep = "|", fill = TRUE, header = TRUE)
    nrow(WCVP)
    # [1] 605649
    WCVP[WCVP$genus=="Corymbia",]
    #  [1] kew_id           family           genus            species          infraspecies     taxon_name       authors          rank             taxonomic_status accepted_kew_id  accepted_name    accepted_authors parent_kew_id   
    # [14] parent_name      parent_authors   reviewed         publication      original_name_id
    # <0 rows> (or 0-length row.names)
    
    # Same call with quoting disabled
    WCVP <- read.table("wcvp_v9_jun_2022.txt",
                       sep = "|", fill = TRUE, header = TRUE, quote = "")
    nrow(WCVP)
    # [1] 1232931                                    ## DIFFERENT!
    
    head(WCVP[WCVP$genus=="Corymbia",], 3)
    #          kew_id    family    genus    species infraspecies          taxon_name                                     authors    rank taxonomic_status accepted_kew_id accepted_name accepted_authors parent_kew_id parent_name
    # 758307 986238-1 Myrtaceae Corymbia                                    Corymbia                    K.D.Hill & L.A.S.Johnson   GENUS         Accepted                                                                         
    # 758308 986307-1 Myrtaceae Corymbia abbreviata              Corymbia abbreviata (Blakely & Jacobs) K.D.Hill & L.A.S.Johnson SPECIES         Accepted                                                     986238-1    Corymbia
    # 758309 986248-1 Myrtaceae Corymbia  abergiana               Corymbia abergiana         (F.Muell.) K.D.Hill & L.A.S.Johnson SPECIES         Accepted                                                     986238-1    Corymbia
    #                  parent_authors reviewed           publication original_name_id
    # 758307                          Reviewed Telopea 6: 214 (1995)                 
    # 758308 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 344 (1995)         592646-1
    # 758309 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 244 (1995)         592647-1
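
As for how to troubleshoot this kind of thing, one rough sanity check is to compare the number of rows R returns with the number of raw lines in the file, and to confirm that every line splits into the same number of fields. The sketch below uses a tiny synthetic file rather than the real download. Its rows, the `path` variable, and the three-column layout are made up for illustration, and reading the whole file with readLines() may be slow for something the size of WCVP.

```r
# Synthetic pipe-delimited file; the rows are invented for illustration,
# not copied from WCVP. Note the unpaired apostrophe in "'t Hart".
path <- tempfile(fileext = ".txt")
writeLines(c(
  "kew_id|genus|authors",
  "1|Angophora|Cav.",
  "2|Sedum|'t Hart",
  "3|Corymbia|K.D.Hill & L.A.S.Johnson",
  "4|Corymbia|(F.Muell.) K.D.Hill & L.A.S.Johnson"
), path)

# Raw line count minus the header line is the number of rows to expect.
expected_rows <- length(readLines(path)) - 1L

# Fields per line with quote and comment handling switched off;
# every line should report the same number of fields.
fields_per_line <- count.fields(path, sep = "|", quote = "", comment.char = "")

# Import with quoting disabled, as in the fix above.
dat <- read.table(path, sep = "|", header = TRUE, fill = TRUE, quote = "")

expected_rows                 # 4
unique(fields_per_line)       # 3
nrow(dat) == expected_rows    # TRUE
sum(dat$genus == "Corymbia")  # 2
```

If nrow() comes out lower than the raw line count on the real file, something in the data is probably still being interpreted as a quote or comment character rather than read literally.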