Search code examples
rapache-arrow

Error when converting large .txt file to parquet format (apache-arrow)


I'm working with a large .txt file and I'm trying to use arrow to process it. I opened the dataset, inspected it and it looks ok.

Next, I tried to partition it based on a column called COUNTY and convert it to parquet format. However, I'm getting this error:

    pq_path <- "ebird/files"
    data |>
     group_by(COUNTY)  |>
     write_dataset(path = pq_path, format = "parquet")

Error in env_bind0(env, data) : attempt to use zero-length variable name

What am I doing wrong?

Full code:

    data <- open_dataset(
      sources = "ebd_PT_prv_relDec-2022/ebd_PT_prv_relDec-2022.txt", 
      format = "tsv"
    ) 
    
    data |> glimpse()
    
    FileSystemDataset with 1 csv file
    8,896,216 rows x 50 columns
    $ `GLOBAL UNIQUE IDENTIFIER`     <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabOfOrnithology:EB…
    $ `LAST EDITED DATE`      <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021-07-18 11:50:07…
    $ `TAXONOMIC ORDER`               <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 2033, 23151, 7125…
    $ CATEGORY                       <string> "species", "species", "species", "species", "species", "species", "species", "spec…
    $ `TAXON CONCEPT ID`             <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibase-E074D706", "a…
    $ `COMMON NAME`                  <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged Gull", "Gargane…
    $ `SCIENTIFIC NAME`              <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirundo", "Larus mic…
    $ `SUBSPECIES COMMON NAME`       <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Cory's Sh…
    $ `SUBSPECIES SCIENTIFIC NAME`   <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "Calonectr…
    $ `EXOTIC CODE`                  <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
    $ `OBSERVATION COUNT`            <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X…
    $ `BREEDING CODE`                <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
    $ `BREEDING CATEGORY`            <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
    $ `BEHAVIOR CODE`                <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
    $ `AGE/SEX`                      <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
    $ COUNTRY                        <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portugal"…
    $ `COUNTRY CODE`                 <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT"…
    $ STATE                          <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores", "Região Autó…
    $ `STATE CODE`                   <string> "PT-11", "PT-20", "PT-20", "PT-20", "PT-14", "PT-14", "PT-20", "PT-11", "PT-11", "…
    $ COUNTY                         <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribeira Grande", "C…
    $ `COUNTY CODE`                  <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-AB", "PT-20-RG"…
    $ `IBA CODE`                       <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `BCR CODE`                       <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `USFWS CODE`                     <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `ATLAS BLOCK`                    <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ LOCALITY                       <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michaels, Azores", "S…
    $ `LOCALITY ID`                  <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L2168609", "L29264…
    $ `LOCALITY TYPE`                <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H", "H", "P", "P…
    $ LATITUDE                       <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.77506, 38.70212, 38…
    $ LONGITUDE                      <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752, -25.459442, -…
    $ `OBSERVATION DATE`        <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04-09, 1957-11-26…
    $ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00,       NA,       NA,       NA,       NA,       NA,       NA,       NA,   …
    $ `OBSERVER ID`                  <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr3529974", "obsr35299…
    $ `SAMPLING EVENT IDENTIFIER`    <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", "S125610727", "…
    $ `PROTOCOL TYPE`                <string> "Incidental", "Historical", "Historical", "Historical", "Incidental", "Historical"…
    $ `PROTOCOL CODE`                <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62", "P20", "P20"…
    $ `PROJECT CODE`                 <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIRD", "EBIRD", "E…
    $ `DURATION MINUTES`              <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `EFFORT DISTANCE KM`           <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `EFFORT AREA HA`               <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `NUMBER OBSERVERS`              <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3,…
    $ `ALL SPECIES REPORTED`          <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    $ `GROUP IDENTIFIER`             <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
    $ `HAS MEDIA`                     <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    $ APPROVED                        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
    $ REVIEWED                        <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
    $ REASON                           <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ `TRIP COMMENTS`                <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marshall J. Iliff on…
    $ `SPECIES COMMENTS`             <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra-se taxidermada…
    $ ``                               <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    >

    pq_path <- "ebird/files"
    data |>
     group_by(COUNTY)  |>
     write_dataset(path = pq_path, format = "parquet")


Error in env_bind0(env, data) : attempt to use zero-length variable name

Solution

  • You don't need to use group_by: if you want to partition the data, use the partitioning= (and optionally hive_style=) arguments:

    pq_path <- "mtpq"
    arrow::write_dataset(mtcars, pq_path, partitioning="cyl", format="parquet")
    list.files("mtpq", recursive = TRUE, full.names = TRUE)
    # [1] "mtpq/cyl=4/part-0.parquet" "mtpq/cyl=6/part-0.parquet" "mtpq/cyl=8/part-0.parquet"
    

    If you want the subdirs to be just "4", etc, then add hive_style=FALSE.

    I think for your data, you can simply use:

    write_dataset(data, pq_path, partitioning = "COUNTY", format = "parquet")