How to group files in a list based on name?


I have 4 files:

MCD18A1.A2001001.h15v05.061.2020097222704.hdf

MCD18A1.A2001001.h16v05.061.2020097221515.hdf

MCD18A1.A2001002.h15v05.061.2020079205554.hdf

MCD18A1.A2001002.h16v05.061.2020079205717.hdf

And I want to group them by name (date: A2001001 and A2001002) inside a list, something like this:

[[MCD18A1.A2001001.h15v05.061.2020097222704.hdf, MCD18A1.A2001001.h16v05.061.2020097221515.hdf], [MCD18A1.A2001002.h15v05.061.2020079205554.hdf, MCD18A1.A2001002.h16v05.061.2020079205717.hdf]]

I did this using Python, but I don't know how to do it in R:

# Separate files by date
MODIS_files_bydate = [list(i) for _, i in itertools.groupby(MODIS_files, lambda x: x.split('.')[1])]

Solution

  • Is this what you are looking for?

    g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)
    split(s, g)
    #$A2001001
    #[1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf"
    #[2] "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
    #
    #$A2001002
    #[1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf"
    #[2] "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
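
A closer counterpart to the Python `x.split('.')[1]` is to take the second dot-separated field as the grouping key. The following is a minimal sketch, not part of the original answer; the names `files`, `key` and `by_date` are illustrative.

```r
# Illustrative input: the four file names from the question
files <- c(
  "MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
  "MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
  "MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
  "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
)

# Split each name on literal dots and keep the second field (the date)
key <- vapply(strsplit(files, ".", fixed = TRUE), `[`, character(1), 2)

# split() groups the names by key; unname() drops the date labels,
# leaving a plain list of character vectors like the one in the question
by_date <- unname(split(files, key))
```

Unlike `itertools.groupby`, `split` does not require equal keys to be adjacent, so the input does not have to be sorted first.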
    

    regex explained

    The regex is divided into three parts.

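    1. ^[^\\.]*\\.
      • ^ the first circumflex anchors the match at the beginning of the string;
      • [^\\.] a character class matching any character except a dot (inside the brackets, the second ^ negates the class). The dot is written escaped, \\., although inside a class it is already literal;
      • [^\\.]* that class repeated zero or more times (*), i.e. a run of non-dot characters at the start;
      • \\. the run ends with a literal dot. Outside a class the dot is a meta-character and, therefore, must be escaped.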
    2. ([^\\.]+) is a capture group.
      • [^\\.] the class with no dots, like above;
      • [^\\.]+ repeated at least one time (+).
    3. \\..*$
      • \\. one literal dot;
      • .*$ followed by any character repeated zero or more times (.*) until the end of the string ($).

    sub replaces the whole match, which here is the entire string, with the contents of the capture group (what is between the parentheses), referenced as \\1. Everything outside the group is discarded.
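
To see the pieces described above on a single file name, the full match and the capture group can be inspected separately. This is only an illustrative sketch; `x`, `pattern`, `m` and `date` are names introduced here.

```r
x <- "MCD18A1.A2001001.h15v05.061.2020097222704.hdf"
pattern <- "^[^\\.]*\\.([^\\.]+)\\..*$"

# regexec() reports the full match and each capture group;
# regmatches() extracts the corresponding substrings
m <- regmatches(x, regexec(pattern, x))[[1]]
m[1]  # the full match: the entire file name
m[2]  # the capture group: "A2001001"

# sub() swaps the full match for the capture group, leaving only the date
date <- sub(pattern, "\\1", x)
```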


    Data

    s <- "
    MCD18A1.A2001001.h15v05.061.2020097222704.hdf
    MCD18A1.A2001001.h16v05.061.2020097221515.hdf
    MCD18A1.A2001002.h15v05.061.2020079205554.hdf
    MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
    s <- scan(text = s, what = character())
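
If the names come from files on disk rather than from a string, `list.files` can supply the vector instead. The sketch below assumes a hypothetical folder (`modis_dir`, created in a temporary directory with empty placeholder files) purely for illustration; the real folder path is not given in the question.

```r
# Hypothetical setup: empty files with the question's names in a temp folder
modis_dir <- file.path(tempdir(), "modis_example")
dir.create(modis_dir, showWarnings = FALSE)
file.create(file.path(modis_dir, c(
  "MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
  "MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
  "MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
  "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
)))

# List the .hdf files (names only, without the directory part)
s <- list.files(modis_dir, pattern = "\\.hdf$")

# Same grouping as in the answer: extract the date field, then split by it
g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)
groups <- split(s, g)
```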