r group-by split filenames python-itertools

How to group files in a list based on name?

I have 4 files:

MCD18A1.A2001001.h15v05.061.2020097222704.hdf

MCD18A1.A2001001.h16v05.061.2020097221515.hdf

MCD18A1.A2001002.h15v05.061.2020079205554.hdf

MCD18A1.A2001002.h16v05.061.2020079205717.hdf

And I want to group them by name (date: A2001001 and A2001002) inside a list, something like this:

[[MCD18A1.A2001001.h15v05.061.2020097222704.hdf, MCD18A1.A2001001.h16v05.061.2020097221515.hdf], [MCD18A1.A2001002.h15v05.061.2020079205554.hdf, MCD18A1.A2001002.h16v05.061.2020079205717.hdf]]

I did this using Python, but I don't know how to do it in R:

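import itertools
# Separate files by date (groupby only merges adjacent items, so MODIS_files must already be sorted by date)
MODIS_files_bydate = [list(i) for _, i in itertools.groupby(MODIS_files, lambda x: x.split('.')[1])]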

Solution

Is this what you are looking for?

g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)
split(s, g)
#$A2001001
#[1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf"
#[2] "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
#
#$A2001002
#[1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf"
#[2] "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
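
The result above is a named list. If an unnamed list of vectors is preferred, closer to the nested list shown in the question, the names could be dropped with unname. A minimal sketch, assuming the same four file names; by_date is just an illustrative variable name. Note that split orders the groups by the sorted key, not by first appearance as itertools.groupby does.

```r
# Same four file names as in the Data section
s <- c(
  "MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
  "MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
  "MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
  "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
)

# Grouping key: the second dot-separated field (the date)
g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)

# by_date is a hypothetical name; unname() drops the $A2001001 / $A2001002 labels
by_date <- unname(split(s, g))

length(by_date)   # number of distinct dates
by_date[[1]]      # files for the first date
```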

Regex explained

The regex is divided into three parts.

^[^\\.]*\\.
- ^ the first circumflex marks the beginning of the string;
- [^\\.] a character class negating a dot (the second ^ does the negation). Outside a class the dot is a meta-character and must be escaped as \\. to match a literal dot; inside a class the escape is harmless but not strictly needed;
- [^\\.]* that class repeated zero or more times (*), i.e. a run of non-dot characters at the beginning;
- \\. the run ends at the first literal dot.
([^\\.]+) is a capture group.
- [^\\.] the class with no dots, like above;
- [^\\.]+ repeated at least one time (+).
\\..*$
- \\. starts with one literal dot;
- .*$ then any character repeated zero or more times until the end of the string ($).

Because the pattern matches the whole string, sub replaces the entire string with the contents of the capture group (what is between the parentheses), referenced as \\1. This discards everything else.
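
As a cross-check on the explanation above, the same key can probably be obtained without a regex by splitting on literal dots and taking the second piece, which mirrors x.split('.')[1] from the Python version in the question. This is only a sketch under the assumption that every file name has at least two dot-separated fields; g_regex and g_split are illustrative names.

```r
# Same four file names as in the Data section
s <- c(
  "MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
  "MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
  "MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
  "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
)

# Regex approach: the whole string is replaced by the capture group
g_regex <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)

# Alternative: split on literal dots (fixed = TRUE disables regex
# interpretation of ".") and keep the second field of each name
g_split <- vapply(strsplit(s, ".", fixed = TRUE), `[`, character(1), 2)

# Both keys should agree, so either can be passed to split()
identical(g_regex, g_split)
split(s, g_split)
```

The strsplit version is arguably easier to read, while the regex version makes the expected file name layout explicit.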

Data

s <- "
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
s <- scan(text = s, what = character())