How to get number of files per higher-level directory from specific sub-directories?


I have a collection of WAV-files in different sub-directories. I would like to get a count of how many WAV-files there are per project, but only from a specific sub-directory, "Files for analysis", a folder that is present within each project directory. What is the best way to go about this?

Each main directory is named after a recording project. Inside each project directory are two sub-directories, "Files for analysis" and "Backup". Each of these contains one sub-folder per recording round, and each round contains one sub-folder per recording device. The device folders hold numerous WAV-files. Visually, the folder structure looks like this (with many more projects and WAV-files; the names are just examples):

                                                | Box 1 --- 1.WAV, 2.WAV, 3.WAV
                                  | Round 1 --- | Box 2 --- 1.WAV, 2.WAV, 3.WAV
            | Files for analysis -              
            |                     | Round 2 --- | Box 3 --- 1.WAV, 2.WAV, 3.WAV
            |                                   | Box 4 --- 1.WAV, 2.WAV, 3.WAV
Project 1 --
            |                                   | Box 1 --- 1.WAV, 2.WAV, 3.WAV
            |                     | Round 1 --- | Box 2 --- 1.WAV, 2.WAV, 3.WAV         
            | Backup  ------------                     
                                  | Round 2 --- | Box 3 --- 1.WAV, 2.WAV, 3.WAV
                                                | Box 4 --- 1.WAV, 2.WAV, 3.WAV

                                                | Box 5 --- 1.WAV, 2.WAV, 3.WAV
                                  | Round 1 --- | Box 6 --- 1.WAV, 2.WAV, 3.WAV
            | Files for analysis -              
            |                     | Round 2 --- | Box 7 --- 1.WAV, 2.WAV, 3.WAV
            |                                   | Box 8 --- 1.WAV, 2.WAV, 3.WAV
Project 2 --
            |                                   | Box 5 --- 1.WAV, 2.WAV, 3.WAV
            |                     | Round 1 --- | Box 6 --- 1.WAV, 2.WAV, 3.WAV         
            | Backup  ------------                     
                                  | Round 2 --- | Box 7 --- 1.WAV, 2.WAV, 3.WAV
                                                | Box 8 --- 1.WAV, 2.WAV, 3.WAV

On my computer, an example file path to a WAV-file would look like this:

S:/sound_files/2024/R/testfolder/Project 1/Files for analysis/Round 1/Box 1/1.WAV

So far I have cobbled together a script that gives me the number of WAV-files per sub-directory ("box"), but not per project. (I'm not a programmer so apologies in advance for sub-par code!)

main <- "S:/sound_files/2024/R/testfolder"

## List all folders 
dirs <- list.dirs(main, full.names = TRUE, recursive=TRUE)

## List top-level project folders
only_mains <- dirs[lengths(strsplit(dirs, "/")) == 6 ] 

## get folders with "Files for analysis"
dir_files_for_analysis <- dirs[lengths(strsplit(dirs, "/")) == 7 ]
dir_files_for_analysis <- grep("Files for analysis", dir_files_for_analysis, value = TRUE) 

## List all WAV-files in Files for analysis
## (pattern is a regular expression: escape the dot and anchor it at the end of the name)
files <- list.files(dir_files_for_analysis, pattern = "\\.WAV$", recursive = TRUE, full.names = TRUE)

length(files) ## How many WAV-files total

## get sub-directory folders with files
dir_list <- split(files, dirname(files)) 

files_in_folder <- sapply(dir_list, length)

head(files_in_folder)

If I replace dirname(files) with only_mains, split() just divides the files into as many groups as there are project folders, irrespective of which folder each file comes from. I have not been able to find a way to extract the path of the project directory from a file path; dirname() only gives me the file's own directory (e.g. "Box 1").

What I get is this:

S:/sound_files/2024/R/testfolder/Project 1/Files for analysis/Round 1/Box 1
20

S:/sound_files/2024/R/testfolder/Project 1/Files for analysis/Round 2/Box 2
19

S:/sound_files/2024/R/testfolder/Project 2/Files for analysis/Round 1/Box 3
20

S:/sound_files/2024/R/testfolder/Project 2/Files for analysis/Round 2/Box 4
20

The ideal result for this script should look like this:

S:/sound_files/2024/R/testfolder/Project 1   39 files
S:/sound_files/2024/R/testfolder/Project 2   40 files

Solution

  • You can try the following:

    main <- "S:/sound_files/2024/R/testfolder"
    # Get the full paths of all top-level project directories
    all_projects <- list.dirs(main, recursive = FALSE, full.names = TRUE)
    
    # Count the WAV files below a project's "Files for analysis" folder.
    # The folder name is spelled as in the question; the spelling matters
    # on case-sensitive file systems.
    count_files_from_folder <- function(folder) {
      length(list.files(file.path(folder, "Files for analysis"), 
                        pattern = "\\.WAV$", recursive = TRUE))
    }
    
    # Apply the count to each project folder
    sapply(all_projects, count_files_from_folder)
    

    On the test structure that I set up to verify my answer, it returns a named vector:

    #Test/Project 1 Test/Project 2 
    #             4              7 
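
A test structure like the one mentioned above can be built in a temporary directory. The following is a minimal sketch, not the answerer's actual setup: the folder layout is reduced to one round and one box per project, and the names `n_wav`, `analysis_box` and `backup_box` are made up for illustration. It creates the same number of empty WAV files under "Files for analysis" and "Backup", so that the count can be checked to ignore the backups.

```r
# Hypothetical throwaway test tree (names and file counts are illustrative).
main <- file.path(tempdir(), "testfolder")
unlink(main, recursive = TRUE)  # start from a clean slate

# Number of WAV files to create per project inside "Files for analysis".
n_wav <- c("Project 1" = 4, "Project 2" = 7)

for (project in names(n_wav)) {
  analysis_box <- file.path(main, project, "Files for analysis", "Round 1", "Box 1")
  backup_box   <- file.path(main, project, "Backup", "Round 1", "Box 1")
  dir.create(analysis_box, recursive = TRUE)
  dir.create(backup_box, recursive = TRUE)

  wav_names <- paste0(seq_len(n_wav[[project]]), ".WAV")
  file.create(file.path(analysis_box, wav_names))
  file.create(file.path(backup_box, wav_names))  # backups must not be counted
}

# Same counting approach as in the answer, applied to the test tree.
all_projects <- list.dirs(main, recursive = FALSE, full.names = TRUE)

count_files_from_folder <- function(folder) {
  length(list.files(file.path(folder, "Files for analysis"),
                    pattern = "\\.WAV$", recursive = TRUE))
}

counts <- sapply(all_projects, count_files_from_folder)
names(counts) <- basename(names(counts))  # shorten the names to the project folder
counts
```

If the counts match `n_wav`, the files under "Backup" were correctly left out.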
    

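    If you would rather have a data frame as output, you can stack() the vector. stack() returns the columns values and ind, so [2:1] swaps them to put the folder name first.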

    stack(sapply(all_projects, count_files_from_folder))[2:1]
    
    #             ind values
    #1 Test/Project 1      4
    #2 Test/Project 2      7
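
The question also asks how to recover the project directory from a file path, since dirname() only returns the file's own folder. One possible way, sketched here under the assumption that every path contains the literal segment "/Files for analysis/", is to cut the path at that segment with sub() and tabulate the result. The `files` vector below is a hand-written stand-in modelled on the example path in the question; in practice it would be the output of list.files(..., full.names = TRUE). The names `project_dir` and `per_project` are illustrative.

```r
# Example paths modelled on the question (stand-in for the list.files() result).
files <- c(
  "S:/sound_files/2024/R/testfolder/Project 1/Files for analysis/Round 1/Box 1/1.WAV",
  "S:/sound_files/2024/R/testfolder/Project 1/Files for analysis/Round 1/Box 1/2.WAV",
  "S:/sound_files/2024/R/testfolder/Project 1/Files for analysis/Round 2/Box 3/1.WAV",
  "S:/sound_files/2024/R/testfolder/Project 2/Files for analysis/Round 1/Box 5/1.WAV",
  "S:/sound_files/2024/R/testfolder/Project 2/Files for analysis/Round 2/Box 7/1.WAV"
)

# Drop everything from "/Files for analysis/" onwards, leaving the project path.
project_dir <- sub("/Files for analysis/.*$", "", files)

# Count files per project and return the result as a data frame.
per_project <- as.data.frame(table(project = project_dir),
                             responseName = "n_files",
                             stringsAsFactors = FALSE)
per_project
```

This needs only a single list.files() call, and the same `project_dir` vector can be passed to split(files, project_dir) if the file names per project are wanted rather than just the counts.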