c# - Remove old Files from FileInfo list


I have a FileInfo list of more than 200 log files from a directory. Most of the files need to be in the list, but there are a few files that should be ignored.

Here is an example of the File-List:

  • A300a1_ContentLink.log
  • A301a20_ContentLink.log
  • A1_4a0_ContentLink.log
  • B200a101_ContentLink.log
  • B200a101_ContentLink_20221208_115905.log
  • B200a101_ContentLink_20221208_115907.log
  • B200a101_ContentLink_20221208_120647.log
  • B201a1_ContentLink.log
  • B202a0_ContentLink.log

Explanation of the file name: the first characters refer to a room (e.g. room A300 or A1). A room can have any designation, e.g. B200, CXS2 or just CDD. The next part is the device name (e.g. device a1 or device a20). Each device name starts with a, followed by 1-3 digits. The last part of each file name is "_ContentLink".

All files with a further suffix, like _20221208_115905, are copies of older versions that are needed by other programs, but not in my list.

My problem is that I only need the newest file of each log file in my FileInfo list.

I initialized a FileInfo[] allFiles that contains all of the files of the directory. Next I initialized a new FileInfo[] in which I would like to store only the newest version of each file.

My first attempt was to compare the LastWriteTime:

            FileInfo currentFile = allFiles[0];

            foreach (FileInfo file in allFiles)
            {
                if (file.LastWriteTime > currentFile.LastWriteTime)
                {
                    currentFile = file;
                }
            }

But I only get back the latest file of the whole folder.

Now I am thinking about using regular expressions instead of .LastWriteTime, to exclude all files that have a suffix after ContentLink.

But I don't know how to do that, or how to remove the outdated files from the list of all files (or transfer only the relevant ones to a new FileInfo[] list).

Thank you in advance for your ideas.


Solution

  • You can use a LINQ query to:

    • extract the name and time part from each file name
    • group the files by name and
    • select the latest (maximum) file by time

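    Something like:

    // Requires: using System.Linq; using System.Text.RegularExpressions;
    // Group 1: room and device name; group 2: optional timestamp suffix.
    var regex = new Regex(@"^(.*?)_ContentLink(.*?)\.log$");

    var latest = allFiles
        .Select(f =>
        {
            var parts = regex.Match(f.Name);
            return new
            {
                File = f,
                Name = parts.Groups[1].Value,
                Date = parts.Groups[2].Value
            };
        })
        .GroupBy(f => f.Name)
        // Within each group, keep the file with the greatest timestamp suffix.
        .Select(g => g.MaxBy(f => f.Date).File)
        .ToArray();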
    
    foreach(var file in latest)
    {
        Console.WriteLine(file.Name);
    }
    

    This produces

    A300a1_ContentLink.log
    A301a20_ContentLink.log
    A1_4a0_ContentLink.log
    B200a101_ContentLink_20221208_120647.log
    B201a1_ContentLink.log
    B202a0_ContentLink.log
    

    MaxBy was added in .NET 6. Before that you can use the equivalent method from the MoreLINQ library.
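
    If adding a library is not an option, one alternative on older frameworks is to sort each group in descending order and take the first element. The following is only a minimal sketch under some assumptions: the names array is made-up sample data standing in for allFiles, plain strings stand in for FileInfo so that the snippet runs on its own, and it is written with top-level statements for brevity (in an older project the body would go inside a method).

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Sample data (hypothetical stand-in for allFiles.Select(f => f.Name)).
    var names = new[]
    {
        "A300a1_ContentLink.log",
        "B200a101_ContentLink.log",
        "B200a101_ContentLink_20221208_115905.log",
        "B200a101_ContentLink_20221208_120647.log",
    };

    var regex = new Regex(@"^(.*?)_ContentLink(.*?)\.log$");

    var latest = names
        .Select(n =>
        {
            var m = regex.Match(n);
            return new { FileName = n, Name = m.Groups[1].Value, Date = m.Groups[2].Value };
        })
        .GroupBy(x => x.Name)
        // Sort each group by the timestamp suffix, newest first, and keep the first entry.
        // Ordinal comparison is enough here because the suffix is zero-padded.
        .Select(g => g.OrderByDescending(x => x.Date, StringComparer.Ordinal).First().FileName)
        .ToArray();

    foreach (var name in latest)
    {
        Console.WriteLine(name);
    }
    ```

    Sorting is O(n log n) per group instead of the single pass that MaxBy makes, which should not matter for a few hundred files.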

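    The regular expression captures the shortest possible string before _ContentLink in the first group (.*?) and whatever lies between _ContentLink and the .log extension in the second group. For files without a timestamp the second group is empty, and an empty string sorts before any timestamp, so a timestamped copy wins whenever one exists.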
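
    To see what the two groups capture, here is a small check against two of the sample names from the question. It is only an illustration; the pattern is written with an escaped dot and an end anchor (\.log$) so that the extension is matched literally.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    var regex = new Regex(@"^(.*?)_ContentLink(.*?)\.log$");

    var samples = new[]
    {
        "A1_4a0_ContentLink.log",
        "B200a101_ContentLink_20221208_115905.log",
    };

    foreach (var sample in samples)
    {
        var m = regex.Match(sample);
        // Group 1 is the room/device prefix, group 2 the optional timestamp suffix.
        Console.WriteLine($"name='{m.Groups[1].Value}' date='{m.Groups[2].Value}'");
    }
    ```

    For the first name this prints name='A1_4a0' date='', and for the second name='B200a101' date='_20221208_115905'.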

    You could get a bit fancier and use different regular expressions to capture the name and time part. Combined with local functions, this results in a somewhat cleaner query:

        var nameRex = new Regex(@"^(.*?)_ContentLink.*\.log$");
        var timeRex = new Regex(@"^.*_ContentLink(.*?)\.log$");

        // Room and device name, i.e. everything before _ContentLink.
        string NamePart(FileInfo f)
        {
            return nameRex.Match(f.Name).Groups[1].Value;
        }

        // Optional timestamp suffix between _ContentLink and .log.
        string TimePart(FileInfo f)
        {
            return timeRex.Match(f.Name).Groups[1].Value;
        }

        var latest = allFiles
            .GroupBy(NamePart)
            .Select(g => g.MaxBy(TimePart))
            .ToArray();
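
    Comparing the time part as a string works because the suffix is zero-padded. If you would rather compare real dates, the suffix can be parsed first. The sketch below assumes the suffix has the form yyyyMMdd_HHmmss (inferred from the sample names, not confirmed) and that a file without a suffix should sort before any timestamped copy, which mirrors the string comparison above. TimeStamp is a hypothetical helper, and plain strings stand in for FileInfo so the snippet runs on its own.

    ```csharp
    using System;
    using System.Globalization;
    using System.Linq;
    using System.Text.RegularExpressions;

    var stampRex = new Regex(@"_ContentLink_(\d{8}_\d{6})\.log$");

    // Returns the parsed timestamp, or DateTime.MinValue when the name has no
    // (parsable) suffix. Assumption: unsuffixed files sort before timestamped ones.
    DateTime TimeStamp(string fileName)
    {
        var m = stampRex.Match(fileName);
        return m.Success &&
               DateTime.TryParseExact(m.Groups[1].Value, "yyyyMMdd_HHmmss",
                   CultureInfo.InvariantCulture, DateTimeStyles.None, out var parsed)
            ? parsed
            : DateTime.MinValue;
    }

    // Sample data: all versions of one log file.
    var versions = new[]
    {
        "B200a101_ContentLink.log",
        "B200a101_ContentLink_20221208_115905.log",
        "B200a101_ContentLink_20221208_120647.log",
    };

    var newest = versions.OrderByDescending(TimeStamp).First();
    Console.WriteLine(newest);
    ```

    In the query above, the same idea would replace TimePart with a function that takes a FileInfo and returns a DateTime.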