Filtering a List with two other Lists


Consider the following three example lists:

List<string> localPatientsIDs = new List<string> { "1550615", "1688", "1760654", "1940629", "34277", "48083" };

List<string> remotePatientsIDs = new List<string> { "000-007", "002443", "002446", "214", "34277", "48083" };

List<string> archivedFiles = new List<string>{
    @"G:\Archive\000-007_20230526175817297.zip",
    @"G:\Archive\002443_20230526183639562.zip",
    @"G:\Archive\002446_20230526183334407.zip",
    @"G:\Archive\14967_20240703150011899.zip",
    @"G:\Archive\214_20231213150003676.zip",
    @"G:\Archive\34277_20230526200048891.zip",
    @"G:\Archive\48083_20240214150011919.zip" };

Please note that each element in archivedFiles is the full path of a ZIP file, whose name begins with the patientID that is either in localPatientsIDs or remotePatientsIDs.

For example, in @"G:\Archive\000-007_20230526175817297.zip" the file name 000-007_20230526175817297.zip begins with 000-007, which is an element of the list remotePatientsIDs.

A patientID cannot be in localPatientsIDs and archivedFiles simultaneously; therefore, no duplicates are allowed between these two lists. However, archivedFiles can contain patientIDs that are also present in remotePatientsIDs.

I need to get the elements of archivedFiles whose file names begin with an element that is present in remotePatientsIDs but not present in localPatientsIDs. The end goal is to unzip those files to the directory that contains the localPatientsIDs database.

For the given example, I would expect to have the following result:

archivedFilesToUnzip == {
    @"G:\Archive\000-007_20230526175817297.zip",
    @"G:\Archive\002443_20230526183639562.zip",
    @"G:\Archive\002446_20230526183334407.zip",
    @"G:\Archive\214_20231213150003676.zip" }

So, how can I use LINQ to do this?

With my limited knowledge, I would expect it to be as simple as:

List<string> archivedFilesToUnzip = archivedFiles.Where(name => name.Contains(remotePatients.Except(localPatients)))

This does not even compile, probably because Contains cannot iterate over the list members. I get the message:

CS1503: Argument 1: cannot convert from 'System.Collections.Generic.IEnumerable<string>' to 'string'

My best attempt so far is the following statement (I confess it seems a little messy to me). It always returns an empty list.

List<string> archivedFilesToUnzip = archivedFiles.Where(name => archivedFiles.Any(x => x.ToString().Contains(remotePatients.Except(localPatients).ToString()))).ToList();

I've found some posts that helped me better understand the differences between Where and Select, and I've looked for guidance on using LINQ in other places as well, but I still cannot find a working solution.


Solution

  • C# is a statically (and mostly strongly) typed language (see the question "What is the difference between a strongly typed language and a statically typed language?" and the article "The C# type system" if you want to dive deeper). This means that the compiler checks variable types and rejects many mistakes, such as comparing a string with a boolean.

    remotePatients.Except(localPatients) is a collection of strings, while name in archivedFiles.Where(name => name.Contains(...)) is "just" a string. Contains on a string accepts either a char (a single character) or another string, not a collection of strings, hence the compilation error.
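The type mismatch can be seen in isolation with a small sketch. The variable names here (name, candidates, containsOne, containsAny) are illustrative only and are not part of the question's code; the point is just which argument types Contains accepts and how a collection can be tested explicitly instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string name = @"G:\Archive\000-007_20230526175817297.zip";
IEnumerable<string> candidates = new List<string> { "000-007", "002443" };

// string.Contains takes one string (or one char), so this compiles:
bool containsOne = name.Contains("000-007");

// Passing the whole collection is what triggers CS1503:
// bool broken = name.Contains(candidates);

// To test against several strings, iterate explicitly, for example with Any.
// Note: a plain substring test could also match digits inside the timestamp,
// so this only illustrates the types involved, not a complete solution.
bool containsAny = candidates.Any(id => name.Contains(id));

Console.WriteLine($"{containsOne} {containsAny}");
```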

    Your second attempt compiles but does not achieve anything meaningful: if you assign remotePatients.Except(localPatients).ToString() to a variable and examine it, or print the result to the console, you will see just the type name (System.Linq.Enumerable+<ExceptIterator>d__99`1[System.String], to be exact), which obviously is not part of any file name.

    As for your question, I would suggest the following:
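A quick way to check this is to print both the ToString() result and the joined elements of the query. This is a minimal sketch with shortened, made-up list contents (remote and local are hypothetical names); the exact type name printed on the first line may differ between runtimes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var remote = new List<string> { "000-007", "214", "34277" };
var local = new List<string> { "34277" };

// ToString() on the lazy query describes the iterator object, not its elements.
string typeName = remote.Except(local).ToString() ?? string.Empty;

// Joining the elements is what actually exposes the ids.
string contents = string.Join(", ", remote.Except(local));

Console.WriteLine(typeName);
Console.WriteLine(contents); // 000-007, 214
```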

    // requires: using System.Linq; using System.Text.RegularExpressions;
    // the lists are named as declared in the question

    // build the diff hash set for quick lookup of the ids to add
    // (improves performance if there are "many" ids)
    var missing = remotePatientsIDs.Except(localPatientsIDs)
        .ToHashSet();

    // regular expression to extract the id from the file name
    // you can implement this logic without regex if needed
    var regex = new Regex(@"\\(?<id>[\d-]+)_\d+\.zip");

    // the result
    List<string> archivedFilesToUnzip = archivedFiles
        .Where(name =>
        {
            var match = regex.Match(name); // check the file name for an id
            if (match.Success) // id found
            {
                // extract the id from the file name
                var id = match.Groups["id"].Value;
                return missing.Contains(id); // check if it should be added
            }

            // failed to match the pattern for an id
            // you could throw here instead, to fix the pattern or check the file name
            return false;
        })
        .ToList();
    

    This uses a regular expression to extract the id from the file name and then looks it up in the "missing" ids.

    An explanation of this particular regular expression can be found at regex101.
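If a regular expression is not wanted, the id can also be taken as everything in the file name before the first underscore. The following is a minimal sketch of that variant using the question's sample data. ExtractId is a hypothetical helper introduced here for illustration, and it assumes the file names always follow the id_timestamp.zip layout shown in the question. The backslash is located by hand rather than with Path.GetFileName so that the sketch behaves the same on non-Windows systems, where a backslash is not treated as a directory separator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var localPatientsIDs = new List<string> { "1550615", "1688", "1760654", "1940629", "34277", "48083" };
var remotePatientsIDs = new List<string> { "000-007", "002443", "002446", "214", "34277", "48083" };
var archivedFiles = new List<string>
{
    @"G:\Archive\000-007_20230526175817297.zip",
    @"G:\Archive\002443_20230526183639562.zip",
    @"G:\Archive\002446_20230526183334407.zip",
    @"G:\Archive\14967_20240703150011899.zip",
    @"G:\Archive\214_20231213150003676.zip",
    @"G:\Archive\34277_20230526200048891.zip",
    @"G:\Archive\48083_20240214150011919.zip"
};

// Ids that exist remotely but not locally.
var missing = remotePatientsIDs.Except(localPatientsIDs).ToHashSet();

// Hypothetical helper: returns the part of the file name before the first '_',
// or an empty string when the name does not follow the assumed layout.
string ExtractId(string path)
{
    string fileName = path.Substring(path.LastIndexOf('\\') + 1);
    int underscore = fileName.IndexOf('_');
    return underscore > 0 ? fileName.Substring(0, underscore) : string.Empty;
}

List<string> archivedFilesToUnzip = archivedFiles
    .Where(path => missing.Contains(ExtractId(path)))
    .ToList();

foreach (var path in archivedFilesToUnzip)
{
    Console.WriteLine(path);
}
```

Comparing the whole extracted id against the set, rather than using StartsWith or Contains on the path, avoids accidental matches such as an id that happens to be a prefix of another id.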