Reading and grouping files from a directory in Zig


I have a folder of images that contains several thousand images in 3 different formats: png, jpg and webp. For instance: boat.png, boat.webp, plane.jpg, plane.png, plane.webp.

I decided to learn Zig and write a program that goes through this directory and, for each "basename" (e.g. boat or plane), deletes all but the smallest file.

And yes, I tried the AIs; they don't know enough Zig.

I have got most of the grouping done but I am worried by the output:

Lets open the ../tests directory
File DSC_0031.webp (fs.File.Kind.file)
Filename DSC_0031.webp
Basename: DSC_0031
File DSC_0031.png (fs.File.Kind.file)
Filename DSC_0031.png
Basename: DSC_0031
File DSC_0025.png (fs.File.Kind.file)
Filename DSC_0025.png
Basename: DSC_0025
File DSC_0025.JPG (fs.File.Kind.file)
Filename DSC_0025.JPG
Basename: DSC_0025
File DSC_0031.JPG (fs.File.Kind.file)
Filename DSC_0031.JPG
Basename: DSC_0031
File DSC_0027.webp (fs.File.Kind.file)
Filename DSC_0027.webp
Basename: DSC_0027
File DSC_0050.webp (fs.File.Kind.file)
Filename DSC_0050.webp
Basename: DSC_0050
File DSC_0007.webp (fs.File.Kind.file)
Filename DSC_0007.webp
Basename: DSC_0007
File DSC_0027.JPG (fs.File.Kind.file)
Filename DSC_0027.JPG
Basename: DSC_0027
File DSC_0033.JPG (fs.File.Kind.file)
Filename DSC_0033.JPG
Basename: DSC_0033
File DSC_0033.png (fs.File.Kind.file)
Filename DSC_0033.png
Basename: DSC_0033
File DSC_0027.png (fs.File.Kind.file)
Filename DSC_0027.png
Basename: DSC_0027
File DSC_0046.webp (fs.File.Kind.file)
Filename DSC_0046.webp
Basename: DSC_0046
File DSC_0026.png (fs.File.Kind.file)
Filename DSC_0026.png
Basename: DSC_0026
File DSC_0032.png (fs.File.Kind.file)
Filename DSC_0032.png
Basename: DSC_0032
File DSC_0032.JPG (fs.File.Kind.file)
Filename DSC_0032.JPG
Basename: DSC_0032
File DSC_0026.JPG (fs.File.Kind.file)
Filename DSC_0026.JPG
Basename: DSC_0026
File DSC_0036.JPG (fs.File.Kind.file)
Filename DSC_0036.JPG
Basename: DSC_0036
File .DS_Store (fs.File.Kind.file)
Filename .DS_Store
Basename: .DS_Store
File DSC_0036.png (fs.File.Kind.file)
Filename DSC_0036.png
Basename: DSC_0036
File DSC_0037.png (fs.File.Kind.file)
Filename DSC_0037.png
Basename: DSC_0037
File DSC_0047.webp (fs.File.Kind.file)
Filename DSC_0047.webp
Basename: DSC_0047
File DSC_0037.JPG (fs.File.Kind.file)
Filename DSC_0037.JPG
Basename: DSC_0037
File DSC_0006.webp (fs.File.Kind.file)
Filename DSC_0006.webp
Basename: DSC_0006
File DSC_0051.webp (fs.File.Kind.file)
Filename DSC_0051.webp
Basename: DSC_0051
File DSC_0026.webp (fs.File.Kind.file)
Filename DSC_0026.webp
Basename: DSC_0026
File DSC_0035.JPG (fs.File.Kind.file)
Filename DSC_0035.JPG
Basename: DSC_0035
File DSC_0035.png (fs.File.Kind.file)
Filename DSC_0035.png
Basename: DSC_0035
File DSC_0034.png (fs.File.Kind.file)
Filename DSC_0034.png
Basename: DSC_0034
File DSC_0034.JPG (fs.File.Kind.file)
Filename DSC_0034.JPG
Basename: DSC_0034
File DSC_0056.webp (fs.File.Kind.file)
Filename DSC_0056.webp
Basename: DSC_0056
File DSC_0053.JPG (fs.File.Kind.file)
Filename DSC_0053.JPG
Basename: DSC_0053
File DSC_0047.JPG (fs.File.Kind.file)
Filename DSC_0047.JPG
Basename: DSC_0047
File DSC_0047.png (fs.File.Kind.file)
Filename DSC_0047.png
Basename: DSC_0047
File DSC_0053.png (fs.File.Kind.file)
Filename DSC_0053.png
Basename: DSC_0053
File DSC_0040.webp (fs.File.Kind.file)
Filename DSC_0040.webp
Basename: DSC_0040
File DSC_0052.png (fs.File.Kind.file)
Filename DSC_0052.png
Basename: DSC_0052
File DSC_0046.png (fs.File.Kind.file)
Filename DSC_0046.png
Basename: DSC_0046
File DSC_0046.JPG (fs.File.Kind.file)
Filename DSC_0046.JPG
Basename: DSC_0046
File DSC_0052.JPG (fs.File.Kind.file)
Filename DSC_0052.JPG
Basename: DSC_0052
File DSC_0044.JPG (fs.File.Kind.file)
Filename DSC_0044.JPG
Basename: DSC_0044
File DSC_0050.JPG (fs.File.Kind.file)
Filename DSC_0050.JPG
Basename: DSC_0050
File DSC_0050.png (fs.File.Kind.file)
Filename DSC_0050.png
Basename: DSC_0050
File DSC_0044.png (fs.File.Kind.file)
Filename DSC_0044.png
Basename: DSC_0044
File DSC_0037.webp (fs.File.Kind.file)
Filename DSC_0037.webp
Basename: DSC_0037
File DSC_0045.png (fs.File.Kind.file)
Filename DSC_0045.png
Basename: DSC_0045
File DSC_0051.png (fs.File.Kind.file)
Filename DSC_0051.png
Basename: DSC_0051
File DSC_0051.JPG (fs.File.Kind.file)
Filename DSC_0051.JPG
Basename: DSC_0051
File DSC_0045.JPG (fs.File.Kind.file)
Filename DSC_0045.JPG
Basename: DSC_0045
File DSC_0041.JPG (fs.File.Kind.file)
Filename DSC_0041.JPG
Basename: DSC_0041
File DSC_0055.JPG (fs.File.Kind.file)
Filename DSC_0055.JPG
Basename: DSC_0055
File DSC_0036.webp (fs.File.Kind.file)
Filename DSC_0036.webp
Basename: DSC_0036
File DSC_0055.png (fs.File.Kind.file)
Filename DSC_0055.png
Basename: DSC_0055
File DSC_0041.png (fs.File.Kind.file)
Filename DSC_0041.png
Basename: DSC_0041
File DSC_0040.png (fs.File.Kind.file)
Filename DSC_0040.png
Basename: DSC_0040
File DSC_0054.png (fs.File.Kind.file)
Filename DSC_0054.png
Basename: DSC_0054
File DSC_0054.JPG (fs.File.Kind.file)
Filename DSC_0054.JPG
Basename: DSC_0054
File DSC_0040.JPG (fs.File.Kind.file)
Filename DSC_0040.JPG
Basename: DSC_0040
File DSC_0056.JPG (fs.File.Kind.file)
Filename DSC_0056.JPG
Basename: DSC_0056
File DSC_0042.JPG (fs.File.Kind.file)
Filename DSC_0042.JPG
Basename: DSC_0042
File DSC_0042.png (fs.File.Kind.file)
Filename DSC_0042.png
Basename: DSC_0042
File DSC_0056.png (fs.File.Kind.file)
Filename DSC_0056.png
Basename: DSC_0056
File DSC_0057.png (fs.File.Kind.file)
Filename DSC_0057.png
Basename: DSC_0057
File DSC_0043.png (fs.File.Kind.file)
Filename DSC_0043.png
Basename: DSC_0043
File DSC_0041.webp (fs.File.Kind.file)
Filename DSC_0041.webp
Basename: DSC_0041
File DSC_0043.JPG (fs.File.Kind.file)
Filename DSC_0043.JPG
Basename: DSC_0043
File DSC_0057.JPG (fs.File.Kind.file)
Filename DSC_0057.JPG
Basename: DSC_0057
File DSC_0057.webp (fs.File.Kind.file)
Filename DSC_0057.webp
Basename: DSC_0057
File DSC_0039.webp (fs.File.Kind.file)
Filename DSC_0039.webp
Basename: DSC_0039
File DSC_0042.webp (fs.File.Kind.file)
Filename DSC_0042.webp
Basename: DSC_0042
File DSC_0054.webp (fs.File.Kind.file)
Filename DSC_0054.webp
Basename: DSC_0054
File DSC_0059.JPG (fs.File.Kind.file)
Filename DSC_0059.JPG
Basename: DSC_0059
File DSC_0035.webp (fs.File.Kind.file)
Filename DSC_0035.webp
Basename: DSC_0035
File DSC_0059.png (fs.File.Kind.file)
Filename DSC_0059.png
Basename: DSC_0059
File DSC_0058.png (fs.File.Kind.file)
Filename DSC_0058.png
Basename: DSC_0058
File DSC_0058.JPG (fs.File.Kind.file)
Filename DSC_0058.JPG
Basename: DSC_0058
File DSC_0058.webp (fs.File.Kind.file)
Filename DSC_0058.webp
Basename: DSC_0058
File DSC_0059.webp (fs.File.Kind.file)
Filename DSC_0059.webp
Basename: DSC_0059
File DSC_0048.JPG (fs.File.Kind.file)
Filename DSC_0048.JPG
Basename: DSC_0048
File DSC_0048.png (fs.File.Kind.file)
Filename DSC_0048.png
Basename: DSC_0048
File DSC_0034.webp (fs.File.Kind.file)
Filename DSC_0034.webp
Basename: DSC_0034
File DSC_0049.png (fs.File.Kind.file)
Filename DSC_0049.png
Basename: DSC_0049
File DSC_0049.JPG (fs.File.Kind.file)
Filename DSC_0049.JPG
Basename: DSC_0049
File DSC_0055.webp (fs.File.Kind.file)
Filename DSC_0055.webp
Basename: DSC_0055
File DSC_0014.webp (fs.File.Kind.file)
Filename DSC_0014.webp
Basename: DSC_0014
File DSC_0043.webp (fs.File.Kind.file)
Filename DSC_0043.webp
Basename: DSC_0043
File DSC_0038.webp (fs.File.Kind.file)
Filename DSC_0038.webp
Basename: DSC_0038
File DSC_0025.webp (fs.File.Kind.file)
Filename DSC_0025.webp
Basename: DSC_0025
File DSC_0005.JPG (fs.File.Kind.file)
Filename DSC_0005.JPG
Basename: DSC_0005
File DSC_0039.JPG (fs.File.Kind.file)
Filename DSC_0039.JPG
Basename: DSC_0039
File DSC_0005.png (fs.File.Kind.file)
Filename DSC_0005.png
Basename: DSC_0005
File DSC_0039.png (fs.File.Kind.file)
Filename DSC_0039.png
Basename: DSC_0039
File DSC_0033.webp (fs.File.Kind.file)
Filename DSC_0033.webp
Basename: DSC_0033
File DSC_0048.webp (fs.File.Kind.file)
Filename DSC_0048.webp
Basename: DSC_0048
File DSC_0038.png (fs.File.Kind.file)
Filename DSC_0038.png
Basename: DSC_0038
File DSC_0038.JPG (fs.File.Kind.file)
Filename DSC_0038.JPG
Basename: DSC_0038
File DSC_0006.JPG (fs.File.Kind.file)
Filename DSC_0006.JPG
Basename: DSC_0006
File DSC_0006.png (fs.File.Kind.file)
Filename DSC_0006.png
Basename: DSC_0006
File DSC_0044.webp (fs.File.Kind.file)
Filename DSC_0044.webp
Basename: DSC_0044
File DSC_0007.png (fs.File.Kind.file)
Filename DSC_0007.png
Basename: DSC_0007
File DSC_0007.JPG (fs.File.Kind.file)
Filename DSC_0007.JPG
Basename: DSC_0007
File DSC_0005.webp (fs.File.Kind.file)
Filename DSC_0005.webp
Basename: DSC_0005
File DSC_0052.webp (fs.File.Kind.file)
Filename DSC_0052.webp
Basename: DSC_0052
File DSC_0053.webp (fs.File.Kind.file)
Filename DSC_0053.webp
Basename: DSC_0053
File DSC_0045.webp (fs.File.Kind.file)
Filename DSC_0045.webp
Basename: DSC_0045
File DSC_0014.JPG (fs.File.Kind.file)
Filename DSC_0014.JPG
Basename: DSC_0014
File DSC_0014.png (fs.File.Kind.file)
Filename DSC_0014.png
Basename: DSC_0014
File DSC_0049.webp (fs.File.Kind.file)
Filename DSC_0049.webp
Basename: DSC_0049
File DSC_0032.webp (fs.File.Kind.file)
Filename DSC_0032.webp
Basename: DSC_0032

   (
   DSC_0031.png
   DSC_0031.JPG

   (
   DSC_0025.JPG

   (
   DSC_0027.JPG
   DSC_0027.png

   (

   (

   (
   DSC_0033.png

   (

   (
   DSC_0026.JPG

   (
   DSC_0032.JPG

   (
   DSC_0036.png

   

   (
   DSC_0037.JPG

   (
DSC_0044
   DSC_0044.webp
   DSC_0045.JPG
   DSC_0059.png
   DSC_0059.webp
   DSC_0044.webp
DSC_0007
   DSC_0007.png
   DSC_0051.png
   DSC_0051.JPG
   DSC_0041.JPG
   DSC_0041.png
   DSC_0041.webp
   DSC_0058.png
   DSC_0058.JPG
   DSC_0058.webp
   DSC_0007.png
   DSC_0007.JPG
DSC_0007
   DSC_0007.JPG
   DSC_0055.JPG
   DSC_0055.png
DSC_0005
   DSC_0005.web
   DSC_0035.png
   DSC_0036.webp
   DSC_0005.webp
DSC_0053
   DSC_0053.web
   DSC_0034.JPG
   DSC_0048.JPG
   DSC_0048.png
   DSC_0048.webp
   DSC_0053.webp
DSC_0014
   DSC_0014.JPG
   DSC_0054.png
   DSC_0054.JPG
   DSC_0054.webp
   DSC_0034.webp
   DSC_0014.JPG
   DSC_0014.png
DSC_0014
   DSC_0014.png
   DSC_0053.png
   DSC_0049.png
   DSC_0049.JPG
DSC_0049
   DSC_0049.web
   DSC_0047.png
   DSC_0040.png
   DSC_0040.JPG
   DSC_0049.webp
DSC_0043
   DSC_0043.webp
   DSC_0042.JPG
   DSC_0042.png
   DSC_0042.webp
   DSC_0043.webp
DSC_0038
   DSC_0038.web
   DSC_0052.JPG
   DSC_0056.JPG
   DSC_0056.png
   DSC_0038.webp
   DSC_0038.png
   DSC_0038.JPG
DSC_0025
   DSC_0025.web
   DSC_0046.JPG
   DSC_0057.png
   DSC_0057.JPG
   DSC_0057.webp
   DSC_0025.webp
DSC_0005
   DSC_0005.png
   DSC_0044.png
   DSC_0043.png
   DSC_0043.JPG
   DSC_0005.JPG
   DSC_0005.png
DSC_0039
   DSC_0039.png
   DSC_0050.png
   DSC_0039.JPG
   DSC_0039.png
DSC_0038
   DSC_0038.png
DSC_0038
   DSC_0038.JPG
DSC_0048
   DSC_0048.webp
DSC_0006
   DSC_0006.JPG
   DSC_0006.JPG
   DSC_0006.png
DSC_0006
   DSC_0006.png
DSC_0032
   DSC_0032.webp
   DSC_0032.webp
DSC_0014
   DSC_0014.webp
DSC_0033
   DSC_0033.webp
DSC_0052
   DSC_0052.webp
DSC_0045
   DSC_0045.webp

Here is the code:

const std = @import("std");
const fs = std.fs;
const path = std.fs.path;
const mem = std.mem;

const fileStruct = struct { baseName: []const u8, files: [][]const u8 };

pub fn main() !void {
    const args = try std.process.argsAlloc(std.heap.page_allocator);
    defer std.process.argsFree(std.heap.page_allocator, args);

    if (args.len < 2) {
        std.debug.print("Usage: {s} <directory>\n", .{args[0]});
        return;
    }

    const dir_path: []u8 = args[1];
    std.debug.print("Lets open the {s} directory\n", .{dir_path});

    // const allocator = std.heap.page_allocator;
    var target_dir = try fs.cwd().openDir(dir_path, .{});
    defer target_dir.close();

    var iter = target_dir.iterate();

    var groupedFiles = std.ArrayList(fileStruct).init(std.heap.page_allocator);

    while (try iter.next()) |entry| {
        // Ensure we only process files
        if (entry.kind != fs.Dir.Entry.Kind.file) continue;

        std.debug.print("File {s} ({})\n", .{ entry.name, entry.kind });

        var baseName = path.basename(entry.name);
        std.debug.print("Filename {s}\n", .{baseName});

        const extension = path.extension(baseName);

        if (extension.len != 0) {
            baseName = baseName[0 .. baseName.len - extension.len];
        }

        std.debug.print("Basename: {s}\n", .{baseName});

        var foundGroup: bool = false;

        for (groupedFiles.items) |*group| {
            if (mem.eql(u8, group.baseName, baseName)) {
                const new_files = try std.heap.page_allocator.alloc([]const u8, group.files.len + 1);
                for (group.files, 0..) |file, i| {
                    new_files[i] = file;
                }
                new_files[group.files.len] = try std.heap.page_allocator.dupe(u8, entry.name);
                group.files = new_files;
                foundGroup = true;
                break;
            }
        }

        if (!foundGroup) {
            const new_files = try std.heap.page_allocator.alloc([]const u8, 1);
            new_files[0] = entry.name;
            const newGroup = fileStruct{ .baseName = baseName, .files = new_files }; // Use & to coerce to slice
            try groupedFiles.append(newGroup);
        }
    }

    for (groupedFiles.items) |item| {
        std.debug.print("{s}\n", .{item.baseName});
        for (item.files) |file| {
            std.debug.print("   {s}\n", .{file});
        }
    }
}

Can someone please explain the output? If you can point me in the right direction for deletion of big files, that'd be helpful too, but first, what is going on?


Solution

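  • The variable entry is only valid for a single iteration, because the iterator reuses the same memory for each entry. But in the if (!foundGroup) section you keep slices that point into entry.name (baseName is itself a slice of entry.name). A stored slice keeps its original pointer and length while the bytes behind it change, which is why later names show up under the wrong group or look truncated (for example DSC_0005.web):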

    new_files[0] = entry.name;
    const newGroup = fileStruct{ .baseName = baseName, .files = new_files };
    

    You can fix this by copying the strings into memory that you own, using allocator.dupe:

    new_files[0] = try std.heap.page_allocator.dupe(u8, entry.name);
    const newGroup = fileStruct{
        .baseName = try std.heap.page_allocator.dupe(u8, baseName),
        .files = new_files
    };
    

    This fixes the output of your program.
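
    To see the effect in isolation, here is a minimal sketch. It assumes a Zig version in the same range as the question's code (roughly 0.12 to 0.14), and nextName is a hypothetical stand-in for the iterator, not a std API. It only imitates the "same buffer for every entry" behaviour described above:

    ```zig
    const std = @import("std");

    // Hypothetical stand-in for the directory iterator (not a std API): every
    // call writes the name into the same buffer and returns a slice of it.
    fn nextName(buf: []u8, name: []const u8) []const u8 {
        @memcpy(buf[0..name.len], name);
        return buf[0..name.len];
    }

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer std.debug.assert(gpa.deinit() == .ok);
        const allocator = gpa.allocator();

        var buf: [32]u8 = undefined;

        const borrowed = nextName(&buf, "boat.png"); // still points into buf
        const owned = try allocator.dupe(u8, borrowed); // independent copy
        defer allocator.free(owned);

        _ = nextName(&buf, "plane.webp"); // overwrites the bytes behind `borrowed`

        std.debug.print("borrowed: {s}\n", .{borrowed}); // prints "borrowed: plane.we"
        std.debug.print("owned:    {s}\n", .{owned}); // prints "owned:    boat.png"
    }
    ```

    The borrowed slice keeps its length of 8 but now shows the first 8 bytes of the next name, the same pattern as the truncated .web entries in your output.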


    Also, as pointed out in the comments, you never free the memory you allocate. You can use the GeneralPurposeAllocator, which helps with memory leak detection by reporting leaks when it is deinitialized:

    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();
    

    To free the memory of the grouped files, you can do something like this:

    var groupedFiles = std.ArrayList(fileStruct).init(allocator);
    defer {
        for (groupedFiles.items) |group| {
            for (group.files) |file| {
                allocator.free(file);
            }
            allocator.free(group.files);
            allocator.free(group.baseName);
        }
        groupedFiles.deinit();
    }
    

    You're also leaking memory here:

    group.files = new_files;
    

    Either call allocator.free(group.files) before that line (this frees only the old outer slice, not the file names already copied into new_files), or use realloc:

    const new_files = try allocator.realloc(group.files, group.files.len + 1);
    new_files[group.files.len] = try allocator.dupe(u8, entry.name);
    group.files = new_files;
    foundGroup = true;
    break;
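
    As for deleting the bigger files, one possible approach is sketched below. It assumes each group's files slice holds names relative to the directory you opened, and that the Zig version matches the question's code. smallestIndex and keepSmallest are hypothetical helper names; statFile and deleteFile are methods of std.fs.Dir. Treat it as a starting point and try it on a copy of the folder first:

    ```zig
    const std = @import("std");

    // Hypothetical helper: index of the smallest file among `names`,
    // where every name is relative to `dir`. On a tie the first one wins.
    fn smallestIndex(dir: std.fs.Dir, names: []const []const u8) !usize {
        var best: usize = 0;
        var best_size: u64 = std.math.maxInt(u64);
        for (names, 0..) |name, i| {
            const stat = try dir.statFile(name);
            if (stat.size < best_size) {
                best_size = stat.size;
                best = i;
            }
        }
        return best;
    }

    // Hypothetical helper: deletes every file in `names` except the smallest.
    // Groups with fewer than two files are left untouched.
    fn keepSmallest(dir: std.fs.Dir, names: []const []const u8) !void {
        if (names.len < 2) return;
        const keep = try smallestIndex(dir, names);
        for (names, 0..) |name, i| {
            if (i != keep) try dir.deleteFile(name);
        }
    }
    ```

    With the grouping fixed, you could call try keepSmallest(target_dir, group.files); for each group in the final loop, once the directory iteration has finished.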