Tags: windows, cygwin, extract, 7zip, compression

Extract specific file extensions from multiple 7-zip files


I have a RAR file and a ZIP file. Within these two there is a folder. Inside the folder there are several 7-zip (.7z) files. Inside every 7z there are multiple files with the same extension, but whose names vary.

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

I want to extract only the files I need out of thousands: those whose names contain a certain substring. For example, a compressed file should be extracted if its name contains '[!]', '(U)' or '(J)'.

I can extract the folder without problem so I have this structure:

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

I'm in a Windows environment but I have Cygwin installed. How can I extract the files I need painlessly, ideally with a single command line?

Update

Some clarifications to the question:

  • The inner 7z files and the files inside them can have spaces in their names.
  • Some 7z files contain just one file, and that file doesn't meet the given criteria. Being the only candidate, it has to be extracted too.

Solution

Thanks to everyone. The bash solution was the one that helped me out. I wasn't able to test the Python 3 solutions because I had problems installing libraries with pip. I don't use Python, so I'll have to study it and work through the errors I hit with those solutions. For now, I've found a suitable answer.


Solution

  • This solution is based on bash, grep and awk. It works on Cygwin and on Ubuntu.

    Since you need to look for (X) [!].ext files first and fall back to (X).ext files only if there are none, I don't think a single expression can handle this logic.

    The solution needs some if/else logic that inspects the list of files inside each archive and decides which ones to extract.

    Here is the initial structure inside the zip/rar archive I tested my script on (I made a script to prepare this structure):

    folder
    ├── 7z_1.7z
    │   ├── (E).txt
    │   ├── (J) [!].txt
    │   ├── (J).txt
    │   ├── (U) [!].txt
    │   └── (U).txt
    ├── 7z_2.7z
    │   ├── (J) [b1].txt
    │   ├── (J) [b2].txt
    │   ├── (J) [o1].txt
    │   └── (J).txt
    ├── 7z_3.7z
    │   ├── (E) [!].txt
    │   ├── (J).txt
    │   └── (U).txt
    └── 7z 4.7z
        └── test.txt
    

    The output is this:

    output
    ├── 7z_1.7z           # This is a folder, not an archive
    │   ├── (J) [!].txt   # Here we extracted only files with [!]
    │   └── (U) [!].txt
    ├── 7z_2.7z
    │   └── (J).txt       # Here there are no [!] files, so we extracted (J)
    ├── 7z_3.7z
    │   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
    └── 7z 4.7z
        └── test.txt      # We had only one file here, extracted it
    

    And this is the script to do the extraction:

    #!/bin/bash
    
    # Remove the output (if it's left from previous runs).
    rm -rf output
    mkdir -p output
    
    # Unzip the zip archive.
    unzip data.zip -d output
    # For rar use
    #  unrar x data.rar output
    # OR
    #  7z x -ooutput data.rar
    
    for archive in output/folder/*.7z
    do
      # See https://stackoverflow.com/questions/7148604
      # Get the list of file names, remove the extra output of "7z l"
      list=$(7z l "$archive" | awk '
          /----/ {p = ++p % 2; next}
          $NF == "Name" {pos = index($0,"Name")}
          p {print substr($0,pos)}
      ')
      # Get the list of files with [!] (-F matches the string literally;
      # as a regex, "[!]" would match any name containing "!").
      extract_list=$(echo "$list" | grep -F '[!]')
      if [[ -z $extract_list ]]; then
        # If we don't have files with [!], then look for ([A-Z]) pattern
        # to get files with single letter in brackets.
        extract_list=$(echo "$list" | grep "([A-Z])\.")
      fi
      if [[ -z $extract_list ]]; then
        # If we only have one file - extract it.
        # $list is a plain string, not an array, so count its lines.
        if [[ -n $list && $(grep -c '' <<< "$list") -eq 1 ]]; then
          extract_list=$list
        fi
      fi
      if [[ -n $extract_list ]]; then
        # If we have files to extract, then do the extraction.
        # Output path is output/7zip_archive_name/
        out_path=output/$(basename "$archive")
        mkdir -p "$out_path"
        # -d '\n' (GNU xargs) keeps quotes and backslashes in names intact.
        echo "$extract_list" | xargs -d '\n' -I {} 7z x -o"$out_path" "$archive" {}
      fi
    done
    

    The basic idea is to loop over the 7-zip archives and get the list of files in each of them with the 7z l (list) command.

    The output of the command is quite verbose, so we use awk to clean it up and get the list of file names.
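    To see what the awk filter does without needing a real archive, here is a small sketch. The sample_listing function is a hypothetical stand-in that only mimics the column layout of a 7z l listing (header, dashed separator, entries, second separator, summary line); the actual output of your 7z version may differ in detail. names_from_listing is likewise just an illustrative name for the same awk idea used in the script.

```shell
#!/bin/bash

# Hypothetical stand-in for `7z l archive.7z`. Every row uses the same
# printf format, so the Name column always starts at the same position.
sample_listing() {
  fmt='%19s %5s %12s %12s  %s\n'
  dashes='------------------- ----- ------------ ------------  ------------------------'
  printf "$fmt" 'Date      Time' 'Attr' 'Size' 'Compressed' 'Name'
  printf '%s\n' "$dashes"
  printf "$fmt" '2020-01-01 12:00:00' '....A' '5' '20' '(J) [!].txt'
  printf "$fmt" '2020-01-01 12:00:00' '....A' '5' '' '(J).txt'
  printf "$fmt" '2020-01-01 12:00:00' '....A' '5' '' 'my file.txt'
  printf '%s\n' "$dashes"
  printf "$fmt" '2020-01-01 12:00:00' '' '15' '20' '3 files'
}

# Reads a listing on stdin and prints only the file names.
names_from_listing() {
  awk '
    /----/        { inside = !inside; next }   # toggle at each dashed separator
    $NF == "Name" { pos = index($0, "Name") }  # remember where the Name column starts
    inside        { print substr($0, pos) }    # keep only the name part of each entry
  '
}

sample_listing | names_from_listing
```

    With this sample the filter prints the three names, one per line, including the one that contains a space; the header and the summary line are dropped because they lie outside the two dashed separators.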

    After that we filter this list using grep to get either a list of [!] files or a list of (X) files. Then we just pass this list to 7zip to extract the files we need.
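
    The selection rules can also be tried out on plain name lists, without any archives. The sketch below is not part of the script above: select_names is a hypothetical helper that takes a newline-separated list of names and prints the ones that would be extracted. It assumes grep -F for the literal [!] match and counts lines to detect the single-file case.

```shell
#!/bin/bash

# select_names (hypothetical helper): $1 is a newline-separated list of
# file names; prints the subset that should be extracted.
select_names() {
  list=$1
  # 1. Prefer names containing the literal string "[!]".
  picked=$(printf '%s\n' "$list" | grep -F '[!]')
  # 2. Otherwise take names like "(J).txt": one capital letter in
  #    parentheses directly before a dot.
  if [ -z "$picked" ]; then
    picked=$(printf '%s\n' "$list" | grep '([A-Z])\.')
  fi
  # 3. Otherwise, if the list holds exactly one name, take it.
  if [ -z "$picked" ] && [ -n "$list" ] &&
     [ "$(printf '%s\n' "$list" | grep -c '')" -eq 1 ]; then
    picked=$list
  fi
  # Print nothing at all when no rule matched.
  if [ -n "$picked" ]; then
    printf '%s\n' "$picked"
  fi
}

# The contents of 7z_2.7z from the test structure above.
select_names "$(printf '%s\n' '(J) [b1].txt' '(J) [b2].txt' '(J) [o1].txt' '(J).txt')"
```

    For the 7z_2.7z list this prints (J).txt, matching the output tree shown earlier: there are no [!] files, so the second rule applies.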