Tags: php, json, zip, tar, data-extraction

How to extract all nested tar.gz and zip to a directory in PHP?


I need to extract a tar.gz file in PHP. The file contains many JSON files, nested tar.gz and zip files, and subdirectories. I need to move only the JSON files to the directory ./Dataset/processing, and keep extracting the nested tar.gz and zip files to get all the JSON files out of them. Those archives can also contain nested directories.

The structure is like the following:

origin.tar.gz
 ├───sub1.tar.gz
 │   ├───sub2.tar.gz
 │   ├───├───a.json
 │   ├───├───├───├───├───├───...(unknown depth)
 │   ├───b.json
 │   ├───c.json
 ├───sub3.zip
 │   ├───sub4.tar.gz
 │   ├───├───d.json
 │   ├───├───├───├───├───├───...(unknown depth)
 │   ├───e.json
 │   ├───f.json
 ├───subdirectory
 │   ├───g.json
 ├───h.json
 ├───i.json
 |   ..........
 |   ..........
 |   ..........
 |   many of them

Once everything is extracted, ./Dataset/processing should look like this:

Dataset/processing
 ├───a.json
 ├───b.json
 ├───c.json
 ├───d.json
 ├───e.json
 ├───f.json
 ├───g.json
 ├───h.json
 ├───i.json
 |   ..........
 |   ..........
 |   ..........
 |   many of them

I know how to extract a tar.gz using PharData in PHP, but that only works one level deep. I was wondering whether some kind of recursion could make it work at any depth.

$phar = new PharData('origin.tar.gz');
$phar->extractTo('/full/path'); // extract all files in the tar.gz

I have refined my code a bit and tried the following. It handles nested archives, but fails when there is a directory (or nested directories) that also contains JSON files. Can someone help me extract those as well?

<?php

$path = './';

// Extraction of compressed file
function fun($path) {    
    $array = scandir($path); 
    for ($i = 0; $i < count($array); $i++) {
        if($i == 0 OR $i == 1){continue;}
        else {
            $item = $array[$i];
            $fileExt = explode('.', $item);

            // Getting the extension of the file
            $fileActualExt = strtolower(end($fileExt));
            if(($fileActualExt == 'gz') or ($fileActualExt == 'zip')){
                $pathnew = $path.$item; // Dataset ./data1.tar.gz
                $phar = new PharData($pathnew);
                // Moving the files
                $phar->extractTo($path);
                // Del the files
                unlink($pathnew);
                $i=0;
            }
        }
        $array = scandir($path);


    }
}
fun($path);

// Move only the json to ./dataset(I will add it later)
?>

Thanks in advance.


Solution

  • I solved it after doing a bit of research.

    There are 3 functions:

    • recursiveScanProtected(): Extracts all the compressed files, including nested ones.
    • scanJSON(): Scans for JSON files and moves them to the processing folder.
    • delete_files(): Removes everything in the root directory except index.php and the processing folder that holds the JSON files.
    <?php
    
    // Root directory
    $path = './';
    
    // Directory where I want to extract the JSON files
    $path_json = $path.'processing/';
    
    
    // Function to extract all the compressed files
    function recursiveScanProtected($dir) {
        if($dir != '') {
            $dir = rtrim($dir, '/');
            $tree = glob($dir . '/*');
            if (is_array($tree)) {
                for ($i = 0; $i < count($tree); $i++) {
                    $file = $tree[$i];
                    if (is_dir($file)) {
                        recursiveScanProtected($file); // Recursive call if directory
                    } elseif (is_file($file)) {
    
                        // Getting the extension of the file
                        $fileActualExt = strtolower(pathinfo($file, PATHINFO_EXTENSION));
                        // Check if the file is a zip or a tar.gz
                        if(($fileActualExt == 'gz') or ($fileActualExt == 'zip')){
    
                            // Extract into a new directory named after the glob index - Overwriting true
                            $target = $dir . '/' . $i;
                            $phar = new PharData($file);
                            $phar->extractTo($target, null, true);
    
                            // Del the compressed file
                            unlink($file);
    
                            recursiveScanProtected($target); // Recursive call
                        }
    
                    }
                }
            }
        }
    }
    recursiveScanProtected($path);
    
    
    // Move the JSON files to processing
    function scanJSON($dir, $path_json) {
        if($dir != '') {
            $tree = glob(rtrim($dir, '/') . '/*');
            if (is_array($tree)) {
                foreach($tree as $file) {
                    if (is_dir($file)) {
                        // Do not scan processing recursively, but all other directories should be scanned
                        if($file != './processing'){
                            scanJSON($file, $path_json);
                        }
                    } elseif (is_file($file)) {
    
                        $ext = pathinfo($file);
    
                        // Files without an extension have no 'extension' key
                        if(strtolower($ext['extension'] ?? '') == 'json'){
                            // Move the JSON files to processing
                            rename($file, $path_json.$ext['basename']);
                        }
                    }
                }
            }
        }
    }
    
    scanJSON($path, $path_json);
    
    /* 
     * Delete function that deals with directories recursively.
     * It deletes everything except ./processing and index.php
     */
    function delete_files($target) {
    
        if(is_dir($target)){
            $files = glob( $target . '*', GLOB_MARK ); //GLOB_MARK adds a slash to directories returned
            foreach( $files as $file ){
                if($file == './processing/' || $file == './index.php'){
                    continue;
                } else{
                    delete_files( $file );
                }
            }
            if($target != './'){
                rmdir( $target );
            }
        } elseif(is_file($target)) {
            unlink( $target );  
        }
    }
    
    delete_files($path);
    ?>
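
For comparison, the same idea can probably be written more compactly with PHP's SPL directory iterators, so that extraction and JSON collection happen in one pass. This is only a sketch under a few assumptions: the helper names (isArchive, listFiles, collectJson) and the extract_N working directories are made up for illustration, only .tar.gz, .tgz, .tar and .zip names are treated as archives (a plain .gz that is not a tarball is skipped), and JSON files that share a base name will overwrite each other in the target directory.

```php
<?php
// Hypothetical helpers; only PharData, the SPL iterators and the
// filesystem functions used here are built into PHP.

/** True when the file name looks like an archive PharData can open. */
function isArchive(string $file): bool
{
    return (bool) preg_match('/\.(tar\.gz|tgz|tar|zip)$/i', $file);
}

/** Returns the path of every regular file below $dir, at any depth. */
function listFiles(string $dir): array
{
    $files = [];
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $info) {
        if ($info->isFile()) {
            $files[] = $info->getPathname();
        }
    }
    return $files;
}

/**
 * Extracts $archive into a fresh directory under $workDir, recurses into
 * any archives found inside, and moves every .json file into $targetDir.
 */
function collectJson(string $archive, string $workDir, string $targetDir): void
{
    static $counter = 0;
    $dest = rtrim($workDir, '/') . '/extract_' . (++$counter);
    mkdir($dest, 0777, true);
    (new PharData($archive))->extractTo($dest, null, true);

    // The list is built before any nested extraction; nested archives are
    // unpacked into their own extract_N directory, so it stays valid.
    foreach (listFiles($dest) as $file) {
        if (isArchive($file)) {
            collectJson($file, $workDir, $targetDir);
            unlink($file);
        } elseif (strtolower(pathinfo($file, PATHINFO_EXTENSION)) === 'json') {
            rename($file, rtrim($targetDir, '/') . '/' . basename($file));
        }
    }
}
```

With this variant, a call such as collectJson('origin.tar.gz', './work', './Dataset/processing') would leave the original archive untouched; the target directory has to exist beforehand, and the working directory can simply be deleted afterwards instead of cleaning up the root directory.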