
Rust help extracting ZIP contents


I have a complex file-reading problem. I need to read a DOCX file with an embedded file system, extract a ZIP file from it, and walk the ZIP file's internal directory to extract the files I actually need. I have already written this successfully in Java, so I know it can be done, but I want to do it in Rust.

Currently, I can read the DOCX file and iterate through the OLE10 objects to locate the file I need. The OLE10 file (which is actually the ZIP) has a weird 256-byte extraction command header, which I seek past. If I read the rest of the file stream and write it to the filesystem, it writes out as a ZIP, and I can open it with 7-Zip and see all the contents.
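As a hedged aside on the fixed 256-byte skip: if the header length ever differs between documents, one option is to search for the first ZIP local file header signature ("50 4B 03 04") instead of seeking to a hard-coded offset. The sketch below assumes the embedded payload really does start with a local file header; `find_zip_start` and `LOCAL_HEADER_SIG` are hypothetical names, not part of any crate mentioned here.

```rust
/// Signature that starts every ZIP local file header ("PK\x03\x04").
const LOCAL_HEADER_SIG: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

/// Return the offset of the first local file header signature in `buf`,
/// or None if the signature does not occur.
/// Assumption: the ZIP payload begins at that signature.
fn find_zip_start(buf: &[u8]) -> Option<usize> {
    buf.windows(4)
        .position(|window| window == LOCAL_HEADER_SIG.as_slice())
}
```

With the whole stream read into a `Vec<u8>`, the ZIP bytes would then be `&buffer[offset..]` for the returned offset.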

The problem is that no matter which Rust ZIP crate I use (zip, zip_extract, zip_extensions, rc-zip), I cannot extract the ZIP contents. I keep running into the error "cannot find end of central directory". I have iterated through the file, and the EOCD signature "50 4B 05 06" is actually there. If I end the stream at the EOCD, I get an "early end of file exit" error. The file is >9M, and I am wondering if this might be the issue.

Does anyone have any ideas on how to use Rust to extract the ZIP's contents into a buffer or onto the filesystem?

Here's the code that just won't extract:

use std::io::{Cursor, Read, Seek};
use std::path::{Path, PathBuf};

let docx_path = Path::new(docx_filename);

// Capture the files from the embedded CFB filesystem
let mut comp_file = cfb::open(docx_path).unwrap();
let objpool_entries_vec: Vec<_> = comp_file                                               // Collect the entries of /ObjectPool
    .read_storage(Path::new("/ObjectPool"))
    .unwrap()
    .map(|subdir| comp_file.read_storage(subdir.path().to_owned())
        .unwrap()
        .filter(|path| path.name().contains("Ole10Native"))
        .next()
    )
    .filter(|entry| entry.is_some())                      // Filter entries with data
    .map(|entry| entry.unwrap())                               // Unwrap those entries with data
    .collect();

let mut ole10_stream = comp_file.open_stream(objpool_entries_vec[5].path())  // Create stream of the OLE10 file
    .unwrap();
ole10_stream.seek(std::io::SeekFrom::Start(256)).unwrap();    // skip the 256-byte header

let mut ole_buffer = Vec::new();
ole10_stream.read_to_end(&mut ole_buffer).unwrap();           // don't silently ignore a failed read

let zip_cursor = Cursor::new(ole_buffer);

zip_extract::extract(
    zip_cursor,
    &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\"),
    false)
    .unwrap();

When I run the following, it writes the ZIP out to the directory, and I can extract it with 7-Zip. But it still panics when trying to extract to the filesystem.

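use std::fs::{File, OpenOptions};
use std::io::{Cursor, Read, Seek, Write};
use std::path::{Path, PathBuf};

let docx_path = Path::new(docx_filename);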

// Capture the files from the embedded CFB filesystem
let mut comp_file = cfb::open(docx_path).unwrap();
let objpool_entries_vec: Vec<_> = comp_file                                               // Collect the entries of /ObjectPool
    .read_storage(Path::new("/ObjectPool"))
    .unwrap()
    .map(|subdir| comp_file.read_storage(subdir.path().to_owned())
        .unwrap()
        .filter(|path| path.name().contains("Ole10Native"))
        .next()
    )
    .filter(|entry| entry.is_some())                      // Filter entries with data
    .map(|entry| entry.unwrap())                               // Unwrap those entries with data
    .collect();

let mut ole10_stream = comp_file.open_stream(objpool_entries_vec[5].path())  // Create stream of the OLE10 file
    .unwrap();
ole10_stream.seek(std::io::SeekFrom::Start(256)).unwrap();    // skip the 256-byte header

let mut ole_buffer = Vec::new();
ole10_stream.read_to_end(&mut ole_buffer).unwrap();           // don't silently ignore a failed read

let zip_cursor = Cursor::new(ole_buffer);    

let mut zip_file = OpenOptions::new()
    .write(true)
    .create(true)
    .open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;
zip_file.write_all(zip_cursor.get_ref())?;
zip_file.flush()?;

let mut zip_file = File::open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;

let zip_archive = zip::ZipArchive::new(&zip_file)?;

zip_extract::extract(
    zip_file,
    &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\"),
    false)
    .unwrap();

Solution

  • AWESOME!! I figured it out!! I needed to loop through the file until the 4-byte EOCD signature "0x50 0x4B 0x05 0x06", then continue for 18 more bytes, which provide:

    • "current disk#" (2-bytes),
    • "CD disk#" (2-bytes),
    • "# of CD entries on this disk" (2-bytes),
    • "total entries of CD" (2-bytes),
    • "CD size" (4-bytes),
    • "CD start offset" (4-bytes),
    • "# of bytes for following comments" (2-bytes),
    • comments (# character bytes = previous field)
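The field layout above can be sketched as a small parser. This is only an illustration of the 22-byte fixed part of the record (4-byte signature plus the 18 bytes listed), with every multi-byte field read as little-endian; `Eocd` and `parse_eocd` are hypothetical names and are not taken from the zip crate.

```rust
/// Fixed-size part of a ZIP end-of-central-directory (EOCD) record.
#[derive(Debug, PartialEq)]
struct Eocd {
    disk_number: u16,     // "current disk#"
    cd_start_disk: u16,   // "CD disk#"
    entries_on_disk: u16, // "# of CD entries on this disk"
    total_entries: u16,   // "total entries of CD"
    cd_size: u32,         // "CD size"
    cd_offset: u32,       // "CD start offset"
    comment_len: u16,     // "# of bytes for following comments"
}

const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];

/// Parse an EOCD record that starts at the beginning of `bytes`.
/// Returns None if the signature is missing or fewer than 22 bytes are present.
fn parse_eocd(bytes: &[u8]) -> Option<Eocd> {
    if bytes.len() < 22 || !bytes.starts_with(&EOCD_SIG) {
        return None;
    }
    // All EOCD fields are stored little-endian.
    let u16_at = |i: usize| u16::from_le_bytes([bytes[i], bytes[i + 1]]);
    let u32_at =
        |i: usize| u32::from_le_bytes([bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3]]);
    Some(Eocd {
        disk_number: u16_at(4),
        cd_start_disk: u16_at(6),
        entries_on_disk: u16_at(8),
        total_entries: u16_at(10),
        cd_size: u32_at(12),
        cd_offset: u32_at(16),
        comment_len: u16_at(20),
    })
}
```

Printing the parsed record with `{:?}` is a quick way to check whether the comment length is really zero before cutting the file off after the record.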

    I excluded any comments, so my comment-length field is 0x0000 and the comment itself is empty. Here's the code that copies the stream through the end of the EOCD record so I could extract it with zip_extensions::read::zip_extract():

    let mut zip_file = OpenOptions::new()                     // Create the output file
        .write(true)
        .create(true)
        .truncate(true)                                       // drop leftover bytes from an earlier, longer test.zip
        .open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;
    let mut ole_iter = ole10_stream.bytes();

    // Copy the ZIP stream until the output ends with the EOCD signature
    const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
    let mut output_buffer = Vec::new();
    while let Some(byte) = ole_iter.next()
    {
        output_buffer.push(byte?);
        if output_buffer.ends_with(&EOCD_SIG)                 // look for PK EOCD
        {
            // copy the 18 bytes after the EOCD signature, then stop before any comment or trailing data
            for byte in ole_iter.by_ref().take(18)
            {
                output_buffer.push(byte?);
            }
            break;
        }
    }
    zip_file.write_all(&output_buffer)?;
    zip_file.flush()?;

    // Sanity check: the trimmed file should now open as a ZIP archive
    let _zip = zip::read::ZipArchive::new(
        File::open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?
    )
        .unwrap();

    zip_extensions::read::zip_extract(
        &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip"),
        &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files"),
    )
        .unwrap();
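For anyone who would rather skip the temporary file, the same trimming can probably be done in memory. The sketch below is a variation on the approach above, not the code I actually ran: it scans backwards for the last EOCD signature (the first match going forward could in principle be a false hit inside compressed data) and only accepts a candidate whose "CD start offset" plus "CD size" lands exactly on the record. That check assumes the buffer starts at the first byte of the ZIP (header already skipped) and that the archive is not ZIP64. `trim_to_eocd` is a hypothetical helper name.

```rust
const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
const EOCD_FIXED_LEN: usize = 22; // 4-byte signature + 18 bytes of fields

/// Return the prefix of `buf` that ends right after the EOCD record
/// (including its comment, if any), or None if no plausible record is found.
/// Assumptions: `buf` starts at the first byte of the ZIP; no ZIP64.
fn trim_to_eocd(buf: &[u8]) -> Option<&[u8]> {
    if buf.len() < EOCD_FIXED_LEN {
        return None;
    }
    // Scan backwards so the last EOCD candidate in the buffer is tried first.
    for pos in (0..=buf.len() - EOCD_FIXED_LEN).rev() {
        if !buf[pos..].starts_with(&EOCD_SIG) {
            continue;
        }
        let cd_size =
            u32::from_le_bytes([buf[pos + 12], buf[pos + 13], buf[pos + 14], buf[pos + 15]])
                as usize;
        let cd_offset =
            u32::from_le_bytes([buf[pos + 16], buf[pos + 17], buf[pos + 18], buf[pos + 19]])
                as usize;
        let comment_len = u16::from_le_bytes([buf[pos + 20], buf[pos + 21]]) as usize;
        let end = pos + EOCD_FIXED_LEN + comment_len;
        // The central directory should end exactly where the EOCD record begins.
        if cd_offset.checked_add(cd_size) == Some(pos) && end <= buf.len() {
            return Some(&buf[..end]);
        }
    }
    None
}
```

The returned slice can then be wrapped in `std::io::Cursor::new(...)` and handed to `zip::ZipArchive::new`, which accepts any reader that implements `Read + Seek`.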