I have a complex file-read problem. I need to read a DOCX file that contains an embedded file system, extract a ZIP file from it, and walk the ZIP file's internal directory to extract the files I actually need. I have already written this successfully in Java, so I know it can be done, but I want to do it in Rust.
Currently, I can read the DOCX file and iterate through the OLE10 objects to locate the file I need. The OLE10 file (which is actually the ZIP) has a weird 256-byte extraction command header, which I seek past. If I read the rest of the stream and write it to the filesystem, it is written out as a ZIP file, and I can open it with 7-Zip and see all the contents.
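As a rough sketch (std only), the start of the ZIP data could also be located by searching the stream for the local file header signature "50 4B 03 04" that normally begins a non-empty ZIP archive, instead of hard-coding the 256-byte skip. `find_zip_start` is a hypothetical helper name, and the sketch assumes the signature does not happen to occur inside the header bytes that precede the ZIP.

```rust
/// Signature that starts every ZIP local file header ("PK\x03\x04").
const LOCAL_FILE_HEADER_SIG: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

/// Returns the offset of the first local file header signature, if any.
/// Hypothetical helper for illustration; not part of any ZIP crate.
fn find_zip_start(stream_bytes: &[u8]) -> Option<usize> {
    stream_bytes
        .windows(LOCAL_FILE_HEADER_SIG.len())
        .position(|window| window == LOCAL_FILE_HEADER_SIG)
}
```

If this returns Some(offset), the stream could be sliced or seeked to that offset before handing it to a ZIP reader; None would mean no local file header was found at all.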
The problem is that no matter which Rust ZIP crate I use (zip, zip_extract, zip_extensions, rc-zip), I cannot extract the ZIP contents. I keep running into the error "cannot find end of central directory". I have iterated through the file, and the EOCD signature "50 4B 05 06" is actually there. If I end the stream at the EOCD, I get an "early end of file exit" error instead. The file is larger than 9 MB, and I am wondering if this might be the issue.
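One quick way to check whether extra bytes follow the EOCD record is to measure how far the last signature sits from the end of the buffer. This is a minimal sketch, assuming the archive has no comment (so the record is exactly 22 bytes) and that the last "50 4B 05 06" in the buffer really is the EOCD. `bytes_after_last_eocd` is a hypothetical helper, not something from the ZIP crates.

```rust
/// Signature that opens the end-of-central-directory (EOCD) record.
const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
/// Size of an EOCD record without a comment: 4-byte signature + 18 bytes of fields.
const EOCD_FIXED_LEN: usize = 22;

/// Returns how many bytes follow a comment-free EOCD record that starts at
/// the last signature in `data`. Returns None if there is no signature or
/// if fewer than 22 bytes remain from the signature onwards.
fn bytes_after_last_eocd(data: &[u8]) -> Option<usize> {
    let start = data
        .windows(EOCD_SIG.len())
        .rposition(|window| window == EOCD_SIG)?;
    data.len().checked_sub(start + EOCD_FIXED_LEN)
}
```

A result of Some(0) means the buffer ends exactly at the EOCD record. Anything larger means there is trailing data after the archive (or a comment), which may be what the ZIP crates are tripping over.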
Does anyone have any ideas on how to use Rust to extract the ZIP directory and attach it to a buffer or the filesystem?
Here's the code that just won't extract:
    let docx_path = Path::new(docx_filename);

    // Capture the files from the embedded CFB filesystem
    let mut comp_file = cfb::open(docx_path).unwrap();

    // Collect the first Ole10Native entry of each storage under /ObjectPool
    let objpool_entries_vec: Vec<_> = comp_file
        .read_storage(Path::new("/ObjectPool"))
        .unwrap()
        .filter_map(|subdir| {
            comp_file
                .read_storage(subdir.path().to_owned())
                .unwrap()
                .find(|entry| entry.name().contains("Ole10Native"))
        })
        .collect();

    // Create a stream of the OLE10 file
    let mut ole10_stream = comp_file
        .open_stream(objpool_entries_vec[5].path())
        .unwrap();
    // Skip the 256-byte header
    ole10_stream.seek(std::io::SeekFrom::Start(256)).unwrap();

    // Read the rest of the stream into memory and try to extract it
    let mut ole_buffer = Vec::new();
    ole10_stream.read_to_end(&mut ole_buffer).unwrap();
    let zip_cursor = Cursor::new(ole_buffer);
    zip_extract::extract(
        zip_cursor,
        &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\"),
        false,
    )
    .unwrap();
When I run the following, it writes the ZIP out to the directory, and I can extract it with 7-Zip. However, it still panics when it tries to extract the archive to the filesystem.
    let docx_path = Path::new(docx_filename);

    // Capture the files from the embedded CFB filesystem
    let mut comp_file = cfb::open(docx_path).unwrap();

    // Collect the first Ole10Native entry of each storage under /ObjectPool
    let objpool_entries_vec: Vec<_> = comp_file
        .read_storage(Path::new("/ObjectPool"))
        .unwrap()
        .filter_map(|subdir| {
            comp_file
                .read_storage(subdir.path().to_owned())
                .unwrap()
                .find(|entry| entry.name().contains("Ole10Native"))
        })
        .collect();

    // Create a stream of the OLE10 file
    let mut ole10_stream = comp_file
        .open_stream(objpool_entries_vec[5].path())
        .unwrap();
    // Skip the 256-byte header
    ole10_stream.seek(std::io::SeekFrom::Start(256))?;

    let mut ole_buffer = Vec::new();
    ole10_stream.read_to_end(&mut ole_buffer)?;

    // Write the raw stream contents out as test.zip
    let mut zip_file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true) // drop stale bytes if test.zip already exists
        .open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;
    zip_file.write_all(&ole_buffer)?;
    zip_file.flush()?;

    // Reopen the file for reading and try to extract it
    let zip_file = File::open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;
    let _zip_archive = zip::ZipArchive::new(&zip_file)?;
    zip_extract::extract(
        zip_file,
        &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\"),
        false,
    )
    .unwrap();
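AWESOME!! I figured it out!! I needed to loop through the file until the 4-byte EOCD signature "0x50 0x4B 0x05 0x06", then continue for 18 more bytes, which provide the fixed-size fields of the EOCD record:
- number of this disk (2 bytes)
- disk where the central directory starts (2 bytes)
- number of central directory records on this disk (2 bytes)
- total number of central directory records (2 bytes)
- size of the central directory (4 bytes)
- offset of the start of the central directory (4 bytes)
- comment length (2 bytes)
- comment (variable length)
I excluded any comments, so my last two fields are 0x00 and 'blank'. Here's the code that writes the ZIP out up to the end of the EOCD record so I could extract it with zip_extensions::read::zip_extract():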
    // Create the output file
    let mut zip_file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true) // drop stale bytes if test.zip already exists
        .open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;

    // Copy the ZIP data byte by byte and stop at the end of the EOCD record,
    // so that nothing after it (including any comment) is written
    let mut ole_iter = ole10_stream.bytes();
    let mut output_buffer = Vec::new();
    while let Some(byte) = ole_iter.next() {
        output_buffer.push(byte?);
        // Look for the PK EOCD signature
        if output_buffer.ends_with(&[0x50, 0x4B, 0x05, 0x06]) {
            // Read the next 18 bytes after the EOCD signature
            for _ in 0..18 {
                match ole_iter.next() {
                    Some(byte) => output_buffer.push(byte?),
                    None => break,
                }
            }
            break;
        }
    }
    zip_file.write_all(&output_buffer)?;
    zip_file.flush()?;

    // Check that the zip crate can now open the archive
    let _zip = zip::read::ZipArchive::new(
        File::open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?,
    )
    .unwrap();

    // Extract the archive contents next to test.zip
    zip_extensions::read::zip_extract(
        &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip"),
        &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files"),
    )?;
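The same trimming can probably also be done in memory, without writing test.zip first. The sketch below searches backwards for the last EOCD signature, decodes the 18 bytes of fixed-size fields (little-endian), and truncates the buffer right after the record and its comment. Note that it differs from the loop above: it uses the last signature rather than the first, and it keeps the comment instead of dropping it. `Eocd` and `trim_to_eocd` are hypothetical names for illustration; the code uses only std and assumes the whole stream (after the header) is already in a Vec<u8>.

```rust
/// Signature that opens the end-of-central-directory (EOCD) record.
const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
/// 4-byte signature plus 18 bytes of fixed-size fields.
const EOCD_FIXED_LEN: usize = 22;

/// Fixed-size EOCD fields, in the order they appear after the signature.
#[derive(Debug, PartialEq)]
struct Eocd {
    disk_number: u16,
    central_dir_disk: u16,
    entries_on_disk: u16,
    total_entries: u16,
    central_dir_size: u32,
    central_dir_offset: u32,
    comment_len: u16,
}

/// Reads a little-endian u16 at byte offset `at`.
fn u16_at(buf: &[u8], at: usize) -> u16 {
    u16::from_le_bytes([buf[at], buf[at + 1]])
}

/// Reads a little-endian u32 at byte offset `at`.
fn u32_at(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Parses the record at the last EOCD signature in `buf` and truncates
/// `buf` right after the record and its comment. Returns None (leaving
/// `buf` untouched) if there is no signature or the record is incomplete.
fn trim_to_eocd(buf: &mut Vec<u8>) -> Option<Eocd> {
    let bytes = buf.as_slice();
    let start = bytes
        .windows(EOCD_SIG.len())
        .rposition(|window| window == EOCD_SIG)?;
    if bytes.len() < start + EOCD_FIXED_LEN {
        return None;
    }
    let eocd = Eocd {
        disk_number: u16_at(bytes, start + 4),
        central_dir_disk: u16_at(bytes, start + 6),
        entries_on_disk: u16_at(bytes, start + 8),
        total_entries: u16_at(bytes, start + 10),
        central_dir_size: u32_at(bytes, start + 12),
        central_dir_offset: u32_at(bytes, start + 16),
        comment_len: u16_at(bytes, start + 20),
    };
    let end = start + EOCD_FIXED_LEN + usize::from(eocd.comment_len);
    if end > bytes.len() {
        return None;
    }
    buf.truncate(end);
    Some(eocd)
}
```

The trimmed buffer could then be wrapped in a std::io::Cursor and handed to zip::ZipArchive::new, which accepts any Read + Seek source.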