I need to parse contents of a epub file and I am trying to see what would be the most efficient way to do it. The epub file may contain images, lot of text and occasionally videos too. Should I go for a FileInputStream or a FileReader?
As epub uses a ZIP archive structure I would propose to handle it as such. Find a small snippet below which list the content of an epub file.
Map<String, String> env = new HashMap<>();
env.put("create", "true");
Path path = Paths.get("foobar.epub");
URI uri = URI.create("jar:" + path.toUri());
FileSystem zipFs = FileSystems.newFileSystem(uri, env);
Path root = zipFs.getPath("/");
Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
@Override
public FileVisitResult visitFile(Path file,
BasicFileAttributes attrs) throws IOException {
print(file);
return FileVisitResult.CONTINUE;
}
@Override
public FileVisitResult preVisitDirectory(Path dir,
BasicFileAttributes attrs) throws IOException {
print(dir);
return FileVisitResult.CONTINUE;
}
private void print(Path file) throws IOException {
Date lastModifiedTime = new Date(Files.getLastModifiedTime(file).toMillis());
System.out.printf("%td.%<tm.%<tY %<tH:%<tM:%<tS %9d %s\n",
lastModifiedTime, Files.size(file), file);
}
});
sample output
01.01.1970 00:59:59 0 /META-INF/
11.02.2015 16:33:44 244 /META-INF/container.xml
11.02.2015 16:33:44 3437 /logo.jpg
...
edit If you only want to extract files based on their names you could do it like shown in this snippet for the visitFile(...)
method.
public FileVisitResult visitFile(Path file,
BasicFileAttributes attrs) throws IOException {
// if the filename inside the epub end with "*logo.jpg"
if (file.endsWith("logo.jpg")) {
// extract the file in directory /tmp/
Files.copy(file, Paths.get("/tmp/",
file.getFileName().toString()));
}
return FileVisitResult.CONTINUE;
}
Depending on how you want to process the files inside the epub you might also have a look on the ZipInputStream
.
try (ZipInputStream in = new ZipInputStream(new FileInputStream("foobar.epub"))) {
for (ZipEntry entry = in.getNextEntry(); entry != null;
entry = in.getNextEntry()) {
System.out.printf("%td.%<tm.%<tY %<tH:%<tM:%<tS %9d %s\n",
new Date(entry.getTime()), entry.getSize(), entry.getName());
if (entry.getName().endsWith("logo.jpg")) {
try (FileOutputStream out = new FileOutputStream(entry.getName())) {
// process the file
}
}
}
}
sample output
11.02.2013 16:33:44 244 META-INF/container.xml
11.02.2013 16:33:44 3437 logo.jpg