Search code examples
javaniommap

Java memory mapped binary search


I'm currently trying to find the fastest way of searching a 2GB binary file within java. This is different to my normal problems, as this file is already memory-mapped into the Linux file system using mmap.

The file is question is a binary file, and I need to search it for a fixed four byte string; AXL0

Normally, on smaller files, I'd just buffer it, convert it to a string, and then regex it. However, as this file is already memory-mapped, and is quite large, the idea of re-buffering it seems wrong, and also converting it into a 2GB string seems even more wrongerer...

After some reading, I've come across the Java NIO packages along with FileChannels and MappedByteBuffers, but I'm not entirely sure how to set them up.

I just need to scan the file, from zero to the last byte in the file and locate each instance of the four byte string.

If anyone could offer some advice or input, I'd greatly appreciate it.

Thanks.


Solution

  • Looking at the task abstractly, you can't do anything better than a linear search.

    From there follows it won't likely matter much which API you use to actually perform the search, for simplicity I would simply go with a buffered InputStream, which can be implemented agnostic of the actual data source and has no inherent limit preventing it from working on files larger than 2GB.

    As long as you chose a reasonable buffer size (read: not too tiny) you should get reasonable performance (as in close to the actual I/O speed limit, except maybe for an SSD because your scan may take longer than the actual I/O in that case).

    Edit: Following KISS you get a few lines of code that should do just fine

    public class ScanForByteCombo {
    
        public static List<Long> scanFor(InputStream is, int needle) throws IOException {
            List<Long> foundOffsets = new ArrayList<>();
            InputStream bs = new BufferedInputStream(is, 0x10000);
            int data = 0;
            int b;
            long offset = 0;
            while ((b = bs.read()) != -1) {
                data = (data << 8) | b;
                if (data == needle) 
                    foundOffsets.add(offset);
                ++offset;
            }
            return foundOffsets;
        }
    
        public static void main(String[] argv) {
    
            int needle = ('A' << 24) | ('X' << 16) | ('F' << 8) | '0';
    
            long start = System.currentTimeMillis();
            try (InputStream is = new FileInputStream("your file")) {
                List<Long> found = scanFor(is, needle);
                System.out.println(found);
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println("scan took " + (System.currentTimeMillis() - start) + "ms. Acceptable?");
        }
    
    }
    

    While it looks mighty inefficient you will probably have to go to great lengths to actually improve performance by a noteworthy amount.