Search code examples
javaiosetnio

Storing a very large set of numbers in Java


I am trying to store a set of numbers that range from 0 to ~60 billion, where the set starts out empty and gradually becomes denser until it contains every number in the range. The set does not have to be capable of removing numbers. Currently my approach is to represent the set as a very long boolean array and store that array in a text file. I have made a class for this, and have tested both RandomAccessFile and FileChannel with the range of the numbers restricted from 0 to 2 billion, but in both cases the class is much slower at adding and querying numbers than using a regular boolean array. Here is the current state of my class:

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.*;
public class FileSet {
    private static final int BLOCK=10_000_000;
    private final long U;
    private final String fname;
    private final FileChannel file;
    public FileSet(long u, String fn) throws IOException {
        U=u;
        fname=fn;
        BufferedOutputStream out=new BufferedOutputStream(new FileOutputStream(fname));
        long n=u/8+1;
        for (long rep=0; rep<n/BLOCK; rep++) out.write(new byte[BLOCK]);
        out.write(new byte[(int)(n%BLOCK)]);
        out.close();
        file=new RandomAccessFile(fn,"rw").getChannel();
    }
    public void add(long v) throws IOException {
        if (v<0||v>=U) throw new RuntimeException(v+" out of range [0,"+U+")");
        file.position(v/8);
        ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
        file.position(v/8);
        file.write(ByteBuffer.wrap(new byte[] {(byte)(b.get(0)|(1<<(v%8)))}));
    }
    public boolean has(long v) throws IOException {
        if (v<0||v>=U) return false;
        file.position(v/8);
        ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
        return ((b.get(0)>>(v%8))&1)!=0;
    }
    public static void main(String[] args) throws IOException {
        long U=2000_000_000;
        SplittableRandom rnd=new SplittableRandom(1);
        List<long[]> actions=new ArrayList<>();
        for (int i=0; i<1000000; i++) actions.add(new long[] {rnd.nextInt(2),rnd.nextLong(U)});

        StringBuilder ret=new StringBuilder(); {
            System.out.println("boolean[]:");
            long st=System.currentTimeMillis();
            boolean[] b=new boolean[(int)U];
            System.out.println("init time="+(System.currentTimeMillis()-st));
            st=System.currentTimeMillis();
            for (long[] act:actions)
                if (act[0]==0) b[(int)act[1]]=true;
                else ret.append(b[(int)act[1]]?"1":"0");
            System.out.println("query time="+(System.currentTimeMillis()-st));
        }

        StringBuilder ret2=new StringBuilder(); {
            System.out.println("FileSet:");
            long st=System.currentTimeMillis();
            FileSet fs=new FileSet(U,"FileSet/"+U+"div8.txt");
            System.out.println("init time="+(System.currentTimeMillis()-st));
            st=System.currentTimeMillis();
            for (long[] act:actions) {
                if (act[0]==0) fs.add(act[1]);
                else ret2.append(fs.has(act[1])?"1":"0");
            }
            System.out.println("query time="+(System.currentTimeMillis()-st));
            fs.file.close();
        }
        if (!ret.toString().equals(ret2.toString())) System.out.println("MISMATCH");
    }
}

and the output:

boolean[]:
init time=1248
query time=148
FileSet:
init time=269
query time=3014

Additionally, when increasing the range from 2 billion to 10 billion, there is a large jump in total running time for the queries, even though in theory the total running time should stay roughly constant. When I use the class by itself (since a boolean array no longer works for this big of a range), the query time goes from ~3 seconds to ~50 seconds. When I increase the range to 60 billion, the time increases to ~240 seconds. My questions are: is there a faster way of accessing and modifying very large files at arbitrary indices? and is there an entirely different approach to storing large integer sets that is faster than my current approach?


Solution

  • Turns out the simplest solution is to use a 64-bit JVM and increase Java heap space by running my Java program in the terminal with a flag like -Xmx10g. Then I can simply use an array of longs to implicitly store the entire set.