How to store an inverted index in a binary file?


I have a HashMap of terms. For each word it stores the pages the word appears on, the word's frequency on each page, and its positions on that page.

Example: word - [page number, word frequency in page, positions in page]

cat [1, 3, 1, 2, 5 ], [2, 2, 2, 5 ]
dog [2, 2, 1, 7 ]

How would I store this info in a binary file that is easy to read back?

I made the following attempt:

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);

        for(String word: invertedIndex.keySet()) {
            out.writeUTF(word);  // Write the word
            for(Entry entry: invertedIndex.get(word)) {  // Info for a page
                out.writeInt(entry.pageNum); // Write its page number
                out.writeInt(entry.wordFrequency); // Write its freq in that page

                for(int position: entry.positions) {
                    out.writeInt(position); // Write the positions
                }
            }
        }

        byte[] bytes = baos.toByteArray();

        FileOutputStream fos = new FileOutputStream(PATH);
        fos.write(bytes);
        fos.close();

I'm not sure if this is correct. Thanks in advance.

Edit: Thanks, it turns out my problem is more about how to decode this than how to encode it.


Solution

  • Is there a way to preserve this data structure?

    Yes. There are lots of ways.

    Hint: Your attempted solution is a good start.

    However, a complete solution requires a corresponding method to read the data back. When you try to write a read method that mirrors your write code, you will discover a structural problem: the file contains nothing but a flat run of int values, so there is no easy way to tell where one list of ints ends and the next begins, or where the entries for one word stop.

    There are ways to solve that. Think about it. How can you write two lists one after another so that you know where one ends and the next begins?

    Note: your use of ByteArrayOutputStream is unnecessary. You can write directly to a FileOutputStream wrapped in a BufferedOutputStream.
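
One common way to answer the question above is to write a count before each list, so the reader knows how many items to expect. The following is a minimal sketch of that idea, not the only possible format. The class name IndexCodec, the write and read methods, and the Entry class shown here are hypothetical stand-ins; the Entry fields are assumed to match the ones used in the question. In the question's example the word frequency appears to equal the number of positions, so it could double as the count, but the sketch writes an explicit count so it does not rely on that. It also writes straight to a buffered file stream instead of going through a ByteArrayOutputStream.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexCodec {
    /** Hypothetical stand-in for the question's Entry class (fields assumed). */
    public static class Entry {
        public final int pageNum;
        public final int wordFrequency;
        public final int[] positions;

        public Entry(int pageNum, int wordFrequency, int[] positions) {
            this.pageNum = pageNum;
            this.wordFrequency = wordFrequency;
            this.positions = positions;
        }
    }

    /** Writes the index, prefixing every list with its length. */
    public static void write(Map<String, List<Entry>> index, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeInt(index.size());                    // how many words follow
            for (Map.Entry<String, List<Entry>> e : index.entrySet()) {
                out.writeUTF(e.getKey());                  // the word
                List<Entry> pages = e.getValue();
                out.writeInt(pages.size());                // how many page entries follow
                for (Entry entry : pages) {
                    out.writeInt(entry.pageNum);
                    out.writeInt(entry.wordFrequency);
                    out.writeInt(entry.positions.length);  // how many positions follow
                    for (int position : entry.positions) {
                        out.writeInt(position);
                    }
                }
            }
        }
    }

    /** Reads the index back by mirroring write() field for field. */
    public static Map<String, List<Entry>> read(String path) throws IOException {
        Map<String, List<Entry>> index = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            int wordCount = in.readInt();
            for (int w = 0; w < wordCount; w++) {
                String word = in.readUTF();
                int pageCount = in.readInt();
                List<Entry> pages = new ArrayList<>(pageCount);
                for (int p = 0; p < pageCount; p++) {
                    int pageNum = in.readInt();
                    int wordFrequency = in.readInt();
                    int[] positions = new int[in.readInt()];
                    for (int i = 0; i < positions.length; i++) {
                        positions[i] = in.readInt();
                    }
                    pages.add(new Entry(pageNum, wordFrequency, positions));
                }
                index.put(word, pages);
            }
        }
        return index;
    }
}
```

Because read() consumes exactly the fields that write() produced, in the same order, decoding needs no delimiters or look-ahead. An alternative is to end each list with a sentinel value such as -1, which works only if that value can never occur as a real page number or position.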