I learned about Huffman coding and tried to apply it. I made a very basic text editor that can only open and save files, and wrote a decorator that compresses the text with Huffman coding before saving.
There was a bug that I couldn't find, and after a lot of debugging I figured out that the compressed text may contain the character �. For example, the text ',-.:BCINSabcdefghiklmnoprstuvwy gets compressed to 앐낧淧翵�ဌ䤺큕㈀.
I traced the bug to the saving function. When I save the compressed text, every occurrence of � is changed to ?. For example, saving 앐낧淧翵�ဌ䤺큕㈀ produces 앐낧淧翵?ဌ䤺큕㈀.
When I try to read the saved file to decompress it, I get a different string so the decompression fails.
What makes it more difficult is that the saving function works fine on its own, but not when it is used in my code. The function looks like this:
public void save() throws IOException {
FileWriter fileWriter = new FileWriter(this.filename);
fileWriter.write(this.text);
fileWriter.close();
}
It's confusing that this.text at the moment of saving is 앐낧淧翵�ဌ䤺큕㈀, yet it is saved as 앐낧淧翵?ဌ䤺큕㈀.
As I said before, the function works fine on its own, but not in my code. I couldn't do anything more than remove as much as possible from my code and put it here. If you put a breakpoint at FileEditor::save, you'll find that this.text at the moment of saving is 앐낧淧翵�ဌ䤺큕㈀ and the content of the file is 앐낧淧翵?ဌ䤺큕㈀.
Code (FileEditor is right below Main):
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.PriorityQueue;
import java.util.TreeMap;
import static pack.BitsManipulator.CHAR_SIZE_IN_BITS;
public class Main {
public static void main(String[] args) throws IOException {
String text = " ',-.:BCINSabcdefghiklmnoprstuvwy";
FileEditor fileEditor2 = new FileEditor("file.txt");
HuffmanDecorator compressor = new HuffmanDecorator(fileEditor2);
compressor.setText(text);
System.out.println(compressor.getText());
compressor.save();
}
}
class FileEditor implements BasicFileEditor {
private String filename;
private String text;
public FileEditor(String filename) throws IOException {
this.filename = filename;
File file = new File(filename);
StringBuilder builder = new StringBuilder();
if (!file.createNewFile()) {
// try-with-resources closes the reader even if reading fails.
try (FileReader reader = new FileReader(file)) {
int ch;
while ((ch = reader.read()) != -1)
builder.append((char) ch);
}
}
this.text = builder.toString();
}
@Override
public String getText() {
return text;
}
@Override
public void setText(String text) {
this.text = text;
}
@Override
public void save() throws IOException {
FileWriter fileWriter = new FileWriter(this.filename);
fileWriter.write(this.text);
fileWriter.close();
}
}
interface BasicFileEditor {
String getText();
void setText(String text);
void save() throws IOException;
}
abstract class FileEditorDecorator implements BasicFileEditor {
FileEditor fileEditor;
public FileEditorDecorator(FileEditor fileEditor) {
this.fileEditor = fileEditor;
}
@Override
public String getText() {
return fileEditor.getText();
}
@Override
public void setText(String text) {
fileEditor.setText(text);
}
@Override
public void save() throws IOException {
String oldText = getText();
setText(getModifiedText());
fileEditor.save();
setText(oldText);
}
protected abstract String getModifiedText();
}
class HuffmanDecorator extends FileEditorDecorator {
public HuffmanDecorator(FileEditor fileEditor) {
super(fileEditor);
}
@Override
protected String getModifiedText() {
HuffmanCodingCompressor compressor = new HuffmanCodingCompressor(getText());
return compressor.getCompressedText();
}
}
class HuffmanCodingCompressor {
String text;
public HuffmanCodingCompressor(String text) {
this.text = text;
}
public String getCompressedText() {
EncodingBuilder builder = new EncodingBuilder(text);
return builder.getCompressedText();
}
}
class Node implements Comparable<Node> {
public Node left;
public Node right;
public int value;
public Character character;
public Node(Node left, Node right, int value) {
this(left, right, value, null);
}
public Node(Node left, Node right, int value, Character character) {
this.left = left;
this.right = right;
this.character = character;
this.value = value;
}
@Override
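public int compareTo(Node o) {
// Integer.compare avoids the overflow that subtraction can cause for large values.
return Integer.compare(this.value, o.value);
}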
public boolean isLeafNode() {
return left == null && right == null;
}
Node getLeft() {
if (left == null)
left = new Node(null, null, 0);
return left;
}
Node getRight() {
if (right == null)
right = new Node(null, null, 0);
return right;
}
}
class EncodingBuilder {
private String text;
private Node encodingTree;
private TreeMap<Character, String> encodingTable;
public EncodingBuilder(String text) {
this.text = text;
buildEncodingTree();
buildEncodingTableFromTree(encodingTree);
}
private void buildEncodingTableFromTree(Node encodingTree) {
encodingTable = new TreeMap<>();
buildEncodingTableFromTreeHelper(encodingTree, new StringBuilder());
}
public void buildEncodingTableFromTreeHelper(Node root, StringBuilder key) {
if (root == null)
return;
if (root.isLeafNode()) {
encodingTable.put(root.character, key.toString());
} else {
key.append('0');
buildEncodingTableFromTreeHelper(root.left, key);
key.deleteCharAt(key.length() - 1);
key.append('1');
buildEncodingTableFromTreeHelper(root.right, key);
key.deleteCharAt(key.length() - 1);
}
}
public void buildEncodingTree() {
TreeMap<Character, Integer> freqArray = new TreeMap<>();
for (int i = 0; i < text.length(); i++) {
// Count one more occurrence of this character (starts at 1 if absent).
freqArray.merge(text.charAt(i), 1, Integer::sum);
}
PriorityQueue<Node> queue = new PriorityQueue<>();
for (Character c : freqArray.keySet())
queue.add(new Node(null, null, freqArray.get(c), c));
if (queue.size() == 1)
queue.add(new Node(null, null, 0, '\0'));
while (queue.size() > 1) {
Node n1 = queue.poll();
Node n2 = queue.poll();
queue.add(new Node(n1, n2, n1.value + n2.value));
}
encodingTree = queue.poll();
}
public String getCompressedTextInBits() {
StringBuilder bits = new StringBuilder();
for (int i = 0; i < text.length(); i++)
bits.append(encodingTable.get(text.charAt(i)));
return bits.toString();
}
public String getCompressedText() {
String compressedInBits = getCompressedTextInBits();
int remainder = compressedInBits.length() % CHAR_SIZE_IN_BITS;
// Pad only when needed; without the modulo, an already aligned input would get 16 extra zero bits.
int paddingNeededToBeDivisibleByCharSize = (CHAR_SIZE_IN_BITS - remainder) % CHAR_SIZE_IN_BITS;
return BitsManipulator.convertBitsToText(compressedInBits + "0".repeat(paddingNeededToBeDivisibleByCharSize));
}
}
class BitsManipulator {
public static final int CHAR_SIZE_IN_BITS = 16;
public static int bitsInStringToInt(String bits) {
int result = 0;
for (int i = 0; i < bits.length(); i++) {
result *= 2;
result += bits.charAt(i) - '0';
}
return result;
}
public static String convertBitsToText(String bits) {
if (bits.length() % CHAR_SIZE_IN_BITS != 0)
throw new NumberOfBitsNotDivisibleBySizeOfCharException();
StringBuilder result = new StringBuilder();
for (int i = 0; i < bits.length(); i += CHAR_SIZE_IN_BITS)
result.append(asciiInBitsToChar(bits.substring(i, i + CHAR_SIZE_IN_BITS)));
return result.toString();
}
public static char asciiInBitsToChar(String bits) {
return (char) bitsInStringToInt(bits);
}
public static class NumberOfBitsNotDivisibleBySizeOfCharException extends RuntimeException {
}
}
� is the Unicode replacement character, U+FFFD. If you encode it with a non-Unicode encoding, it gets converted to a regular question mark: non-Unicode encodings can't represent every Unicode character, so anything they can't encode is replaced with a question mark as a "safety" fallback. A FileWriter constructed without an explicit charset, as in your save(), uses the platform's default encoding.
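As a small illustration of that replacement (a sketch, not taken from your code; the class and method names are made up for the example), encoding U+FFFD with ISO-8859-1 and decoding it again turns it into a question mark, while UTF-8 round-trips it unchanged:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, only for illustration.
class ReplacementDemo {
    // Encodes the text with the given charset and decodes the bytes back again.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "A\uFFFDB";
        // ISO-8859-1 cannot represent U+FFFD, so the encoder substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // prints A?B
        // UTF-8 can represent U+FFFD, so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

Which charset the platform default actually is on your machine is something you would have to check; the example passes the charsets explicitly so the result does not depend on it.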
You seem to be confusing binary data with text data, which is why you are looking at the compressed data as if it were Korean text rather than binary data. You need to store (and inspect) the data as bytes, not as chars or Strings.
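One way to do that, sketched under some assumptions: pack the '0'/'1' string produced by getCompressedTextInBits() into a byte[] and write it with Files.write, so no character encoding is involved at all. The BitPacker class and its methods are hypothetical names, not part of your code, and the sketch assumes you record the number of valid bits somewhere (for example in a small header) so the padding can be ignored when reading back.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, only for illustration.
class BitPacker {
    // Packs a string of '0'/'1' characters into bytes, most significant bit first.
    // The last byte is padded with zero bits if the length is not a multiple of 8.
    static byte[] packBits(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                out[i / 8] |= (byte) (0x80 >> (i % 8));
            }
        }
        return out;
    }

    // Reads the first bitCount bits back out of the byte array as a '0'/'1' string.
    static String unpackBits(byte[] data, int bitCount) {
        StringBuilder bits = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            boolean set = (data[i / 8] & (0x80 >> (i % 8))) != 0;
            bits.append(set ? '1' : '0');
        }
        return bits.toString();
    }

    public static void main(String[] args) throws IOException {
        String bits = "1101100000000000101";
        Path file = Files.createTempFile("huffman", ".bin");
        Files.write(file, packBits(bits));       // raw bytes, no charset involved
        byte[] read = Files.readAllBytes(file);
        System.out.println(unpackBits(read, bits.length()).equals(bits)); // prints true
        Files.delete(file);
    }
}
```

Besides avoiding the question-mark substitution, this sidesteps a related problem: arbitrary 16-bit values are not always valid text (some are unpaired surrogates), so a char-based representation can lose data even with a Unicode encoding.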