This might be related to my previous question (on how to convert "för" to "för")
So I have a file that I create in my code. Right now I create it with the following code:
FileWriter fwOne = new FileWriter(wordIndexPath);
BufferedWriter wordIndex = new BufferedWriter(fwOne);
followed by a few
wordIndex.write(wordBuilder.toString()); //that's a StringBuilder
ending (after a while-loop) with a
wordIndex.close();
Now the problem is that this file later becomes huge, and I want (need) to jump around in it without reading through the entire file. The seek(long pos) method of RandomAccessFile lets me do this.
Here's my problem: The characters in the file I've created seem to be encoded in UTF-8, and the only information I have when I seek is the character position I want to jump to. seek(long pos), on the other hand, jumps in bytes, so I don't end up in the right place, since a UTF-8 character can be more than one byte.
Here's my question: Can I, when I write the file, write it in ISO-8859-15 instead (where a character is a byte)? That way seek(long pos) will get me to the right position. Or should I instead try to use an alternative to RandomAccessFile (is there an alternative where you can jump to a character position)?
First, the worrying part. FileWriter and FileReader are old utility classes that use the default platform encoding of the computer they run on. Run elsewhere, the same code can produce a different file, and it may not be able to read a file written on another machine.
ISO-8859-15 is a single-byte encoding. But Java holds text in Unicode, so that it can combine all scripts, and a char is a UTF-16 code unit. In general a char index will not be a byte index, but in your case it probably works. Keep in mind, though, that a line break might be one char/byte (\n) or two (\r\n), depending on the platform.
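To make the difference between char indexes and byte indexes concrete, here is a small sketch that counts the bytes of "för" in both encodings. The class name EncodingLengths and the helper byteLength are made up for illustration, and the code assumes the ISO-8859-15 charset is available in your JVM (it is in common JDKs, but only a handful of charsets are strictly guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question's code.
class EncodingLengths {

    /** Returns how many bytes the text occupies in the given encoding. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "för", written with a Unicode escape so that the encoding of this
        // source file itself does not influence the result.
        String word = "f\u00f6r";
        Charset latin9 = Charset.forName("ISO-8859-15"); // assumed to be available

        System.out.println("chars:             " + word.length());                              // 3
        System.out.println("UTF-8 bytes:       " + byteLength(word, StandardCharsets.UTF_8));   // 4
        System.out.println("ISO-8859-15 bytes: " + byteLength(word, latin9));                   // 3
    }
}
```

In UTF-8 the ö takes two bytes, so every char position after it is shifted by one relative to the byte position; in ISO-8859-15 the two stay in step.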
Personally I think UTF-8 is well established, and it is easier to use:
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
string = new String(bytes, StandardCharsets.UTF_8);
That way all special quotes, the euro sign, and so on will always be available.
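If you stay with UTF-8, one possible way to keep seek(long pos) usable is to remember byte offsets instead of character positions while writing. The following is only a sketch under that assumption: the class Utf8OffsetIndex, its methods, and the one-entry-per-line layout are invented for illustration and are not taken from your code.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes UTF-8 entries and records where each one starts, in bytes.
class Utf8OffsetIndex {

    /** Writes each entry as UTF-8 followed by '\n'; returns the starting byte offset of every entry. */
    static List<Long> writeEntries(Path file, List<String> entries) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
            for (String entry : entries) {
                byte[] bytes = (entry + "\n").getBytes(StandardCharsets.UTF_8);
                offsets.add(position);     // byte offset, not char offset
                out.write(bytes);
                position += bytes.length;
            }
        }
        return offsets;
    }

    /** Seeks to a recorded byte offset and decodes one entry, up to the next '\n' or end of file. */
    static String readEntryAt(Path file, long byteOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(byteOffset);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            int b;
            // The byte 0x0A never occurs inside a multi-byte UTF-8 sequence,
            // so scanning for '\n' byte by byte is safe.
            while ((b = raf.read()) != -1 && b != '\n') {
                buffer.write(b);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("wordIndex", ".txt");
        List<Long> offsets = writeEntries(file, Arrays.asList("f\u00f6r", "\u00f6ver", "bil"));
        System.out.println(offsets);                          // [0, 5, 11]
        System.out.println(readEntryAt(file, offsets.get(1))); // över
        Files.delete(file);
    }
}
```

Reading one byte at a time from a RandomAccessFile is slow; for a huge file you would probably read a larger chunk after the seek, but the offset bookkeeping stays the same.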
At least specify the encoding:
Files.newBufferedWriter(file.toPath(), Charset.forName("ISO-8859-15"));
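Putting the single-byte variant together, a minimal sketch could look like this. Files.newBufferedWriter takes a Charset (here obtained with Charset.forName), and RandomAccessFile.seek then lines up with char positions. The class Latin9Seek and its methods are hypothetical, the code assumes ISO-8859-15 is available, and it writes "\n" explicitly rather than a platform-dependent line separator.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: one char == one byte, so char positions can be used for seek().
class Latin9Seek {

    static final Charset LATIN9 = Charset.forName("ISO-8859-15"); // assumed to be available

    /** Writes the text with an explicit single-byte encoding instead of the platform default. */
    static void writeText(Path file, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, LATIN9)) {
            writer.write(text);
        }
    }

    /** Jumps to a char position (equal to the byte position here) and reads charCount chars. */
    static String readChars(Path file, long charPosition, int charCount) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(charPosition);
            byte[] bytes = new byte[charCount];
            raf.readFully(bytes);          // throws EOFException if the file is too short
            return new String(bytes, LATIN9);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("wordIndex", ".txt");
        // Explicit "\n" keeps every line break at exactly one byte on all platforms.
        writeText(file, "f\u00f6r\n\u00f6ver\nbil\n");
        System.out.println(readChars(file, 4, 4)); // över
        Files.delete(file);
    }
}
```

One caveat with this approach: text containing characters outside ISO-8859-15 cannot be encoded, and the writer may then fail with an IOException, so this only fits if your data stays within that character set.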