I've been using Java's BufferedWriter to write parsed input to a file. When I open the file afterwards, however, there seem to be extra null characters. I tried specifying the encoding as "US-ASCII" and "UTF8", but I get the same result either way. Here's my code snippet:
Scanner fileScanner = new Scanner(original);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "US-ASCII"));
while(fileScanner.hasNextLine())
{
String next = fileScanner.nextLine();
next = next.replaceAll(".*\\x0C", ""); //remove up to ^L
out.write(next);
out.newLine();
}
out.flush();
out.close();
fileScanner.close();
Maybe the issue isn't even with the BufferedWriter?
I've narrowed it down to this code block: if I comment it out, there are no null characters in the output file, and if I instead do the equivalent regex replace in Vim (:%s/.*^L//g), the file is also null-character free.
Let me know if you need more information.
Thanks!
EDIT: A hexdump of a normal line looks like: 0000000 5349 2a41 3030 202a
But after this code is run, the hexdump looks like: 0000000 5330 2a49 4130 202a
I'm not sure why things are getting mixed up.
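One thing worth ruling out is a display artifact: hexdump's default output format groups bytes into 16-bit little-endian words, so adjacent characters appear swapped in its output even when the file is fine. A small sketch that prints the bytes one at a time in true file order (similar to `hexdump -C`; the class name is mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpBytes {
    // Render bytes in file order, two lowercase hex digits each,
    // separated by spaces.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Unlike hexdump's default 16-bit-word display, this never
        // swaps adjacent bytes, so the output reads left to right.
        System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

For example, the ASCII text "ISA*" comes out as `49 53 41 2a`, which plain hexdump would display as the word `5349 2a41`.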
EDIT: Also, even if the file doesn't match the regex at all, running it through that block of code still produces null characters.
EDIT: Here's a hexdump of the first few lines of a diff: http://pastie.org/pastes/8964701/text
The command was: diff -y testfile.hexdump expectedoutput.hexdump
The rest of the lines differ in the same way as the last two shown.
EDIT: Looking at the hexdump diff you gave, the only difference is that one file has LF line endings (0A) and the other has CRLF line endings (0D 0A). All the other data in your diff is shifted ahead to accommodate the extra byte per line.
CRLF is the default line ending on the OS you're using: BufferedWriter.newLine() writes the platform's line separator (the line.separator system property). If you want a specific line ending in your output, write the string "\n" or "\r\n" explicitly instead.
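Applied to the loop in the question, that fix looks like the sketch below: replacing out.newLine() with an explicit out.write("\n") guarantees a single 0x0A byte regardless of platform (the output filename here is a placeholder):

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class LineEndings {
    public static void main(String[] args) throws IOException {
        // newLine() writes the platform separator (line.separator),
        // which is "\r\n" on Windows. Writing "\n" ourselves forces LF.
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("out.txt"), "US-ASCII"))) {
            out.write("first line");
            out.write("\n"); // always a single 0x0a byte
            out.write("second line");
            out.write("\n");
        } // try-with-resources flushes and closes the writer
    }
}
```

A hexdump of out.txt will then show only 0A at line boundaries, never 0D 0A.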
Previously I noted that the Scanner doesn't specify a charset. It should specify the charset the input is known to be encoded in, although that isn't the source of the unexpected output here.
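For completeness, a minimal sketch of reading with an explicit charset, using the Scanner(File, String) constructor so decoding doesn't depend on the platform default (the helper name and its newline-normalizing behavior are my own choices, not the asker's code):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScannerCharset {
    // Read a whole file line by line with an explicit charset name.
    // Each line is re-terminated with "\n", mirroring how the question's
    // loop rewrites lines one at a time.
    static String readAll(File f, String charsetName) throws FileNotFoundException {
        StringBuilder sb = new StringBuilder();
        Scanner sc = new Scanner(f, charsetName);
        while (sc.hasNextLine()) {
            sb.append(sc.nextLine()).append('\n');
        }
        sc.close();
        return sb.toString();
    }
}
```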