java android encoding utf-8 chars

How to get rid of "rogue chars" in a .txt file encoded as UTF-8


My program reads from a .txt file encoded as UTF-8. I'm using UTF-8 so that I can handle the characters åäö. The problem is that when the lines are read, some "rogue" characters seem to sneak into the string, which causes problems when I try to store those lines in variables. Here's the code:

public void Läsochlista()
{
    String Content = "";
    String[] Argument = new String[50];
    int index = 0;
    Log.d("steg1", "steg1");
    try{
        InputStream inputstream = openFileInput("text.txt");
        if(inputstream != null)
        {
            Log.d("steg2", "steg2");
            //InputStreamReader inputstreamreader = new InputStreamReader(inputstream);
            //BufferedReader bufferreader = new BufferedReader(inputstreamreader);
            BufferedReader in = new BufferedReader(new InputStreamReader(inputstream, "UTF-8"));
            String reciveString = "";
            StringBuilder stringbuilder = new StringBuilder();

            while ((reciveString = in.readLine()) != null)
            {
                Argument[index] = reciveString;
                index++;
                if(index == 6)
                {
                    Log.d(Argument[0], String.valueOf((Argument[0].length())));
                    AllaPlatser.add(new Platser(Float.parseFloat(Argument[0]), Float.parseFloat(Argument[1]), Integer.parseInt(Argument[2]), Argument[3], Argument[4], Integer.parseInt(Argument[5])));
                    Log.d("En ny plats skapades", Argument[3]);
                    Arrays.fill(Argument, null);
                    index = 0;
                }
            }
            inputstream.close();
            Content = stringbuilder.toString();
        }
    }
    catch (FileNotFoundException e){
        Log.e("Filen", " Hittades inte");
    } catch (IOException e){
        Log.e("Filen", " Ej läsbar");
    }
}

Now, I'm getting the error

Invalid float: "61.193521"

where the line only contains the chars "61.193521". When I print the length of the string as read by the program, the output is "10", which is one more character than the string is supposed to contain. The question: how do I get rid of those invisible "rogue" chars, and why are they there in the first place?


Solution

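  • When you save a file as "UTF-8", your editor may be writing a byte-order mark (BOM) at the beginning of the file. Java's UTF-8 decoder does not strip it, so it shows up as the invisible character U+FEFF at the start of the first line. That would explain why the string has length 10 instead of 9 and why Float.parseFloat rejects it.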

    See if there's an option in your editor to save UTF-8 without the BOM.

    Apparently the BOM is mostly a nuisance in UTF-8 files; see: What's different between UTF-8 and UTF-8 without BOM?

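    I know you want to be able to have extended characters in your data; however, you may want to pick a different encoding like Latin-1 (ISO 8859-1), which covers å, ä and ö and has no BOM. The file would then have to be saved in that encoding and read with the same charset name.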

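    Or you can read the first three bytes from the input stream before you wrap it with the reader, and discard them if they are the UTF-8 BOM (0xEF 0xBB 0xBF). Check the bytes first, since a file saved without a BOM would otherwise lose its first three bytes of real data.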
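
The last option could look roughly like the sketch below. It assumes the file is UTF-8 with an optional BOM. The class name BomSkippingReader and the methods skipUtf8Bom and readLines are hypothetical names made up for this illustration, not part of the original code or of any library. On Android you would pass in the stream returned by openFileInput("text.txt") instead of the in-memory stream used in main.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class BomSkippingReader {

    // Returns a stream positioned after the UTF-8 BOM (EF BB BF) if one is
    // present; otherwise the bytes that were peeked at are pushed back.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until we
        // have three bytes or hit end of stream.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: keep the data
        }
        return pushback;
    }

    // Reads all lines as UTF-8, ignoring a leading BOM.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(skipUtf8Bom(in), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "\uFEFF" encodes to the three BOM bytes EF BB BF in UTF-8.
        byte[] data = "\uFEFF61.193521\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(data));
        System.out.println(lines.get(0).length());          // prints 9
        System.out.println(Float.parseFloat(lines.get(0))); // parses without error
    }
}
```

An alternative with the same effect is to leave the stream alone and drop a leading '\uFEFF' character from the first line after decoding; which one is tidier depends on where the reading code lives.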