Reading from a text file with some really weird string length results


I'm trying to read a text file full of Twitter screen names and store them in a database. Screen names can't be longer than 15 characters, so one of my checks rejects any name longer than that.

I've found something really strange going on when I try to upload AmericanExpress.

This is my text file contents:

americanexpress
AmericanExpress‎
AMERICANEXPRESS

And this is my code:

var names = new List<string>();
var badNames = new List<string>();

using (StreamReader reader = new StreamReader(file.InputStream, Encoding.UTF8))
{
    string line;
    while (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        var name = line.ToLower().Trim();

        Debug.WriteLine(line + " " + line.Length + " " + name + " " + name.Length);
        if (name.Length > 15 || string.IsNullOrWhiteSpace(name))
        {
            badNames.Add(name);
            continue;
        }

        if (names.Contains(name))
        {
            continue;
        }

        names.Add(name);
    }
}

The first americanexpress passes the length check (at most 15 characters), the second fails, and the third passes. When I debug the code and hover over name during the second loop iteration, for AmericanExpress, the debugger reports a Length of 16.

[Two debugger screenshots, not preserved]

And this is Debug output:

americanexpress 15 americanexpress 15
AmericanExpress‎ 16 americanexpress‎ 16
AMERICANEXPRESS 15 americanexpress 15

I've counted the characters in AmericanExpress at least 10 times, and I'm pretty sure it's only 15 characters.

Does anyone have any idea why Visual Studio is telling me americanexpress.Length = 16?

SOLUTION

name = Regex.Replace(name, @"[^\u0000-\u007F]", string.Empty);
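The regex above drops every character outside the ASCII range, which is only safe on the assumption that valid screen names contain nothing but ASCII characters. Here is a minimal sketch of that approach next to a narrower variant that removes only Unicode format characters (category Cf, which includes U+200E). The class and method names are hypothetical and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScreenNameCleaner
{
    // Drops everything outside U+0000..U+007F (the approach shown above).
    public static string StripNonAscii(string s) =>
        Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);

    // Narrower alternative: drops only invisible format characters (category Cf).
    public static string StripFormatChars(string s) =>
        Regex.Replace(s, @"\p{Cf}", string.Empty);

    static void Main()
    {
        string raw = "AmericanExpress\u200E"; // trailing LEFT-TO-RIGHT MARK
        string name = StripNonAscii(raw).Trim().ToLowerInvariant();
        Console.WriteLine(name + " " + name.Length); // americanexpress 15

        // The narrower variant keeps legitimate non-ASCII letters such as U+00E9.
        Console.WriteLine(StripFormatChars("caf\u00E9\u200E").Length); // 4
    }
}
```

Trim() alone does not help here, because U+200E is not classified as white space.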


Solution

  • After the final s there is one more character, which is invisible but still counts as a char. Look at index 15 in the debugger:

    name[15]    8206 '‎'
    

    Char 8206 is U+200E, LEFT-TO-RIGHT MARK; for details see http://www.fileformat.info/info/unicode/char/200e/index.htm

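    possible solution: keep only the ASCII characters. Note that Encoding.ASCII on its own replaces each non-ASCII character with '?' rather than dropping it, so the encoding needs an empty replacement fallback:

    var ascii = Encoding.GetEncoding("us-ascii", new EncoderReplacementFallback(string.Empty), new DecoderReplacementFallback(string.Empty));
    var name = ascii.GetString(ascii.GetBytes(line.ToLower().Trim()));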
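The invisible character described above can also be located without the debugger by printing the code point of anything unexpected in the string. This is a rough sketch: the helper name is hypothetical, and it assumes that a valid screen name consists only of letters, digits and underscores.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class InvisibleCharFinder
{
    // Describes every char that is not a letter, digit or underscore.
    public static List<string> Suspicious(string s)
    {
        var found = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsLetterOrDigit(c) || c == '_') continue;
            // Report position, code point in hex, and Unicode category.
            found.Add($"index {i}: U+{(int)c:X4} ({CharUnicodeInfo.GetUnicodeCategory(c)})");
        }
        return found;
    }

    static void Main()
    {
        foreach (var entry in Suspicious("AmericanExpress\u200E"))
            Console.WriteLine(entry); // index 15: U+200E (Format)
    }
}
```

Logging these entries alongside the rejected names would show why a name that looks 15 characters long fails the length check.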