c# roslyn roslyn-code-analysis

When using CSharpCompilation the diagnostics messages report non-breaking spaces in file as error


I have source files with non-breaking spaces. Sometimes it is just 0xA0, but sometimes it is 0xC2 0xA0 (as a pair). When I parse these files and feed them to CSharpCompilation, it returns diagnostics like this:

c:\xyz\SomeFile.cs(8,1): error CS1056: Unexpected character '�'

Here is how I compile the code:

private static readonly List<KeyValuePair<string, ReportDiagnostic>> s_specificDiagnosticOptions = new[]
{
    // Assembly 'AssemblyName1' uses 'TypeName' which has a higher version than referenced assembly 'AssemblyName2'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1705
    "CS1705",
    // Assuming assembly reference "Assembly Name #1" matches "Assembly Name #2", you may need to supply runtime policy
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1701
    "CS1701",
    // Assuming assembly reference "Assembly Name #1" used by "Type Name #1" matches identity "Assembly Name #2" of "Type Name #2", you may need to supply runtime policy
    "CS1702",
    // 'member1' hides inherited member 'member2'. Use the new keyword if hiding was intended
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/cs0108
    "CS0108",
    // The member 'member' does not hide an inherited member. The new keyword is not required
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0109
    "CS0109",
    // 'function1' hides inherited member 'function2'. To make the current method override that implementation, add the override keyword. Otherwise add the new keyword.
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0114
    "CS0114",
    // The result of the expression is always 'value1' since a value of type 'value2' is never equal to 'null' of type 'value3'
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0472
    "CS0472",
    // 'class' overrides Object.Equals(object o) but does not override Object.GetHashCode()
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0659
    "CS0659",
    // Unreachable code detected
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0162
    "CS0162",
    // Invalid name for a preprocessing symbol; '' is not a valid identifier
    "CS8301",
    // The annotation for nullable reference types should only be used in code within a '#nullable' annotations context
    "CS8632",
    // The using directive for 'XYZ' appeared previously as global using
    "CS8933",
    // Unnecessary using directive
    "CS8019",
    // 'member' is obsolete
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0612
    "CS0612",
    // 'member' is obsolete: 'text'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS0618
    "CS0618"
}.Select(id => new KeyValuePair<string, ReportDiagnostic>(id, ReportDiagnostic.Suppress)).ToList();
...
public static SyntaxTree ParseBytes(this byte[] bytes, CSharpParseOptions options, string filePath) => CSharpSyntaxTree.ParseText(SourceText.From(bytes, bytes.Length), options, filePath);
...
var compilation = CSharpCompilation.Create(asmProps.AssemblyName,
    csFiles.Select(o => o.Bytes.ParseBytes(parseOptions, o.FilePath)),
    references,
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary,
        assemblyIdentityComparer: DesktopAssemblyIdentityComparer.Default,
        generalDiagnosticOption: ReportDiagnostic.Error,
        specificDiagnosticOptions: s_specificDiagnosticOptions));

The source files are compiled just fine by the C# compiler on the command line, so I know using non-breaking spaces should not be a problem.

The CSharpParseOptions object just contains the define constants and specifies the latest version of the language.

How can I instruct CSharpCompilation not to freak out upon seeing the non-breaking spaces? I am wary of just suppressing CS1056; it does not seem right.

EDIT 1

I examined the situation closely. First I thought to sanitize all the source files containing 0xA0 or the combination of 0xC2A0. But that is too costly (thousands of files) and redundant - 99.9% of these files do not fail compilation. Not sure why. But there is one file which does fail.

It does not have a BOM. It does have 0xA0 characters (not the 2 byte sequence 0xC2A0).

This file shows up cleanly in Notepad++ (which indicates ASCII encoding). Command line build works fine too.

But UTF8 encoding is indeed unable to represent it correctly.

Frankly, this file looks botched to me, but I have a problem - it passes compilation on the command line! Otherwise, it would not have been pushed to master due to our PR build policy.

I know I can sanitize it, but how would I know which files to sanitize in general? Usually I do not run the compilation with diagnostics, because it slows it down significantly and I know all the files in master should pass compilation, so I can afford skipping diagnostics.

Ideas on how to resolve my issue in a general and efficient way are most welcome.


Solution

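  • Digging a little deeper, I think the encoding of the failing file is win-1252, where a lone 0xA0 is the non-breaking space. Other files without any BOM have the pair 0xC2 0xA0, which is the UTF8 encoding of the same character, so those files are UTF8.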
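
    To see why the two byte patterns behave differently, here is a minimal sketch using only the BCL. It uses Latin-1 (code page 28591) as a stand-in for win-1252, an assumption made only because Latin-1 needs no encoding provider and maps 0xA0 to the same character. The class name NbspDecodingDemo is hypothetical.

    ```csharp
    using System;
    using System.Text;

    public static class NbspDecodingDemo
    {
        // Decodes the bytes and returns the code point of the first resulting char.
        public static int FirstCodePoint(Encoding encoding, params byte[] bytes) =>
            encoding.GetString(bytes)[0];

        public static void Main()
        {
            // Assumption: Latin-1 stands in for win-1252 (both map 0xA0 to U+00A0).
            Encoding latin1 = Encoding.GetEncoding(28591);

            // A lone 0xA0 is not valid UTF8, so the default decoder substitutes U+FFFD,
            // which matches the '�' shown in the CS1056 message.
            Console.WriteLine($"UTF8    A0    -> U+{FirstCodePoint(Encoding.UTF8, 0xA0):X4}");
            // The pair C2 A0 is the UTF8 encoding of the non-breaking space.
            Console.WriteLine($"UTF8    C2 A0 -> U+{FirstCodePoint(Encoding.UTF8, 0xC2, 0xA0):X4}");
            // In a single-byte code page, 0xA0 alone is the non-breaking space.
            Console.WriteLine($"Latin-1 A0    -> U+{FirstCodePoint(latin1, 0xA0):X4}");
        }
    }
    ```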

    Inspired by the function StreamReader.DetectEncoding, I have come up with the following variant:

    private static readonly Encoding s_unicodeBigEndianWithBOM = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
    private static readonly Encoding s_unicodeLittleEndianWithBOM = new UnicodeEncoding(bigEndian: false, byteOrderMark: true);
    private static readonly Encoding s_utf32BigEndianWithBOM = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
    private static readonly Encoding s_utf32LittleEndianWithBOM = new UTF32Encoding(bigEndian: false, byteOrderMark: true);
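    // On .NET Core and .NET 5+, code page 1252 is available only after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    private static readonly Encoding s_win1252 = Encoding.GetEncoding(1252);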
    ...
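    private Encoding DetectEncoding(byte[] bytes)
    {
        const byte NBSP_PREFIX = 0xC2;
        const byte NBSP = 0xA0;

        if (bytes.Length < 2)
        {
            return Encoding.UTF8;
        }
        // UTF-16 big-endian BOM: FE FF
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return s_unicodeBigEndianWithBOM;
        }
        // FF FE is the UTF-16 little-endian BOM, unless it is followed by 00 00,
        // in which case it is the UTF-32 little-endian BOM.
        if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        {
            if (bytes.Length < 4 || bytes[2] != 0 || bytes[3] != 0)
            {
                return s_unicodeLittleEndianWithBOM;
            }
            return s_utf32LittleEndianWithBOM;
        }
        // UTF-8 BOM: EF BB BF
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        {
            return Encoding.UTF8;
        }
        // UTF-32 big-endian BOM: 00 00 FE FF
        if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xFE && bytes[3] == 0xFF)
        {
            return s_utf32BigEndianWithBOM;
        }
        // No BOM. Heuristic: if the first 0xA0 is not preceded by 0xC2, assume win-1252.
        // Only the first 0xA0 is examined, and 0xA0 is also a valid continuation byte
        // in other UTF-8 sequences (for example C3 A0 for 'à'), which this check
        // would classify as win-1252.
        int pos = bytes.AsSpan().IndexOf(NBSP);
        if (pos >= 0 && (pos == 0 || bytes[pos - 1] != NBSP_PREFIX))
        {
            return s_win1252;
        }
        return Encoding.UTF8;
    }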
    

    Notice the last part of the function:

    int pos = bytes.AsSpan().IndexOf(NBSP);
    if (pos >= 0 && (pos == 0 || bytes[pos - 1] != NBSP_PREFIX))
    {
        return s_win1252;
    }
    

    That part, which tries to guess whether the encoding is win-1252, is missing from the standard implementation.

    Using the encoding returned by this function when parsing the files seems to have resolved my issues.
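
    A possible alternative to looking for 0xA0 specifically is to validate the whole buffer as UTF8 and fall back to a single-byte code page only when validation fails. This is a different technique from the function above, and I have not run it against the files in question. A minimal sketch, assuming files with a BOM are handled separately. The names FallbackEncodingDetector and Detect are hypothetical, and the fallback encoding is a parameter because win-1252 may need CodePagesEncodingProvider to be registered first:

    ```csharp
    using System;
    using System.Text;

    public static class FallbackEncodingDetector
    {
        // A UTF8 decoder that throws on invalid byte sequences instead of
        // silently substituting U+FFFD.
        private static readonly Encoding s_strictUtf8 =
            new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        // Returns UTF8 when the whole buffer is valid UTF8, otherwise the fallback.
        public static Encoding Detect(byte[] bytes, Encoding fallback)
        {
            try
            {
                // GetCharCount walks the entire buffer and throws
                // DecoderFallbackException at the first invalid sequence.
                s_strictUtf8.GetCharCount(bytes);
                return Encoding.UTF8;
            }
            catch (DecoderFallbackException)
            {
                return fallback;
            }
        }

        public static void Main()
        {
            // Latin-1 is the fallback here only because it is always available.
            Encoding latin1 = Encoding.GetEncoding(28591);

            byte[] loneNbsp = { (byte)'a', 0xA0, (byte)'b' };        // invalid UTF8
            byte[] utf8Nbsp = { (byte)'a', 0xC2, 0xA0, (byte)'b' };  // valid UTF8
            byte[] utf8AGrave = { 0xC3, 0xA0 };                      // 'à': 0xA0 as a continuation byte

            Console.WriteLine(Detect(loneNbsp, latin1).WebName);
            Console.WriteLine(Detect(utf8Nbsp, latin1).WebName);
            Console.WriteLine(Detect(utf8AGrave, latin1).WebName);
        }
    }
    ```

    The returned encoding can then be passed as the third argument of SourceText.From(bytes, bytes.Length, encoding) inside ParseBytes. The cost is one extra pass over each buffer, but it does not depend on which non-ASCII byte happens to appear first.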