I have source files with non-breaking spaces. Sometimes it is just 0xA0, but sometimes it is 0xC2 0xA0 (as a pair).
When I parse these files and feed them to CSharpCompilation, it returns bad diagnostic messages like this:
c:\xyz\SomeFile.cs(8,1): error CS1056: Unexpected character '�'
Here is how I compile the code:
private static readonly List<KeyValuePair<string, ReportDiagnostic>> s_specificDiagnosticOptions = new[]
{
    // Assembly 'AssemblyName1' uses 'TypeName' which has a higher version than referenced assembly 'AssemblyName2'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1705
    "CS1705",
    // Assuming assembly reference "Assembly Name #1" matches "Assembly Name #2", you may need to supply runtime policy
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS1701
    "CS1701",
    // Assuming assembly reference "Assembly Name #1" used by "Type Name #1" matches identity "Assembly Name #2" of "Type Name #2", you may need to supply runtime policy
    "CS1702",
    // 'member1' hides inherited member 'member2'. Use the new keyword if hiding was intended
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/cs0108
    "CS0108",
    // The member 'member' does not hide an inherited member. The new keyword is not required
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0109
    "CS0109",
    // 'function1' hides inherited member 'function2'. To make the current method override that implementation, add the override keyword. Otherwise add the new keyword.
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0114
    "CS0114",
    // The result of the expression is always 'value1' since a value of type 'value2' is never equal to 'null' of type 'value3'
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0472
    "CS0472",
    // 'class' overrides Object.Equals(object o) but does not override Object.GetHashCode()
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0659
    "CS0659",
    // Unreachable code detected
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0162
    "CS0162",
    // Invalid name for a preprocessing symbol; '' is not a valid identifier
    "CS8301",
    // The annotation for nullable reference types should only be used in code within a '#nullable' annotations context
    "CS8632",
    // The using directive for 'XYZ' appeared previously as global using
    "CS8933",
    // Unnecessary using directive
    "CS8019",
    // 'member' is obsolete
    // https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0612
    "CS0612",
    // 'member' is obsolete: 'text'
    // https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/CS0618
    "CS0618"
}.Select(id => new KeyValuePair<string, ReportDiagnostic>(id, ReportDiagnostic.Suppress)).ToList();
...
public static SyntaxTree ParseBytes(this byte[] bytes, CSharpParseOptions options, string filePath) => CSharpSyntaxTree.ParseText(SourceText.From(bytes, bytes.Length), options, filePath);
...
var compilation = CSharpCompilation.Create(asmProps.AssemblyName,
    csFiles.Select(o => o.Bytes.ParseBytes(parseOptions, o.FilePath)),
    references,
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary,
        assemblyIdentityComparer: DesktopAssemblyIdentityComparer.Default,
        generalDiagnosticOption: ReportDiagnostic.Error,
        specificDiagnosticOptions: s_specificDiagnosticOptions));
The source files are compiled just fine by the C# compiler on the command line, so I know using non-breaking spaces should not be a problem.
The CSharpParseOptions object just contains the define constants and specifies the latest version of the language.
How can I instruct the CSharpCompiler not to freak out upon seeing the non-breaking spaces? I am cautious about just suppressing CS1056; it does not seem right.
EDIT 1
I examined the situation closely. At first I thought of sanitizing all the source files containing 0xA0 or the pair 0xC2 0xA0, but that is too costly (thousands of files) and mostly unnecessary: 99.9% of these files do not fail compilation, and I am not sure why. There is one file, however, which does fail.
It does not have a BOM. It does have 0xA0 characters (not the 2-byte sequence 0xC2 0xA0).
This file shows up cleanly in Notepad++ (which indicates ASCII encoding). Command line build works fine too.
But UTF8 encoding is indeed unable to represent it correctly.
Frankly, this file looks botched to me, but I have a problem - it passes compilation on the command line! Otherwise, it would not have been pushed to master due to our PR build policy.
I know I can sanitize it, but how would I know which files to sanitize in general? Usually I do not run the compilation with diagnostics, because it slows it down significantly and I know all the files in master should pass compilation, so I can afford skipping diagnostics.
Ideas on how to resolve my issue in a general and efficient way are most welcome.
Digging a little deeper I think the encoding of the file is win-1252. Some files without any BOM have the pair 0xC2A0, which means they are UTF8.
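To make the difference between the two byte patterns concrete, here is a minimal sketch that decodes a lone 0xA0 and the pair 0xC2 0xA0 with both encodings. The class and method names are made up for illustration, and it assumes .NET Core / .NET 5+, where code page 1252 is only available after registering CodePagesEncodingProvider (on .NET Framework the registration should not be needed).

```csharp
using System;
using System.Text;

public static class NbspDecodingDemo
{
    // Decodes the bytes with the given encoding and returns the
    // first resulting character formatted as "U+XXXX".
    public static string FirstCodePoint(Encoding encoding, byte[] bytes)
    {
        string text = encoding.GetString(bytes);
        return "U+" + ((int)text[0]).ToString("X4");
    }

    public static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 has to be
        // made available by registering the code pages provider first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        byte[] loneNbsp = { 0xA0 };        // NBSP as stored by win-1252
        byte[] utf8Nbsp = { 0xC2, 0xA0 };  // NBSP as stored by UTF-8

        // A lone 0xA0 is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(FirstCodePoint(Encoding.UTF8, loneNbsp)); // U+FFFD
        Console.WriteLine(FirstCodePoint(Encoding.UTF8, utf8Nbsp)); // U+00A0
        Console.WriteLine(FirstCodePoint(win1252, loneNbsp));       // U+00A0
    }
}
```

The U+FFFD in the first case is the '�' that shows up in the CS1056 message.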
Inspired by the function StreamReader.DetectEncoding, I have come up with the following variant:
private static readonly Encoding s_unicodeBigEndianWithBOM = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
private static readonly Encoding s_unicodeLittleEndianWithBOM = new UnicodeEncoding(bigEndian: false, byteOrderMark: true);
private static readonly Encoding s_utf32BigEndianWithBOM = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
private static readonly Encoding s_utf32LittleEndianWithBOM = new UTF32Encoding(bigEndian: false, byteOrderMark: true);
private static readonly Encoding s_win1252 = Encoding.GetEncoding(1252);
...
private Encoding DetectEncoding(byte[] bytes)
{
    const byte NBSP_PREFIX = 0xC2;
    const byte NBSP = 0xA0;
    if (bytes.Length < 2)
    {
        return Encoding.UTF8;
    }
    // UTF-16 big-endian BOM: FE FF
    if (bytes[0] == 254 && bytes[1] == byte.MaxValue)
    {
        return s_unicodeBigEndianWithBOM;
    }
    // FF FE is the UTF-16 little-endian BOM; FF FE 00 00 is UTF-32 little-endian
    if (bytes[0] == byte.MaxValue && bytes[1] == 254)
    {
        if (bytes.Length < 4 || bytes[2] != 0 || bytes[3] != 0)
        {
            return s_unicodeLittleEndianWithBOM;
        }
        return s_utf32LittleEndianWithBOM;
    }
    // UTF-8 BOM: EF BB BF
    if (bytes.Length >= 3 && bytes[0] == 239 && bytes[1] == 187 && bytes[2] == 191)
    {
        return Encoding.UTF8;
    }
    // UTF-32 big-endian BOM: 00 00 FE FF
    if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 254 && bytes[3] == byte.MaxValue)
    {
        return s_utf32BigEndianWithBOM;
    }
    // No BOM: guess win-1252 if the first 0xA0 byte is not preceded by 0xC2.
    // Only the first occurrence of 0xA0 is examined.
    int pos = bytes.AsSpan().IndexOf(NBSP);
    if (pos >= 0 && (pos == 0 || bytes[pos - 1] != NBSP_PREFIX))
    {
        return s_win1252;
    }
    return Encoding.UTF8;
}
Notice the last part of the function:
    int pos = bytes.AsSpan().IndexOf(NBSP);
    if (pos >= 0 && (pos == 0 || bytes[pos - 1] != NBSP_PREFIX))
    {
        return s_win1252;
    }
That part, which tries to guess whether the encoding is win-1252, is missing from the standard implementation.
Using the encoding returned by this function when parsing the files seems to have resolved my issues.
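One caveat with the 0xA0 check: 0xA0 is also a continuation byte in other UTF-8 sequences (for example, "à" is 0xC3 0xA0), so a BOM-less UTF-8 file containing such a character would be classified as win-1252. A possible alternative for the BOM-less case, sketched below, is to validate the whole buffer as strict UTF-8 and fall back only when that fails. The class and method names are hypothetical. I believe the command-line compiler does something similar when no code page is specified, but treat that as an assumption rather than a confirmed fact.

```csharp
using System;
using System.Text;

public static class FallbackEncodingDetector
{
    // Strict UTF-8 decoder: throws DecoderFallbackException on invalid
    // byte sequences instead of substituting U+FFFD.
    private static readonly Encoding s_strictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Intended for buffers without a BOM: returns UTF-8 if the bytes are
    // valid UTF-8, otherwise the supplied fallback (for example win-1252).
    public static Encoding DetectNoBom(byte[] bytes, Encoding fallback)
    {
        try
        {
            // Walks the whole buffer without allocating the decoded string.
            s_strictUtf8.GetCharCount(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return fallback;
        }
    }
}
```

In DetectEncoding this could replace the final IndexOf check, e.g. `return FallbackEncodingDetector.DetectNoBom(bytes, s_win1252);`. It scans every file completely, and the exception path is only taken for the files that are not valid UTF-8, so whether it is fast enough for thousands of files would need measuring.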