c# file-encodings

Detect if file contains text


Possible Duplicate:
How can I determine if a file is binary or text in c#?
C# - Check if File is Text Based

To better understand multithreading and asynchronous tasks, I wrote a simple application in C# to count the total number of lines of code in a project (directory).

Currently, I open every file in the directory and count its lines. However, that includes non-text files (jpg, png, exe, etc.). Is there a way to detect whether a file is a text file, possibly by checking for ASCII encoding or something similar?


Solution

  • Generally, you cannot reliably detect whether a file is a text file. The problem starts with the definition: what actually is "a text file"? You already hinted at encodings, but those in particular cannot be detected reliably (see, for example, Notepad's struggle).

    That said, you can employ heuristics to do your best (including, but of course not limited to, file extensions; excluding well-known non-text file types such as EXE, DLL, ZIP, and image files by recognizing their signatures; maybe combined with the approach used by browsers or Notepad).

    Depending on your application, I guess it would be quite feasible to just let the user select the file types to scan (maybe with a default list of extensions to include, like *.cs, *.txt, *.resx, *.xml, ...). If a file type/extension is not in the default list and was not added by the user, it is not counted. If the user adds a file type/extension that is not a "text file", the results are not useful.

    But weighing the effort against the fact that an automatic result will never be 100% exact (at detecting all possible files), it should be good enough.
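
The heuristics described above could be combined roughly as follows. This is only a minimal sketch, not a definitive implementation. The class name `TextFileHeuristic`, its method names, the sample size, and the short signature table are all assumptions made for illustration. The NUL-byte check is a common rule of thumb that was swapped in here, and it will misclassify UTF-16 or UTF-32 text as binary. A text file that happens to start with the letters "MZ" or "PK" would also be rejected.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper; all names here are illustrative.
public static class TextFileHeuristic
{
    // Default allow list from the answer; the user would be able to extend it.
    public static readonly HashSet<string> DefaultExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".cs", ".txt", ".resx", ".xml" };

    // Leading bytes ("signatures") of a few well-known binary formats.
    private static readonly byte[][] BinarySignatures =
    {
        new byte[] { 0x4D, 0x5A },             // "MZ": EXE / DLL
        new byte[] { 0x50, 0x4B, 0x03, 0x04 }, // "PK\x03\x04": ZIP
        new byte[] { 0x89, 0x50, 0x4E, 0x47 }, // PNG
        new byte[] { 0xFF, 0xD8, 0xFF },       // JPEG
        new byte[] { 0x47, 0x49, 0x46, 0x38 }, // "GIF8": GIF
    };

    // True if the file's extension is in the allow list.
    public static bool HasAllowedExtension(string path, ISet<string> allowed)
    {
        return allowed.Contains(Path.GetExtension(path));
    }

    // True if the sampled bytes start with a known signature or contain a NUL byte.
    public static bool LooksBinary(byte[] head)
    {
        foreach (var signature in BinarySignatures)
        {
            if (head.Length >= signature.Length &&
                head.Take(signature.Length).SequenceEqual(signature))
            {
                return true;
            }
        }

        // Assumption: NUL bytes do not occur in ASCII/UTF-8 source files.
        return Array.IndexOf(head, (byte)0) >= 0;
    }

    // Combines both checks; only the first sampleSize bytes are inspected.
    public static bool ShouldCount(string path, ISet<string> allowed, int sampleSize = 8000)
    {
        if (!HasAllowedExtension(path, allowed))
        {
            return false;
        }

        var buffer = new byte[sampleSize];
        int read;
        using (var stream = File.OpenRead(path))
        {
            // A single Read may return fewer bytes than requested; that is fine for a sample.
            read = stream.Read(buffer, 0, buffer.Length);
        }

        Array.Resize(ref buffer, read);
        return !LooksBinary(buffer);
    }
}
```

In the line counter, a call such as `TextFileHeuristic.ShouldCount(path, TextFileHeuristic.DefaultExtensions)` would then decide whether a file is opened at all. Checking the extension first keeps the common case cheap, because most unwanted files are skipped without being read.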