file haskell binaryfiles

How to find if a file is binary


I am trying to read the text of all files in a folder with the following code:

readALine :: FilePath -> IO ()
readALine fname = do 
  putStr . show $ "Filename: " ++ fname ++ "; "
  fs <- getFileSize fname
  if fs > 0 then do 
      hand <- openFile fname ReadMode
      fline  <- hGetLine hand
      hClose hand
      print $ "First line: " <> fline
  else return ()

However, some of these files are binary. How can I find out whether a given file is binary? I could not find any such function on Hoogle: https://hoogle.haskell.org/?hoogle=binary%20file

Thanks for your help.

Edit: By binary I mean that the file contains unprintable characters. I am not sure of the proper term for such files.

I installed the utf8-string package and modified the code:

readALine :: FilePath -> IO ()
readALine fname = do 
  putStr . show $ "Filename: " ++ fname ++ "; "
  fs <- getFileSize fname
  if fs > 0 then do 
      hand <- openFile fname ReadMode
      fline  <- hGetLine hand
      hClose hand
      if isUTF8Encoded (unpack fline) then do
        print $ "Not binary file."
        print $ "First line: " <> fline
      else return ()
  else return ()

Now it works, but on encountering a 'binary' executable file (called esync.x), there is an error at the hGetLine hand expression:

"Filename: ./esync.x; "firstline2.hs: ./esync.x: hGetLine: invalid argument (invalid byte sequence)

How can I check the characters read from the file handle itself?


Solution

  • The definition of binary is quite vague, but I will assume you mean content that is not valid UTF-8 text.

    You should use toString from Data.ByteString.UTF8, which replaces invalid UTF-8 byte sequences with a replacement character instead of failing with an error.

    Converting your example to use UTF-8 ByteStrings:

    import Data.Monoid
    import System.IO
    import System.Directory
    import qualified Data.ByteString as B
    import qualified Data.ByteString.UTF8 as B
    
    readALine :: FilePath -> IO ()
    readALine fname = do
      putStr . show $ "Filename: " ++ fname ++ "; "
      fs <- getFileSize fname
      if fs > 0 then do
          hand <- openFile fname ReadMode
          fline  <- B.hGetLine hand
          hClose hand
          print $ "First line: " <> B.toString fline
      else return ()
    

    This code doesn't fail on binary input, but it does not actually detect binary content. To detect it, look for B.replacement_char in the decoded string. To detect non-printable characters, you can also look for code points below 32 (32 is the space character), keeping in mind that tab, line feed and carriage return fall in that range and are normal in text files.
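
The control-character check can be sketched with only the bytestring package. This is a heuristic, not a definitive test: the names sampleSize, isSuspiciousByte, looksBinary and isProbablyBinary are made up for this example, the 8000-byte sample size is an arbitrary assumption, and text in encodings such as UTF-16 contains zero bytes and would be reported as binary.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import System.IO (IOMode (ReadMode), withBinaryFile)

-- | Number of leading bytes to inspect (an arbitrary choice for this sketch).
sampleSize :: Int
sampleSize = 8000

-- | True for control bytes that rarely appear in plain text.
-- Tab (9), line feed (10), form feed (12) and carriage return (13) are allowed.
isSuspiciousByte :: Word8 -> Bool
isSuspiciousByte w = w < 32 && w `notElem` [9, 10, 12, 13]

-- | Heuristic: a chunk looks binary if it contains any suspicious control byte.
looksBinary :: B.ByteString -> Bool
looksBinary = B.any isSuspiciousByte

-- | Read at most 'sampleSize' bytes in binary mode and apply the heuristic.
-- B.hGet is strict, so the read finishes before the handle is closed.
isProbablyBinary :: FilePath -> IO Bool
isProbablyBinary path =
  withBinaryFile path ReadMode $ \h -> looksBinary <$> B.hGet h sampleSize
```

In readALine, one could call isProbablyBinary fname first and only read and print the first line when it returns False. Because the file is read as raw bytes, no decoding error like the one from hGetLine can occur at this stage.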