
Simple way to convert txt file from UTF-8 to ASCII


I am trying to convert a single file from UTF-8 to ASCII. I found the following script online; it creates the output file but does not change the encoding to ASCII. Why is this not working?

Get-Content -Path "File/Path/to/file.txt" | Out-File -FilePath "File/Path/to/processed.txt" -Encoding ASCII

Solution

  • tl;dr

    -Encoding ASCII does work, though your editor's GUI may still report the resulting file as UTF-8-encoded, for the reasons explained below.


    First, a general caveat:

    • If your input file contains non-ASCII characters, each of them will be replaced with a literal ?, i.e. you'll lose information.
    • Conversely, if your input file is UTF-8-encoded but contains no non-ASCII characters, it is in effect already an ASCII-encoded file; see below.
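
    To see ahead of time which characters would be lost, you can list the non-ASCII characters in the input. The following is a minimal sketch that uses an in-memory sample string (an assumption for illustration); for a real file you would substitute the output of Get-Content -Raw. Note that [regex]::Matches is case-sensitive by default.

    ```powershell
    # Sample text with two non-ASCII characters (é and ï), built from code points
    # so that the example does not depend on the encoding of the script file.
    $text = "caf$([char]0xE9) na$([char]0xEF)ve"

    # Collect every character outside the ASCII (Basic Latin) range.
    $lost = [regex]::Matches($text, '\P{IsBasicLatin}') | ForEach-Object { $_.Value }

    # Each of these would become '?' with -Encoding ASCII.
    $lost -join ', '   # é, ï
    ```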

    ASCII encoding is a subset of UTF-8 encoding (except that ASCII encoding never involves a BOM).

    • Therefore, any (BOM-less) file composed exclusively of bytes representing ASCII characters is by definition also a valid UTF-8 file.
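
    The subset relationship can be checked directly with the .NET encoding classes. This is only an illustrative sketch; the sample strings are my own.

    ```powershell
    # For ASCII-only text, the ASCII and UTF-8 encoders produce identical bytes.
    $asciiOnly  = 'cafe'
    $asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($asciiOnly)
    $utf8Bytes  = [System.Text.Encoding]::UTF8.GetBytes($asciiOnly)
    $sameBytes  = ($asciiBytes -join ' ') -eq ($utf8Bytes -join ' ')
    $sameBytes   # True

    # With a non-ASCII character (é, U+00E9), the two encodings diverge:
    # ASCII substitutes '?' (63), UTF-8 emits a two-byte sequence (195 169).
    $withAccent = 'caf' + [char]0xE9
    [System.Text.Encoding]::ASCII.GetBytes($withAccent) -join ' '   # 99 97 102 63
    [System.Text.Encoding]::UTF8.GetBytes($withAccent) -join ' '    # 99 97 102 195 169
    ```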

    Modern editors default to BOM-less UTF-8; that is, if a file doesn't start with a BOM, they assume that it is UTF-8-encoded, and that's what their GUIs reflect - even if a given file happens to be composed of ASCII characters only.
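
    If you want to rule out a BOM yourself rather than rely on what an editor reports, you can inspect the first three bytes of the file for the UTF-8 BOM (0xEF 0xBB 0xBF). The sketch below writes a throwaway demo file to the temp directory (a hypothetical path chosen for illustration); point $path at your own output file instead.

    ```powershell
    # Hypothetical demo file; substitute the full path of your own output file.
    $path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-check-demo.txt'
    'plain text' | Out-File -Encoding ASCII -FilePath $path

    # Read the raw bytes and test for the UTF-8 BOM at the start of the file.
    $bytes = [System.IO.File]::ReadAllBytes($path)
    $hasUtf8Bom = $bytes.Length -ge 3 -and
                  $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
    $hasUtf8Bom   # False: -Encoding ASCII never writes a BOM
    ```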


    To verify that your output file is indeed only composed of ASCII characters, use the following:

    # This should return $false; '\P{IsBasicLatin}' matches any NON-ASCII character.
    (Get-Content -Raw File/Path/to/processed.txt) -cmatch '\P{IsBasicLatin}'
    

    For an explanation of this test, especially with respect to needing to use -cmatch, the case-sensitive variant of the -match operator, see this answer.
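
    As a complementary check that doesn't depend on how the text is decoded, you can also look at the raw bytes: a pure ASCII file contains no byte above 0x7F. Again a sketch only, using a hypothetical demo file in the temp directory.

    ```powershell
    # Hypothetical demo file; substitute the full path of your own output file.
    $path = Join-Path ([System.IO.Path]::GetTempPath()) 'ascii-check-demo.txt'
    "caf$([char]0xE9)" | Out-File -Encoding ASCII -FilePath $path

    # Count the bytes that fall outside the 7-bit ASCII range.
    $bytes = [System.IO.File]::ReadAllBytes($path)
    $nonAsciiByteCount = @($bytes | Where-Object { $_ -gt 0x7F }).Count
    $nonAsciiByteCount -eq 0   # True: the é was already replaced with ?
    ```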


    A complete example:

    # Write a string that contains non-ASCII characters to a
    # file with -Encoding Ascii.
    # The resulting file will contain 1 line, with content 'caf?'.
    # That is, the "é" character was "lossily" transliterated to (ASCII) "?".
    'café' | Out-File -Encoding Ascii temp.txt
    
    # Examining the file for non-ASCII characters now indicates that
    # there are none, i.e., $false is returned.
    (Get-Content -Raw temp.txt) -cmatch '\P{IsBasicLatin}'