I need to remove the BOM from MyFile.txt using a .cmd file. The file is created like this:
| Out-File -encoding utf8 '%CD%\MyFile.txt'
I need it removed using just one .cmd, i.e., in the lines that follow, and it has to be backwards compatible down to Windows 7. If I needed this only for myself, I would just use -encoding default, but that isn't backwards compatible even with Windows 10. Just one file. There have been many questions about BOMs in different situations; my issue is that I have to use a single .cmd: I already have UTF-8 with a BOM and I need it without the BOM. Please help me.
I was trying to use PowerShell, but the issue is that PowerShell syntax isn't really compatible with cmd: it complains about unrecognized syntax every time I try anything from the very popular question Using PowerShell to write a file in UTF-8 without the BOM.
PowerShell is indeed your best bet, and while you cannot directly use PowerShell commands from cmd.exe / a batch file, you can pass them to powershell.exe, the Windows PowerShell CLI (the solutions below also work on the no-longer-supported Windows 7, as requested).
Preface:
If your file is created with Windows PowerShell code, as seems to be the case, you can avoid your problem altogether by writing a BOM-less UTF-8 file to begin with, using one of the workarounds summarized in this answer in lieu of Out-File -Encoding utf8.
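For instance, a minimal sketch of one such workaround, which relies on New-Item -Value writing BOM-less UTF-8 even in Windows PowerShell (Get-Stuff is a hypothetical stand-in for whatever command produces your text):
# Sketch only: Get-Stuff is a hypothetical placeholder for your text-producing command.
$null = New-Item -Force "$PWD\MyFile.txt" -Value (Get-Stuff | Out-String)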
The solutions below address your question as asked:
How to remove a UTF-8 BOM after the fact, from an existing file, relying on Windows PowerShell, the legacy Windows-only edition of PowerShell whose latest and last version is 5.1, which ships with Windows since Windows 7 / Windows Server 2008 R2.[1]
Additionally, the use case of transcoding the input file, i.e. saving it with a different character encoding, is addressed.
Here's a sample batch file that demonstrates a solution, i.e. it converts a UTF-8 file with BOM to a BOM-less one in-place (best to make a backup copy first):
@echo off & setlocal
:: Specify the input file, assumed to be a UTF-8 file *with-BOM*.
set "targetFile=%CD%\MyFile.txt"
:: Call Windows PowerShell in order to
:: convert the file to a *BOM-less* UTF-8 file.
powershell -noprofile -c $null = New-Item $env:targetFile -Force -Value (Get-Content -Raw -LiteralPath $env:targetFile)
Note: The -c (-Command) and -noprofile CLI parameters aren't strictly necessary, but are included for conceptual clarity and to avoid unnecessary processing of profile files, respectively.
Alternatively, using the [IO.File]::WriteAllText() .NET API:
:: ...
powershell -noprofile -c [IO.File]::WriteAllText((Convert-Path -LiteralPath $env:targetFile), (Get-Content -Raw -LiteralPath $env:targetFile))
Note:
The PowerShell code takes advantage of the fact that New-Item, when given a -Value argument, creates a BOM-less UTF-8 file, even in Windows PowerShell (the legacy, ships-with-Windows, Windows-only edition of PowerShell whose latest and last version is 5.1); ditto for [IO.File]::WriteAllText() - see this answer for details. Get-Content automatically recognizes a UTF-8 file with a BOM as such (and faithfully reads it into a .NET string on which New-Item then operates).
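To verify the result, you can dump the file's first three bytes: a UTF-8 BOM shows up as 239 187 191 (0xEF 0xBB 0xBF), whereas a BOM-less file starts directly with its content. A quick diagnostic sketch:
:: Print the first 3 bytes; 239 187 191 would indicate a UTF-8 BOM.
powershell -noprofile -c (Get-Content -Encoding Byte -TotalCount 3 -LiteralPath $env:targetFile) -join ' '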
Note: The above uses a text-based approach; strictly speaking, if you know your input files are always UTF-8-encoded and always have a BOM, and you want to convert them to BOM-less UTF-8, you can alternatively use byte processing, namely by simply rewriting the file with the first 3 bytes (which comprise the BOM) omitted. However, the text-based approach, while more memory-intensive, offers more flexibility, especially when transcoding is needed, as discussed later:
js2010's answer shows a conceptually straightforward way to remove the first 3 bytes; however, since each byte from the input file is sent individually through the pipeline, processing is quite slow; on the flip side, this approach keeps memory consumption low.
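For reference, a sketch of what such a byte-by-byte pipeline might look like; note that it writes to a new file with a hypothetical .nobom suffix rather than in place, and that Select-Object -Skip requires PowerShell v3+:
:: Sketch: slow but memory-friendly; writes the BOM-less bytes to a new file.
powershell -noprofile -c Get-Content -Encoding Byte -LiteralPath $env:targetFile ^| Select-Object -Skip 3 ^| Set-Content -Encoding Byte -LiteralPath "$($env:targetFile).nobom"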
A faster alternative - which reads all bytes into memory first - is the following (a memory-friendly alternative is possible, but requires more work; also, if you were to call PowerShell (Core) 7 instead, via pwsh.exe, you'd have to replace -Encoding Byte with -AsByteStream):
:: IMPORTANT: If the input file is *not* a UTF-8 file
:: with a BOM, the following will *corrupt it*.
powershell -noprofile -c [IO.File]::WriteAllBytes((Convert-Path -LiteralPath $env:targetFile), [Linq.Enumerable]::Skip((Get-Content -Raw -Encoding Byte -LiteralPath $env:targetFile), 3))
If, judging by your later comments, you're actually trying to transcode your file, i.e. to switch to a different character encoding, you can use Set-Content with an -Encoding argument; however, in Windows PowerShell you're limited to a fixed number of Unicode encodings plus only the active OEM and ANSI code pages, which you can target with -Encoding Oem and -Encoding Default, respectively - and Get-Content and Set-Content even default to the latter.
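For instance, to see which code page -Encoding Default currently maps to on a given machine, a quick check:
:: Report the system's active ANSI code page, which -Encoding Default targets.
powershell -noprofile -c [Text.Encoding]::Default.WebName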
Thus, if you want to transcode your file to a (by definition BOM-less) file encoded with the Windows-1251 code page (Cyrillic), for instance:
The following only works in Windows PowerShell if that code page is also the system's active ANSI code page (verify with [Text.Encoding]::Default.WebName). When calling from a batch file, you can switch to Windows-1251 with chcp 1251 first, but note that, unless you explicitly restore the previous code page afterwards, the change remains in effect for the remainder of the batch file and, if your batch file was started from an interactive cmd.exe session, for the remainder of that session.
-Encoding Default, which selects the system's active ANSI code page, is implied with Set-Content and therefore omitted below (you would need it with Out-File, however, though use of that cmdlet isn't necessary for writing text):
:: Transcode the file to Windows-1251,
:: assuming the latter is the active ANSI code page.
:: Note: `-Encoding Default` is implied for Set-Content in Windows PowerShell
powershell -noprofile -c Set-Content -NoNewLine -LiteralPath $env:targetFile -Value (Get-Content -Raw -LiteralPath $env:targetFile)
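If you do take the chcp 1251 route mentioned above, a sketch of the save-and-restore pattern around the call (the for /f parsing of chcp's output assumes the English "Active code page: NNN" format, so treat it as an assumption):
:: Sketch: remember the current code page, switch to 1251, restore afterwards.
:: NOTE: parsing chcp's output is locale-dependent.
for /f "tokens=2 delims=:" %%p in ('chcp') do set /a "prevCP=%%p"
chcp 1251 >nul
powershell -noprofile -c Set-Content -NoNewLine -LiteralPath $env:targetFile -Value (Get-Content -Raw -LiteralPath $env:targetFile)
chcp %prevCP% >nul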
Otherwise, you need to use .NET APIs directly: use [Text.Encoding]::GetEncoding(1251) to obtain the Windows-1251 encoding irrespective of the system's active ANSI code page and pass it to [IO.File]::WriteAllText():[2]
:: Transcode the file to Windows-1251, irrespective
:: of what the active ANSI code page is.
powershell -noprofile -c [IO.File]::WriteAllText((Convert-Path -LiteralPath $env:targetFile), (Get-Content -Raw -LiteralPath $env:targetFile), [Text.Encoding]::GetEncoding(1251))
Since Get-Content's -Encoding parameter is subject to the same limitation as Set-Content's, you can replace Get-Content with a call to [IO.File]::ReadAllText(), to which you can similarly pass any [Text.Encoding] instance.
Note that this transcoding is potentially lossy, because not all Unicode characters can be represented in a fixed-width, single-byte encoding such as Windows-1251, given that only 256 characters can be represented.
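For example, replacing Get-Content with [IO.File]::ReadAllText() lets you go the other way; a sketch that reads input assumed to be Windows-1251-encoded and rewrites it as BOM-less UTF-8:
:: Sketch: read the file as Windows-1251, rewrite it as BOM-less UTF-8.
powershell -noprofile -c $f=Convert-Path -LiteralPath $env:targetFile; [IO.File]::WriteAllText($f, [IO.File]::ReadAllText($f, [Text.Encoding]::GetEncoding(1251)))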
Caveat: By reading the entire file into memory first, with Get-Content -Raw, and rewriting it in full, in place - with BOM-less UTF-8 encoding - there is a hypothetical, however unlikely, risk of data loss: if rewriting the file's content gets interrupted, say due to a power outage, the file may be left incomplete.
To eliminate this risk, you'd have to write to a temporary file first and then, once the temporary file has been written successfully, replace the original file with it.
Similarly, you'd need a temporary file if the input file is too large to fit into memory as a whole (which is not typical for text files); in that case, you can combine [IO.File]::ReadLines() with [IO.File]::WriteAllLines() to read and write line by line.
The following solution addresses these problems:
It uses a temporary file and reads and writes lines one at a time.
It is therefore a more robust, memory-friendly solution, albeit at the expense of implementation complexity and speed (though the latter typically won't matter):
@echo off & setlocal
set "targetFile=%CD%\MyFile.txt"
powershell -noprofile -c $ErrorActionPreference='Stop'; $tempFile=New-TemporaryFile; $inFile=Convert-Path -LiteralPath $env:targetFile; [IO.File]::WriteAllLines($tempFile, [IO.File]::ReadLines($inFile)); $tempFile ^| Move-Item -Force -Destination $inFile
Note:
Passing multiple statements to powershell.exe gets unwieldy, as you must either pass them all on a single line or use cmd.exe's line continuations; here's a readable reformulation that you could use if you placed the code in a PowerShell script file (*.ps1) and then called powershell -noprofile -file yourScript.ps1 (though if you took that approach, it'd be worth generalizing the code to accept the input file path as an argument, as sketched after the following snippet):
# PowerShell code for use in a *.ps1 script file.
$ErrorActionPreference = 'Stop'
$tempFile = New-TemporaryFile
$inFile = Convert-Path -LiteralPath $env:targetFile
[IO.File]::WriteAllLines(
  $tempFile,
  [IO.File]::ReadLines($inFile)
)
$tempFile | Move-Item -Force -Destination $inFile
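Here is a sketch of that generalization: a hypothetical removeBom.ps1 that takes the target file path as a parameter instead of relying on an environment variable:
# removeBom.ps1 (hypothetical name) - pass the target file path as an argument,
# e.g.: powershell -noprofile -file removeBom.ps1 "%targetFile%"
param(
  [Parameter(Mandatory=$true)] [string] $LiteralPath
)
$ErrorActionPreference = 'Stop'
$tempFile = New-TemporaryFile
$inFile = Convert-Path -LiteralPath $LiteralPath
[IO.File]::WriteAllLines(
  $tempFile,
  [IO.File]::ReadLines($inFile)
)
$tempFile | Move-Item -Force -Destination $inFile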
Note that while the use of [IO.File]::WriteAllLines() and [IO.File]::ReadLines() without specifying a character encoding may look like a no-op in terms of BOM removal, it does work as intended: .NET recognizes a UTF-8 file with a BOM on reading and, on writing, produces a UTF-8 file without a BOM by default.
If you're looking to transcode your file, as discussed above, you can pass a [System.Text.Encoding] instance as the third argument to [IO.File]::WriteAllLines(), which gives you full access to all encodings supported by .NET; e.g., for Windows-1251:
# PowerShell code for use in a *.ps1 script file.
$ErrorActionPreference = 'Stop'
$tempFile = New-TemporaryFile
$inFile = Convert-Path -LiteralPath $env:targetFile
# Note the [Text.Encoding] argument.
[IO.File]::WriteAllLines(
  $tempFile,
  [IO.File]::ReadLines($inFile),
  [Text.Encoding]::GetEncoding(1251)
)
$tempFile | Move-Item -Force -Destination $inFile
Similarly, you can optionally pass a [Text.Encoding] instance to the [IO.File]::ReadLines() method, in case you also need to specify the input encoding explicitly (which would apply to non-Unicode files and UTF-7 files).
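For instance, a sketch of the reverse transcoding: reading lines assumed to be Windows-1251-encoded and writing them back out as BOM-less UTF-8 (the .NET default), using the same $tempFile / $inFile setup as above:
# Sketch: specify the *input* encoding explicitly; output defaults to BOM-less UTF-8.
[IO.File]::WriteAllLines(
  $tempFile,
  [IO.File]::ReadLines($inFile, [Text.Encoding]::GetEncoding(1251))
)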
[1] These versions originally shipped with v2 of Windows PowerShell, though upgrades to later versions were possible. The solutions in this answer work even in v2.
Calling Windows PowerShell from a batch file (via powershell.exe) is the only way to solve your problem with built-in features; cmd.exe's built-in features are far less powerful and offer no solution, and Windows doesn't ship with a file-transcoding utility (whereas Unix-like platforms come with the iconv utility).
[2] Note that in PowerShell (Core) 7 you wouldn't need to resort to .NET APIs, because the -Encoding parameter there now accepts any [Text.Encoding] instance, either directly, by name (e.g., -Encoding Windows-1251), or by code-page number (e.g., -Encoding 1251); applied to your scenario:
pwsh -noprofile -c Set-Content -Encoding 1251 -NoNewLine -LiteralPath $env:targetFile -Value (Get-Content -Raw -LiteralPath $env:targetFile)