
Git does not recognise files with umlauts correctly on Windows 11 after migration from Mercurial


I am trying to migrate a Mercurial repository to Git on Windows 11, using the following steps in Git Bash:

MINGW64$ ls
hg-repo/ git-repo/
MINGW64$ cd git-repo
MINGW64$ git init
MINGW64$ ~/fast-export/hg-fast-export.sh -r ../hg-repo/ --force -A ../hg-repo/authors.txt -M main

The migration succeeds. Afterwards I need to check out the branch:

MINGW64$ git checkout main

which should result in a working tree with no changes. Instead I get something like the following:

MINGW64$ git status
On branch main
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
    deleted:    Folder1/grünes-Ding.png
Untracked files:
(use "git add <file>..." to include in what will be committed)
    Änderungen/
    Folder1/grünes-Ding.png

So it looks as if "Folder1/grünes-Ding.png" was deleted and then added again. If I try to restore the file, I get the following:

MINGW64$ git restore Folder1/grünes-Ding.png
error: pathspec 'Folder1/grünes-Ding.png' did not match any file(s) known to git

I think Git does not recognise "Folder1/grünes-Ding.png" here because the ü is represented differently inside Git from the way I see it in Git Bash. "Änderungen/" should also be in the repository: if I delete it from the working directory, it shows up with all its files as "deleted" changes, and if I then try to restore those files I get the same kind of error. The files inside this folder do not contain umlauts in their names.
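One way to check the suspicion above is to look at the raw bytes of a name rather than at how the terminal renders it. The following is a minimal sketch: `name_bytes` is a hypothetical helper, not part of Git or hg-fast-export, and the two sample strings are built by hand to show how the same visible "grün" differs between UTF-8 and Latin-1.

```shell
# Hypothetical helper: print the bytes of a string as one hex run.
name_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    printf '\n'
}

# The same visible name "grün", built from octal escapes in two encodings.
name_bytes "$(printf 'gr\303\274n')"   # UTF-8:   6772c3bc6e
name_bytes "$(printf 'gr\374n')"       # Latin-1: 6772fc6e
```

Inside the repository, `git -c core.quotepath=true ls-files` should show non-ASCII bytes of tracked paths as octal escapes, which can be compared against the two patterns above. A path containing `\374` rather than `\303\274` would suggest the filename was imported as Latin-1 bytes.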

My question is: how can I get Git to handle folders and files with umlauts in their names?

The only advice I have found so far regarding umlauts is about displaying them correctly in logs or commit messages, but that is not the problem here.

My Git configuration and locale look like this:

MINGW64$ git config -l
diff.astextplain.textconv=astextplain
http.sslbackend=openssl
http.sslcainfo=C:/Program Files/Git/mingw64/ssl/certs/ca-bundle.crt
core.autocrlf=input
core.fscache=true
core.symlinks=false
pull.rebase=false
init.defaultbranch=main
difftool.sourcetree.cmd=''
mergetool.sourcetree.cmd=''
mergetool.sourcetree.trustexitcode=true
core.repositoryformatversion=0
core.filemode=false
core.bare=false
core.logallrefupdates=true
core.symlinks=false
core.ignorecase=true
core.quotepath=false
core.fsmonitor=true
i18n.logoutputencoding=UTF-8
MINGW64$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=

Solution

  • I played around a little with the options of hg-fast-export and eventually found a solution.

    hg-fast-export has two options that handle encoding: -e and --fe. -e sets the encoding of the commit messages, author names, etc. in Mercurial so that they can be converted to UTF-8, and --fe sets the encoding of the filenames.

    I tried different encodings for the filenames and found that latin1 worked for me. At first, though, I made the mistake of writing -fe instead of --fe. -fe is parsed as -f plus -e, not as --fe, so be aware of this! If you use -e, --fe is automatically set to the same value as well; in my case this then resulted in wrongly encoded commit messages.

    Finally, the migration works like this:

    MINGW64$ ls
    hg-repo/ git-repo/
    MINGW64$ cd git-repo
    MINGW64$ git init
    MINGW64$ ~/fast-export/hg-fast-export.sh -r ../hg-repo/ --force -A ../hg-repo/authors.txt -M main --fe latin1
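The -fe versus --fe pitfall mentioned above follows the usual convention that a single dash introduces a cluster of one-letter options. The sketch below uses the shell's `getopts` builtin purely to illustrate that convention. It is not hg-fast-export's own parser, and `parse_short` is a hypothetical helper that only knows a flag `-f` and an option `-e` taking a value.

```shell
# Hypothetical helper: parse short options and report what was recognised.
parse_short() {
    OPTIND=1          # reset so the function can be called repeatedly
    out=""
    while getopts "fe:" opt "$@"; do
        case "$opt" in
            f) out="${out}-f " ;;
            e) out="${out}-e=${OPTARG} " ;;
            *) out="${out}unknown " ;;
        esac
    done
    printf '%s\n' "${out% }"
}

# "-fe latin1" is read as the flag -f followed by -e with the value latin1.
parse_short -fe latin1   # prints: -f -e=latin1
```

So a command line with -fe latin1 ends up setting -e, not the filename encoding option, which is why the long option has to be spelled with two dashes.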