Tags: git, git-rebase, git-filter-branch, gitattributes, git-filter-repo

Correct way to store .gitattributes (working-tree-encoding) after many commits?


I have a large Git repository (~4,000 commits) that contains files encoded in CP866 and has no .gitattributes file in the project root. Is there a way to add a .gitattributes (*.txt text working-tree-encoding=CP866) and rewrite the whole history as if the file had been there from the beginning?

I tried git rebase -i --root, but after adding .gitattributes in the root commit I got conflicts on every subsequent commit.


Solution

  • git rebase is bound to produce conflicts on every commit because of the encoding mismatch. I couldn't find a simple way to do this with git filter-repo: its --blob-callback sees the data but not the file name, which we need to match against the mask *.txt; its --commit-callback sees the file names but only provides blob IDs, so the content would have to be extracted and written back separately.

    So the following solution uses git filter-branch. I use --index-filter because it's much faster than --tree-filter (on 4,000 commits it will still be slow, alas), and the index gives us all the information we need: file names and content. What the code does: in every commit it first creates a new .gitattributes, then loops over all *.txt files recursively, converts each one from CP866 to UTF-8 and updates the index. At the end it forces a checkout of the converted files in the proper encoding. I took the main part of the code from an answer by @jthill, thanks! I found it via https://stackoverflow.com/search?q=%5Bgit-filter-branch%5D+file+content

    Before running any code please make a backup or run the code in a temporary copy of your repository!
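A minimal sketch of taking such a copy with plain `cp`, so that the rewrite can be tried on the duplicate first. The function name `backup_repo` and the paths in the usage note are made up for illustration and are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: copy a repository directory (working tree and .git)
# to a new location so the history rewrite can be tried on the copy.
backup_repo() {
    src=$1
    dst=$2
    # Refuse to overwrite an existing path.
    if [ -e "$dst" ]; then
        echo "backup_repo: $dst already exists" >&2
        return 1
    fi
    # -R copies recursively, -p preserves modes and timestamps.
    cp -Rp "$src" "$dst"
}
```

Usage would look like `backup_repo ~/src/myrepo ~/src/myrepo.bak` (placeholder paths); the rewrite is then run inside one of the two directories while the other stays untouched.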

    Here is the code:

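    #! /bin/sh
    set -e
    
    # Rewrite every commit reachable from HEAD. The filter runs once per commit
    # against that commit's index; no working tree is checked out.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter '
    set -e
    # Store .gitattributes as a blob and add it to the index of this commit
    f=.gitattributes
    updated=$(
        echo "*.txt working-tree-encoding=cp866" |
            git hash-object -w --stdin --path=$f
    )
    git update-index --add --cacheinfo 100644,$updated,$f
    
    # Re-encode every tracked *.txt blob from CP866 to UTF-8.
    # core.quotePath=false and "read -r" keep non-ASCII names and names with
    # spaces intact (unusual names that git still quotes are not handled).
    git -c core.quotePath=false ls-files "*.txt" |
    while IFS= read -r f; do
        updated=$(
            git cat-file blob ":$f" | iconv -f cp866 -t utf-8 |
                git hash-object -w --stdin --path="$f"
        )
        git update-index --add --cacheinfo 100644,$updated,"$f"
    done
    ' HEAD
    
    # Check out the files in the proper encoding (skip anything inside .git)
    find . -path ./.git -prune -o -type f -name "*.txt" -exec rm -f {} +
    git restore "*.txt"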
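One way to sanity-check the rewrite afterwards is to confirm that every *.txt blob now stored in the index is valid UTF-8. The sketch below assumes `git` and `iconv` are on PATH and that it is run from the top of the repository; the function name `check_txt_blobs` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): report tracked
# *.txt files whose blob in the index is NOT valid UTF-8.
# Returns 0 when every blob decodes cleanly, 1 otherwise.
check_txt_blobs() {
    bad=$(
        # core.quotePath=false keeps non-ASCII file names unescaped.
        git -c core.quotePath=false ls-files "*.txt" |
        while IFS= read -r f; do
            # ":path" names the blob staged in the index; a UTF-8 to UTF-8
            # conversion fails on byte sequences that are not valid UTF-8.
            if ! git cat-file blob ":$f" |
                    iconv -f utf-8 -t utf-8 >/dev/null 2>&1; then
                printf '%s\n' "$f"
            fi
        done
    )
    if [ -n "$bad" ]; then
        printf 'not valid UTF-8 in the index:\n%s\n' "$bad" >&2
        return 1
    fi
    return 0
}
```

Running `check_txt_blobs && echo OK` in the rewritten repository should print OK if the conversion covered every file; any file it lists was probably missed or was in a different encoding to begin with.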
    

    I tested the approach on my repository https://github.com/phdru/m_librarian , where I only needed to convert README.rus.txt from KOI8-R to UTF-8. The actual code I used is:

    #! /bin/sh
    set -e
    cd m_librarian
    
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter '
    set -e
    f=.gitattributes
    updated=$(
        git cat-file blob :$f |
            sed "s!/README.rus.txt encoding=utf-8!/README.rus.txt working-tree-encoding=koi8-r!" |
            git hash-object -w --stdin --path=$f
    )
    git update-index --add --cacheinfo 100644,$updated,$f #&&
    
    f=README.rus.txt
    # Convert only blobs that are not already UTF-8, i.e. that are still KOI8-R
    if ! git cat-file blob :$f | iconv -f utf-8 -t koi8-r >/dev/null 2>&1; then
        updated=$(
            git cat-file blob :$f | iconv -f koi8-r -t utf-8 |
                git hash-object -w --stdin --path=$f
        )
        git update-index --add --cacheinfo 100644,$updated,$f
    fi
    ' b4c32de..master
    
    # Checkout the file in the proper encoding
    rm README.rus.txt
    git restore README.rus.txt
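The `if ! ... iconv -f utf-8 -t koi8-r` guard in the script above converts a blob only when it does not already decode as UTF-8, which matters when part of the history was committed in UTF-8. Here is a standalone sketch of that test; the function name `needs_conversion` is made up for illustration, the byte strings are sample data, and it assumes the local `iconv` knows the `koi8-r` charset.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when stdin is NOT UTF-8 text that can
# be represented in the legacy encoding "$1", i.e. when it still needs
# converting from that legacy encoding.
needs_conversion() {
    ! iconv -f utf-8 -t "$1" >/dev/null 2>&1
}

# A Russian word as KOI8-R bytes (octal escapes): not valid UTF-8.
printf '\360\322\311\327\305\324\n' |
    needs_conversion koi8-r && echo "convert"

# The same word already encoded as UTF-8: nothing to do.
printf '\320\237\321\200\320\270\320\262\320\265\321\202\n' |
    needs_conversion koi8-r || echo "skip"
```

The check is a heuristic: it relies on KOI8-R Russian text practically never forming valid UTF-8 byte sequences, which is why the same trick may be less reliable for encodings or texts that are mostly ASCII.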