Tags: python, git, touch, mv

How to transfer contents from __init__.py in git (and maintain history) to another file whilst still keeping empty __init__.py


I created an import scheme in which the modules imported from __init__.py, rather than __init__.py importing from its modules.

To fix this I ran:

$ git mv package/__init__.py package/utils.py

This looked correct:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    renamed:    package/__init__.py -> package/utils.py

However if I run the following:

$ touch package/__init__.py

This is what I see:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   package/__init__.py
    new file:   package/utils.py

How can I get git to do the following?

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   package/utils.py
    new file:   package/__init__.py

Solution

    TL;DR

    You can make two commits if you like: one that contains only the rename, and a second that adds the new, empty __init__.py. The benefit is small, and it comes with a cost (both are described below), so it is your choice.

    Long

    Git has no file history. Git has commits; the commits are the history.

    Commits themselves are relatively simple: each one has a full snapshot of every file, plus some metadata containing things like the name and email address of the author of the commit. The metadata of any one commit includes the raw hash ID(s) of any earlier commit(s). Most commits, called ordinary commits, have one earlier commit, and that one earlier commit also has a snapshot and metadata, which points to one more still-earlier commit, and so on. This is how the snapshots-and-metadata are the history.
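The "snapshots plus parent links" idea can be sketched as a toy model. This is only an illustration, not Git's actual object format: the Commit class, the history function, and the commit IDs "c1" and "c2" are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Commit:
    """Toy commit: a full snapshot plus metadata (not Git's real format)."""
    commit_id: str                   # stands in for the raw hash ID
    author: str                      # part of the metadata
    snapshot: Dict[str, str]         # path -> full file contents
    parents: Tuple["Commit", ...] = ()


def history(tip):
    """Walk first-parent links from the tip back to the root commit."""
    ids = []
    current = tip
    while current is not None:
        ids.append(current.commit_id)
        current = current.parents[0] if current.parents else None
    return ids


root = Commit("c1", "A U Thor <author@example.com>",
              {"package/__init__.py": "def helper():\n    return 42\n"})
second = Commit("c2", "A U Thor <author@example.com>",
                {"package/utils.py": "def helper():\n    return 42\n"},
                parents=(root,))

print(history(second))  # ['c2', 'c1']
```

Note that nothing in this model records "utils.py used to be __init__.py". Any such statement has to be worked out later by comparing the two snapshots.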

    With that in mind, note that git log -p or git show shows an ordinary commit by:

    1. displaying (the interesting part(s) of) its metadata, with formatting; then
    2. showing what changed in that commit.

    In order to achieve item 2, Git actually extracts both the commit and its parent to a temporary area (in memory), and then compares the two sets of snapshot files.[1] This comparison takes the form of a diff (git diff), i.e., the difference between two snapshots.

    The git status command also runs git diff. In fact, it runs git diff twice, once to compare the current (aka HEAD) commit to Git's index—your proposed next commit, resulting from any git add updates—and again to compare Git's index to your working tree, in case there are things you forgot to git add. (This form of diff uses at least one snapshot that's not saved in a commit, and one of the two forms uses real files, which takes more work than using Git's shortcut hash ID tricks. But the end result is the same.)
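The two comparisons that git status makes can be sketched with plain dictionaries. This is a simplified model under stated assumptions: changed_paths and toy_status are hypothetical helpers, each "snapshot" is just a mapping from path to contents, and untracked files (which real Git reports separately) are ignored.

```python
def changed_paths(left, right):
    """Paths whose contents differ between two path -> contents mappings."""
    return sorted(p for p in left.keys() | right.keys()
                  if left.get(p) != right.get(p))


def toy_status(head, index, worktree):
    """Model the two diffs: HEAD vs index, then index vs working tree."""
    return {
        "staged": changed_paths(head, index),        # "Changes to be committed"
        "unstaged": changed_paths(index, worktree),  # "Changes not staged for commit"
    }


# A file was edited and added, then edited again without a second `git add`.
head = {"package/utils.py": "v1"}
index = {"package/utils.py": "v2"}
worktree = {"package/utils.py": "v3"}

print(toy_status(head, index, worktree))
# {'staged': ['package/utils.py'], 'unstaged': ['package/utils.py']}
```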

    When Git runs this kind of diff, it can—and now, by default, will—look for renamed files. Its method of finding these renames is imperfect, though. What it does is this:

    • List out all the files on the left ("before" or "old version").
    • List out all the files on the right ("after" or "new version").
    • If there is a pair of files on left and right with the same name, pair those up: they must be the same file.
    • Take all the leftover, unpaired names. Some of these might be renames. Check all the left-side files against all the right-side files.[2] If a left-side file is "sufficiently similar" to a right-side file, pair up the best matches. (100%-identical matches go faster here in most cases, and reduce the remaining pile of unpaired names, so Git always does this first.)

    When you ran:

    git mv package/__init__.py package/utils.py
    

    the setup was perfect for Git: every other file matched 100% left and right, and the only names left over were __init__.py on the left side and utils.py on the right side, with contents that matched 100%. So that must be a rename! (Strictly speaking, these files are named package/__init__.py and package/utils.py: Git considers the whole thing, including the slashes, to be a file name. But it's shorter for me to leave out the package/, and you probably think of these as files-in-a-folder or files-in-a-directory yourself.)

    As soon as you created (and added) a new file named __init__.py, however, Git had both left-side and right-side files named __init__.py, plus one leftover right-side file named utils.py. So Git paired up the two files with the same name, reporting that one as modified, and was left with a single right-side-only file that could not be paired with anything, which it reported as a new file.
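The pairing procedure described above, and why the extra __init__.py defeats it, can be sketched as follows. This is a rough model, not Git's implementation: toy_diff is a hypothetical function, and difflib's ratio stands in for Git's own similarity measure. The 50% default threshold does mirror Git's default for rename detection.

```python
from difflib import SequenceMatcher


def toy_diff(old, new, threshold=0.5):
    """Classify changes between two snapshots (dicts of path -> contents)."""
    changes = []
    # Step 1: a name present on both sides is assumed to be the same file.
    for path in sorted(old.keys() & new.keys()):
        if old[path] != new[path]:
            changes.append(("modified", path))
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    # Step 2: only the leftover, unpaired names are candidates for renames.
    for src in removed[:]:
        best, best_score = None, threshold
        for dst in added:
            if old[src] == new[dst]:
                score = 1.0  # identical contents: a 100% match
            else:
                score = SequenceMatcher(None, old[src], new[dst]).ratio()
            if score >= best_score:
                best, best_score = dst, score
        if best is not None:
            changes.append(("renamed", src + " -> " + best))
            removed.remove(src)
            added.remove(best)
    changes += [("deleted", path) for path in removed]
    changes += [("new file", path) for path in added]
    return changes


CODE = "def helper():\n    return 42\n"
head = {"package/__init__.py": CODE}

after_mv = {"package/utils.py": CODE}
print(toy_diff(head, after_mv))
# [('renamed', 'package/__init__.py -> package/utils.py')]

after_touch = {"package/__init__.py": "", "package/utils.py": CODE}
print(toy_diff(head, after_touch))
# [('modified', 'package/__init__.py'), ('new file', 'package/utils.py')]
```

In the second call, __init__.py is consumed by the same-name pairing in step 1, so it never reaches the rename search in step 2.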

    If you make a new commit now, with this situation, git diff will continue to find things set up this way, at least until some mythical future Git is smart enough to notice that, even though the two files have the same name, a diff that says "rename and then create anew" is somehow superior.[3]

    If, however, you first make a commit that contains only the rename, and then create a new __init__.py file (so that the package works again) and commit that as a second commit, git log -p and git show will detect the rename when they show the first of these two commits. The upside of doing this is that git log --follow, which goes back one commit at a time and works by changing the name it's looking for whenever it detects a rename, will work.

    The downside of doing this is that you will have one commit that is deliberately broken. You should probably note this in its commit message. If you have to do this sort of thing often, and the commit messages consistently mark such commits, you can automatically skip such commits during git bisect by writing your bisect test script to check for these marks.
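The difference between one commit and two can be sketched with a toy version of name-following. This is an assumption-laden model rather than how git log --follow is implemented: find_rename and follow are hypothetical helpers, history is a plain list of snapshots (oldest first), and only exact-content renames are recognized.

```python
def find_rename(old, new, path):
    """Return the old name of `path` if this step is an exact rename, else None."""
    if path in old:
        return None  # same name on both sides: paired by name, no rename search
    for candidate, contents in old.items():
        if candidate not in new and contents == new[path]:
            return candidate
    return None


def follow(snapshots, path):
    """Walk newest to oldest, switching names whenever a rename is detected."""
    trail = []
    for i in range(len(snapshots) - 1, -1, -1):
        new = snapshots[i]
        old = snapshots[i - 1] if i > 0 else {}
        if path not in new:
            break
        if old.get(path) != new[path]:
            trail.append((i, path))  # this commit touched the file
        previous = find_rename(old, new, path)
        if previous is not None:
            path = previous          # keep following under the old name
        elif path not in old:
            break                    # looks like the file was created here
    return trail


CODE = "def helper():\n    return 42\n"
original = {"package/__init__.py": CODE}
renamed_only = {"package/utils.py": CODE}
final = {"package/__init__.py": "", "package/utils.py": CODE}

# One commit: rename and new empty __init__.py together.
print(follow([original, final], "package/utils.py"))
# [(1, 'package/utils.py')]

# Two commits: rename first, then add the empty __init__.py.
print(follow([original, renamed_only, final], "package/utils.py"))
# [(1, 'package/utils.py'), (0, 'package/__init__.py')]
```

With a single commit the trail stops where utils.py appears to be created. With the intermediate rename-only commit, the trail continues back into the history of the old __init__.py.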


    [1] Technically, Git gets to compare just the hash IDs of trees and blobs, which makes this go very fast in most cases.

    [2] This checking is very expensive, computationally, so Git has a number of shortcuts, and also a cutoff limit where it just gives up. You can tweak some of these settings.

    [3] If some future git diff is this smart, the future Git author will have to consider whether this might break some scripts. Fortunately git diff is a porcelain command, not a plumbing command, but git diff-tree and the other plumbing commands will need new options.