
How can I make .gitignore work with file names in other languages (like Korean)?


I tried to add a file with a Korean name (or a name in another language) to .gitignore, but it didn't work.

In .gitignore:

#ignore 예제파일/ (=exampleFile/)
예제파일/

Any suggestions?


Solution

  • iBug's comment has one of the keys to making this work. The other is to be sure that the file is untracked.

    Untracked files are those that are not in the index

    The index, which is also called the staging area or sometimes the cache, controls whether a file is tracked or untracked. The index is also what Git uses when making new commits, so every file that is in the index goes into the next commit you make, once you make it. To see a list of every file in the index, along with its staging information, use git ls-files --stage (note that this can be a very long list!): each file's path name appears at the end of its output line.

    Git reports an untracked file when, in the process of scanning through a directory, it comes across a file whose path name is (a) not already in the index and (b) not listed in an ignore-or-exclude file. (There is some special handling for directories here, but let's leave that for later.)

    In other words, any file in the index is tracked. A file that is not in the index is untracked, and some untracked files are also ignored. Crucially, a tracked file is never ignored.
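
    The three states above can be sketched in a few lines of Python. This is only an illustration of the ordering (index first, ignore rules second): the `classify` function, the in-memory `index` set, and the use of `fnmatch` as a stand-in for real .gitignore matching are all assumptions made for the example, not how Git is implemented.

```python
from fnmatch import fnmatch


def classify(path, index, ignore_patterns):
    """Return 'tracked', 'ignored', or 'untracked' for one path name.

    index: a set of path names standing in for Git's index (hypothetical).
    ignore_patterns: simplified glob patterns; real .gitignore rules are richer.
    """
    if path in index:
        return "tracked"          # the index is consulted first
    if any(fnmatch(path, pattern) for pattern in ignore_patterns):
        return "ignored"          # only untracked files can be ignored
    return "untracked"


index = {"README.txt", "build/output.log"}
patterns = ["*.log"]

for p in ["README.txt", "build/output.log", "debug.log", "notes.md"]:
    print(p, "->", classify(p, index, patterns))
```

    Note that build/output.log matches the ignore pattern yet still comes out as tracked, because it is already in the index. To make such a file ignorable in a real repository, it first has to be removed from the index (for example with git rm --cached).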

    Path names are UTF-8 strings

    For files with simple ASCII style names like README.txt or Documentation/RelNotes/2.9.5.txt, the path name is pretty obvious. It is encoded as a byte-string: the R in README or RelNotes is a byte with value 82 (in decimal anyway: it is 0x52 in hexadecimal or 0122 in octal). But for other characters, such as the ö in schön or the é in agréable, or of course your 예제파일 (which I had to cut-and-paste here :-) ), there is a problem with encoding.

    Git chooses to assume that all file names are encoded in UTF-8. Your operating system may choose some other encoding internally (for instance, Windows uses UTF-16 in a number of its file systems), but Git assumes UTF-8, which has numerous advantages including not requiring a byte order mark (BOM). This does not solve all problems, since there are still issues with normalization, but it points us to the answer we want for .gitignore files.

    (Git also uses this UTF-8 form in the index.)
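
    To make the normalization issue concrete, here is a small sketch using Python's standard unicodedata module. It shows that the same visible name has two different UTF-8 byte sequences depending on whether it is composed (NFC) or decomposed (NFD). Which form a given file system hands back is outside the scope of this example; the point is only that a byte-wise comparison treats them as different names.

```python
import unicodedata

# The UTF-8 bytes of the directory name from the question.
name = b'\xec\x98\x88\xec\xa0\x9c\xed\x8c\x8c\xec\x9d\xbc'.decode('utf-8')

nfc = unicodedata.normalize("NFC", name)   # precomposed Hangul syllables
nfd = unicodedata.normalize("NFD", name)   # decomposed into individual jamo

# Code point count and UTF-8 byte count for each form.
print(len(nfc), len(nfc.encode("utf-8")))
print(len(nfd), len(nfd.encode("utf-8")))

# Same text to a human reader, different bytes to a byte-wise comparison.
print(nfc.encode("utf-8") == nfd.encode("utf-8"))
```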

    When Git goes to read a .gitignore file, it opens it as a stream of bytes, which should contain the UTF-8 encoding for each file name, terminated by newlines. Then, when Git goes to read a directory to extract file (or sub-directory) names from the operating system, Git will convert these names to UTF-8 strings. If those file names represent untracked files, Git will compare the resulting UTF-8 strings with the UTF-8-encoded strings in each line in the .gitignore file.

    If the UTF-8 encoded strings match, the untracked file's name is ignored (or un-ignored if prefixed with !, since of course all the usual rules apply).

    If the contents of the .gitignore file are not UTF-8 encoded strings, the attempt to ignore will fail, because a UTF-8 representation of 예제파일 (b'\xec\x98\x88\xec\xa0\x9c\xed\x8c\x8c\xec\x9d\xbc' in Python, for instance) will not match a UTF-16LE representation of the same characters:

    >>> fn = b'\xec\x98\x88\xec\xa0\x9c\xed\x8c\x8c\xec\x9d\xbc'
    >>> fn
    b'\xec\x98\x88\xec\xa0\x9c\xed\x8c\x8c\xec\x9d\xbc'
    >>> fn.decode('utf-8')
    '예제파일'
    >>> fn.decode('utf-8').encode('utf-16le')
    b'\x08\xc6\x1c\xc8\x0c\xd3|\xc7'
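
    Building on the session above, a quick way to check how a .gitignore was actually saved is to look at its raw bytes. The helper below is a hypothetical sketch (the name `diagnose` and its return strings are made up for this example); it only reports what it sees and makes no claim about how any particular Git version treats a BOM.

```python
import codecs


def diagnose(data: bytes) -> str:
    """Roughly classify the encoding of raw .gitignore bytes (illustrative only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid utf-8"
    return "utf-8"


pattern = "예제파일/\n"

print(diagnose(pattern.encode("utf-8")))      # what Git expects to find
print(diagnose(pattern.encode("utf-16")))     # Python writes a BOM for "utf-16"
print(diagnose(pattern.encode("utf-16le")))   # no BOM, but not UTF-8 either
```

    For a real file, the bytes could be read with open(".gitignore", "rb").read() and passed to the same function. If the result is anything other than plain UTF-8, re-saving the file as UTF-8 in your editor is the first thing to try.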
    

    Side note: directories and files

    Git stores only files in a repository. This creates a bit of tension between directories—which must exist to hold the files—and the files themselves. One side effect is that you can't store an empty directory in a Git commit (see How can I add an empty directory to a Git repository?), but another comes up with using .gitignore.

    The operating system's facilities for finding files generally require that you start by looking inside a directory (or "folder", if you prefer that metaphor). This directory has a name inside the file system. Git will open the directory, by its name, and read through its contents, one entry at a time. Each entry will list either a file's name, or another directory's name. Git can check each such file name (after combining it with the parent directory's name and a slash, giving dir/README.txt for instance) against the index (to see if it's tracked) and then, if not tracked, against all ignore lists (to see if Git should complain about it, or ignore it).

    But searching inside a directory is relatively slow. Suppose that Git has a path like a/b/c/d that represents a directory. Git can and does first look in the index to see if there are any files already tracked within a/b/c/d. If so, Git must read the directory. But if not, Git can now check all the ignore lists to see if a/b/c/d itself is ignored.

    If a/b/c/d is ignored, Git is not forced to read its contents! If there are millions of files within a/b/c/d—whether in subdirectories or not—this is a major time savings. So Git does that, too. If Git never looks inside a/b/c/d, it will never find any untracked files within a/b/c/d. This is why you must explicitly un-ignore directories in some cases: to force Git to look inside them for untracked files.
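
    The shortcut described above can be sketched as a recursive walk over an in-memory tree. Everything here is an assumption made for illustration: the nested-dict `tree`, the `index` set, the `is_ignored` callback, and the `opened` list that records which directories get read. It is a toy model of the idea, not Git's actual directory scanner.

```python
def scan(tree, index, is_ignored, prefix="", opened=None):
    """Yield untracked, non-ignored file paths from a nested-dict tree.

    A directory is pruned (never opened) when nothing under it is tracked
    and the directory itself is ignored.
    """
    for name, child in sorted(tree.items()):
        path = prefix + name
        if isinstance(child, dict):                      # a directory
            has_tracked = any(p.startswith(path + "/") for p in index)
            if not has_tracked and is_ignored(path):
                continue                                 # skip without reading
            if opened is not None:
                opened.append(path)                      # record that we read it
            yield from scan(child, index, is_ignored, path + "/", opened)
        elif path not in index and not is_ignored(path):
            yield path                                   # untracked, not ignored


tree = {
    "README.txt": None,
    "a": {"b": {"c": {"d": {"e": {"important.file": None}, "junk.tmp": None}}}},
}
index = {"README.txt"}


def d_ignored(path):
    # Stand-in for an ignore list containing just "a/b/c/d".
    return path == "a/b/c/d" or path.startswith("a/b/c/d/")


opened = []
found = list(scan(tree, index, d_ignored, opened=opened))
print(found)    # no untracked files reported
print(opened)   # a/b/c/d and a/b/c/d/e were never opened
```

    Because a/b/c/d is pruned before it is read, important.file is never even seen, which is the behavior the un-ignore rules have to work around.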

    (One might think that listing, in a .gitignore, something like:

    a/b/c/d
    !a/b/c/d/e/important.file
    

    would be enough to tell Git: yes, ignore everything within a/b/c/d, but still look inside d for d/e and subsequently d/e/important.file since you will have to look inside it to un-ignore such a file. And Git may become this smart at some point, but historically, it has not been. So the rule for this is to list it as:

    a/b/c/d/*
    !a/b/c/d/e
    a/b/c/d/e/*
    !a/b/c/d/e/important.file
    

    which overrides the "ignore everything" rule for a/b/c/d/e: a/b/c/d itself is not ignored, so Git opens and reads it. Then a/b/c/d/any is ignored unless any is explicitly e, which is not ignored. So Git opens a/b/c/d/e and reads it. Anything in a/b/c/d/e is ignored except for important.file.)
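
    The difference between the two rule sets can be checked with a deliberately simplified model of the matching. The assumptions, all mine and not Git's real implementation: patterns are anchored at the repository root, * matches within a single path component only, the last matching rule wins, and a path under an ignored directory can never be re-included. The helper names (`_to_regex`, `_last_match`, `is_ignored`) are hypothetical.

```python
import re


def _to_regex(pattern):
    # '*' matches any run of characters except '/'; everything else is literal.
    body = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + r"\Z")


def _last_match(path, rules):
    """Return True (ignore), False (un-ignore), or None (no rule matched)."""
    verdict = None
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if _to_regex(pattern).match(path):
            verdict = not negated          # later rules override earlier ones
    return verdict


def is_ignored(path, rules):
    parts = path.split("/")
    # Walk from the top-level directory down to the path itself: once any
    # level is ignored, nothing beneath it is ever examined.
    for i in range(1, len(parts) + 1):
        if _last_match("/".join(parts[:i]), rules):
            return True
    return False


naive = ["a/b/c/d", "!a/b/c/d/e/important.file"]
fixed = ["a/b/c/d/*", "!a/b/c/d/e", "a/b/c/d/e/*", "!a/b/c/d/e/important.file"]

target = "a/b/c/d/e/important.file"
print(is_ignored(target, naive))                        # still ignored
print(is_ignored(target, fixed))                        # un-ignored
print(is_ignored("a/b/c/d/e/other.file", fixed))        # ignored
print(is_ignored("a/b/c/d/junk.tmp", fixed))            # ignored
```

    In a real repository, git check-ignore -v with a path is the reliable way to see which rule, if any, applies to that path.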