How does GitHub figure out a project's language?


I was recently working on a GitHub project in both JavaScript and C++, and noticed that GitHub tagged the project as C++. If you have to pick a single language, this is probably the correct designation since the C++ code is compiled as a JavaScript library, but this made me wonder... how does GitHub figure out which language to tag each project with?


Solution

  • Update April 2013, by nuclearsandwich (GitHub support team or "supportocat"):

    If your desired language is not receiving syntax highlighting you can contribute to the Linguist library to add it.


    (Original answer, Oct. 2012)

    This thread on GitHub support explains it:

    It just sums up file sizes for each extension. Largest one "wins".

    We'd like to avoid opening files up and parsing their content, as both would slow down the process... but that might be the only method of resolving conflicts like this one.

    Since this is not 100% accurate, that has led some to add:

    I, too, would vote for a simple manual-override switch for the cases where the guess is wrong.


    Note: as Mark Rushakoff mentions in his answer (upvoted), the guessing has improved since then thanks to the Linguist project (open-sourced in June 2011).
    There are still open issues, though: see GitHub Linguist Issues.
    See here for more details:

    Once the language has been detected, it is passed to Albino, a Pygments wrapper, which does the actual syntax highlighting.

    You can also add Linguist directives in a .gitattributes file to adjust how files are classified.
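
To make the "sum file sizes, largest one wins" heuristic quoted above concrete, here is a minimal Python sketch. It is not Linguist's actual code: the function name guess_primary_language, the tiny EXTENSION_TO_LANGUAGE table, and the sample repository are all made up for illustration, and grouping several extensions under one language is an assumption on my part.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny extension table; NOT Linguist's real data.
EXTENSION_TO_LANGUAGE = {
    ".js": "JavaScript",
    ".cpp": "C++",
    ".h": "C++",
    ".py": "Python",
}


def guess_primary_language(file_sizes):
    """Return the language whose files occupy the most bytes, or None.

    file_sizes maps a repository-relative path to its size in bytes.
    Files with an unknown extension are ignored.
    """
    totals = Counter()
    for path, size in file_sizes.items():
        extension = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(extension)
        if language is not None:
            totals[language] += size
    if not totals:
        return None
    # "Largest one wins": pick the language with the biggest byte total.
    return max(totals.items(), key=lambda item: item[1])[0]


# Illustrative repository resembling the mixed JavaScript/C++ case in the question.
repo = {
    "src/engine.cpp": 48_000,
    "src/engine.h": 6_000,
    "lib/bindings.js": 12_000,
    "README.md": 3_000,
}
print(guess_primary_language(repo))  # prints "C++"
```

Because only sizes and extensions are consulted, no file is ever opened, which matches the concern in the support thread about keeping detection fast. It also shows why the guess can be wrong: a large vendored or generated file can outweigh the code you actually wrote.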