Tags: ruby, github-linguist

How can I detect the programming language of a snippet?


I have a string containing some text. The text may or may not be code. Using GitHub's Linguist, I have been able to detect the likely programming language, but only if I give it a list of candidates.

#!/usr/bin/env ruby
# test_linguist_1.rb

require 'linguist'

s = "int main(){}"
candidates = [Linguist::Language["Python"], Linguist::Language["C"], Linguist::Language["Ruby"]]
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect

Execution:

$ ./test_linguist_1.rb 
[#<Linguist::Language name=C>, #<Linguist::Language name=Python>, #<Linguist::Language name=Ruby>]

Notice that I gave it a list of candidates. How can I avoid having to define a list of candidates?

I tried the following:

#!/usr/bin/env ruby
# test_linguist_2.rb

require 'linguist'

s = "int main(){}"
candidates = Linguist::Language.all
# I also tried only the popular languages:
# candidates = Linguist::Language.popular
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect

Execution:

$ ./test_linguist_2.rb 
/home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:131:in `token_probability': undefined method `[]' for nil:NilClass (NoMethodError)
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:120:in `block in tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `inject'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:105:in `block in classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:78:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:20:in `call'
from ./test_linguist.rb:21:in `block in <main>'
from ./test_linguist.rb:14:in `each'
from ./test_linguist.rb:14:in `<main>'

Additional:

  1. Is this the best way to use GitHub Linguist? FileBlob is an alternative to Blob, but it requires writing my string to a file. That is problematic for two reasons: (1) it is slow, and (2) the chosen file extension then guides Linguist, and we do not know the correct extension.
  2. Are there better tools for this? GitHub Linguist perhaps works well on files but not on strings.

Solution

  • Taking a quick look at the source code of Linguist, it appears to use a number of strategies to determine the language, and it calls each strategy in turn. Classifier is the last strategy to be called, by which time it has (hopefully) picked up language "candidates" (as you've discovered for yourself) from the prior strategies. So I think for the particular sample you've shared with us, you have to pass a filename of some kind, even if a file doesn't actually exist, or a list of language candidates. If neither is an option for you, this may not be a feasible solution for your problem.

    $ ruby -r linguist -e 'p Linguist::Blob.new("foo.c", "int main(){}").language'
    #<Linguist::Language name=C>
    

    It returns nil without a filename, and #<Linguist::Language name=C++> with "foo.cc" and the same code sample.

    The good news is that you picked a really bad sample to test with. :-) Other strategies look at modelines and shebangs, so more complex samples have a better chance at succeeding. Take a look at these:

    $ ruby -r linguist -e 'p Linguist::Blob.new("", "#!/usr/bin/env perl
    print q{Hello, world!};
    ").language'
    #<Linguist::Language name=Perl>
    $ ruby -r linguist -e 'p Linguist::Blob.new("", "# vim: ft=ruby
    puts %q{Hello, world!}
    ").language'
    #<Linguist::Language name=Ruby>
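
The "each strategy in turn" behaviour described above can be sketched in plain Ruby. This is a toy model, not Linguist's code: the lookup tables, the lambdas, and the `detect` method are all hypothetical stand-ins, and the real strategies are considerably more thorough. It only illustrates why a bare "int main(){}" with no filename falls through with nothing, while a filename, shebang, or modeline gives an early answer.

```ruby
# Toy model of a strategy chain. All names below are hypothetical;
# none of this is Linguist's actual implementation.
MODELINE_FILETYPES   = { "ruby" => "Ruby", "perl" => "Perl", "c" => "C" }.freeze
SHEBANG_INTERPRETERS = { "perl" => "Perl", "ruby" => "Ruby", "python" => "Python" }.freeze
EXTENSIONS           = { ".c" => ["C"], ".cc" => ["C++"], ".h" => ["C", "C++"] }.freeze
HINT_TOKENS          = { "C" => %w[int main printf], "C++" => %w[class std template] }.freeze

# Each strategy takes (name, data, candidates) and returns an array of language names.
STRATEGIES = [
  # Editor modeline, e.g. "# vim: ft=ruby"
  lambda do |_name, data, _candidates|
    filetype = data[/vim:.*\b(?:ft|filetype)=(\w+)/, 1]
    Array(MODELINE_FILETYPES[filetype])
  end,
  # Shebang on the first line, e.g. "#!/usr/bin/env perl"
  lambda do |_name, data, _candidates|
    first_line  = data.lines.first.to_s
    interpreter = first_line[/\A#!\s*(?:\S*\/env\s+)?(?:\S*\/)?(\w+)/, 1]
    Array(SHEBANG_INTERPRETERS[interpreter])
  end,
  # File extension, which can be ambiguous (".h" maps to two languages here)
  lambda do |name, _data, _candidates|
    EXTENSIONS.fetch(File.extname(name), [])
  end,
  # Last resort: a crude "classifier" that only ranks candidates it was handed
  lambda do |_name, data, candidates|
    next [] if candidates.empty?
    words = data.scan(/\w+/)
    [candidates.max_by { |lang| (HINT_TOKENS.fetch(lang, []) & words).size }]
  end
].freeze

def detect(name, data)
  candidates = []
  STRATEGIES.each do |strategy|
    found = strategy.call(name, data, candidates)
    return found.first if found.size == 1    # unambiguous answer: stop here
    candidates = found unless found.empty?   # ambiguous: narrow down and continue
  end
  nil                                        # no strategy could decide
end

p detect("foo.c", "int main(){}")                                # => "C"
p detect("", "int main(){}")                                     # => nil
p detect("", "#!/usr/bin/env perl\nprint q{Hello, world!};\n")   # => "Perl"
p detect("foo.h", "int main(){}")                                # => "C"
```

In this toy, the last step has nothing to rank when no earlier step produced candidates, which is the same shape of problem as the nil result above.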
    

    However, if there isn't a shebang or a modeline, we're still out of luck. It turns out that there's a training dataset that is computed and serialized to disk at install time, and automatically loaded during language detection. Unfortunately, I think there's a bug in the library that is preventing this training dataset from being used if there aren't any candidates by the time it gets to this step. Fixing the bug lets me do this:

    $ ruby -Ilib -r linguist -e 'p Linguist::Blob.new("", "int main(){}").language'
    #<Linguist::Language name=XC>
    

    (I don't know what XC is, but adding some other tokens to the string such as #include <stdio.h> or int argc, char* argv[] gives C. I'm sure most of your samples will have more meat to analyze.)

    It's a really simple fix and I've submitted a PR for it. You can use my fork of the gem in the meantime if you'd like. Otherwise, we'll need to look into using Linguist::Classifier directly, as you've started exploring, but that has the potential to get messy.

    To use my fork, add this to your Gemfile (or modify the existing entry):

    gem 'github-linguist',
      require: 'linguist',
      git: 'https://github.com/mwpastore/linguist.git',
      branch: 'fix-no-candidates'
    

    I'll try to come back and update this answer when the PR has been merged and a new version of the gem has been released with the fix. If I have to do any force-pushes to meet the repository guidelines and/or make the maintainers happy, you may have to run bundle update to pick up the changes. Let me know if you have any questions.
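
As for the NoMethodError in the second script in the question: my guess, which I have not verified against the 4.8.9 source, is that Linguist::Language.all includes languages that have no entry in the training data, so the per-language token table is nil when the classifier tries to index it. The sketch below reproduces that failure mode with a toy classifier and shows the obvious workaround of filtering candidates down to languages that were actually trained. Everything in it (`TOKENS`, `classify`, `trained`) is hypothetical; in Linguist the equivalent filter would presumably be against the languages present in the data returned by `Linguist::Samples.cache`, but treat that as an assumption.

```ruby
# Hypothetical training data: token counts per language name.
TOKENS = {
  "C"    => { "int" => 5, "main" => 3, "printf" => 4 },
  "Ruby" => { "puts" => 6, "def" => 5, "end" => 8 }
}.freeze

def token_probability(token, language)
  counts = TOKENS[language]            # nil for a language with no training data
  total  = counts.values.sum.to_f      # raises NoMethodError when counts is nil
  (counts[token] || 0.1) / total       # crude smoothing for unseen tokens
end

# Sum of log-probabilities of each token under the given language.
def score(tokens, language)
  tokens.sum { |token| Math.log(token_probability(token, language)) }
end

# Rank candidates from most to least likely.
def classify(text, candidates)
  tokens = text.scan(/\w+/)
  candidates.sort_by { |language| -score(tokens, language) }
end

# Keep only candidates that the training data knows about.
def trained(candidates)
  candidates.select { |language| TOKENS.key?(language) }
end

everything = %w[C Ruby Brainfuck]      # "Brainfuck" has no training data here

begin
  classify("int main(){}", everything)
rescue NoMethodError => e
  puts "untrained candidate breaks scoring: #{e.class}"
end

p classify("int main(){}", trained(everything))   # => ["C", "Ruby"]
```

If the guess is right, filtering the candidate list this way before calling the classifier would avoid the crash without needing a filename, although it would still be the messy route compared with a fixed gem.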