Tags: git, elasticsearch

Using Elasticsearch to search a git repository with branches


I would like to use Elasticsearch to search a git repository. That seems relatively easy if you look only at the latest state of the main branch: you index all files as documents, and you can then search through those documents to find relevant files.

But I would also like to be able to search the branches a repository might have. One easy approach is to treat each branch as its own repository and index its documents separately (either each branch into its own index, or with a field storing the branch name that can then be filtered on).

But that duplicates many documents. In most cases, a branch has mostly the same files as its parent branch; only a few files are changed. So I am asking whether there is a better way to search over documents that can exist in multiple versions. There is a main branch, and the other branches each have a path back to it (e.g., branch feature123 has been forked from branch fix321, which has been forked from main). When searching inside branch feature123, I would like the search to find files which have been modified in that branch; if a file has not been modified there, it should go one branch up, to fix321, to find the file, and failing that, to main. In other words, files modified in feature123 shadow the versions from fix321.

I do not want search results to contain duplicates (the same file found multiple times because it "exists" in multiple branches); only the most recently changed version of a file should be found. I would also like aggregations and counts to work correctly and ignore documents which have been shadowed.

(The git repository does not contain code but primarily regular text files, so other issues with code searching do not necessarily apply here.)


Solution

  • In order to make it easier and faster at search time, we need to spend some time thinking about how to index the documents in a smart way.

    The document structure

    The document structure I'm suggesting looks like this:

    {
      "branch": "main",
      "search_branches": [
        "main",
        "fix123"
      ],
      "file": "bar.txt",
      "content": "main bar content",
      "sha1": "456"
    }
    

    A quick explanation of the fields:

    • branch is the branch where the file is located
    • search_branches are all the branches in which this file can be searched, i.e. all the branches where this file has the same content
    • file is the file name
    • content is the searchable file content
    • sha1 is the git SHA1 of the file's content, which also serves as the document ID
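
    For the term filters and aggregations below to behave as expected, branch, search_branches, file and sha1 should be mapped as keyword fields and content as a text field. A possible index mapping, shown here as a sketch (the index name git matches the examples below):

    PUT git
    {
      "mappings": {
        "properties": {
          "branch":          { "type": "keyword" },
          "search_branches": { "type": "keyword" },
          "file":            { "type": "keyword" },
          "content":         { "type": "text" },
          "sha1":            { "type": "keyword" }
        }
      }
    }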

    Now let's take the example you mentioned in your comment, i.e.:

    • main branch has
      • foo.txt
      • bar.txt
    • fix123 branch has
      • foo.txt (modified)
      • bar.txt (unmodified)

    And index them using the proposed structure:

    POST git/_bulk
    {"index":{"_id":"123"}}
    {"branch": "main", "search_branches": ["main"], "file": "foo.txt", "content": "main foo content", "sha1": "123"}
    {"index":{"_id":"456"}}
    {"branch": "main", "search_branches": ["main", "fix123"], "file": "bar.txt", "content": "main bar content", "sha1": "456"}
    {"index":{"_id":"789"}}
    {"branch": "fix123", "search_branches": ["fix123"], "file": "foo.txt", "content": "fix foo content", "sha1": "789"}
    

    The searches

    This structure makes all queries very simple, without the need to resort to aggregations or field collapsing. For instance, using the sample searches you mentioned in your comment, let's search for the word content in both branches in turn:

    # if I search the main branch, I want to ignore foo.txt from fix123
    POST git/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "content": "content"
              }
            }
          ],
          "filter": [
            {
              "term": {
                "search_branches": "main"
              }
            }
          ]
        }
      }
    }
    

    The results would only contain hits from the main branch, i.e. foo.txt and bar.txt from main.

    # if I search fix123 branch, I want to search both main's bar.txt and fix123's foo.txt
    POST git/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "content": "content"
              }
            }
          ],
          "filter": [
            {
              "term": {
                "search_branches": "fix123"
              }
            }
          ]
        }
      }
    }
    

    The results would only contain documents visible from the fix123 branch, i.e. foo.txt from fix123 and bar.txt from main.
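
    Aggregations and counts work the same way: as long as the query is filtered on search_branches, shadowed documents never match, so they are automatically excluded from any aggregation. For example, a terms aggregation counting the files visible from fix123 might look like this (a sketch, assuming file is mapped as a keyword field):

    POST git/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            {
              "term": {
                "search_branches": "fix123"
              }
            }
          ]
        }
      },
      "aggs": {
        "files": {
          "terms": {
            "field": "file"
          }
        }
      }
    }

    Each file is counted exactly once, since the shadowed main version of foo.txt is filtered out before the aggregation runs.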

    Indexing logic

    So, how do we get to that structure? It's actually pretty easy. The main concept here is that each file in the tree that contains final content should be indexed as a document of its own, with search_branches containing the name of the current branch plus all branches created from the current one (and their descendants) in which the content is unchanged. By "final content", I mean content that hasn't changed in the child branches.

    This requires some indexing logic to build up the search_branches field in each document, but that's really the only difficult thing to do. And after that, you can search any branch in isolation, or even any set of branches.

    First, I would index the main branch, so that each file becomes a document of its own using the SHA1 as the ID. Each file would now have search_branches: main.

    Then I would visit one branch after another, and for each file, I would apply this logic:

    1. if the file is the same, do not index it (so no duplicate content is stored for files that haven't changed between branches), but add the branch name to the search_branches field of the document from the parent branch
    2. if the file is different, index it as a new document (since the content differs) with search_branches: [branch_name]

    At this point, you have the guarantee that all documents, in whichever branches, have some sort of differing content that you might want to search. You also have the guarantee that all files that have the same content in different branches are contained in a document with search_branches having the name of all branches where this same content is present.
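
    The per-branch logic above can be sketched in Python. This is a hypothetical illustration: the index dict stands in for the Elasticsearch index (keyed by SHA1), and parent_files/branch_files are assumed to map file paths to (sha1, content) pairs obtained from git:

```python
def index_branch(index, branch, parent_files, branch_files):
    """Apply the per-branch indexing logic described above.

    index:        dict keyed by SHA1, standing in for the Elasticsearch index
    parent_files: {path: (sha1, content)} of the parent branch
    branch_files: {path: (sha1, content)} of the branch being indexed
    """
    for path, (sha1, content) in branch_files.items():
        parent = parent_files.get(path)
        if parent is not None and parent[0] == sha1:
            # 1. unchanged: make the parent's document searchable in this branch
            index[sha1]["search_branches"].append(branch)
        else:
            # 2. modified or new: index a fresh document for this branch
            index[sha1] = {
                "branch": branch,
                "search_branches": [branch],
                "file": path,
                "content": content,
                "sha1": sha1,
            }

# Replaying the foo.txt/bar.txt example from above:
index = {}
main_files = {"foo.txt": ("123", "main foo content"),
              "bar.txt": ("456", "main bar content")}
index_branch(index, "main", {}, main_files)
fix_files = {"foo.txt": ("789", "fix foo content"),
             "bar.txt": ("456", "main bar content")}
index_branch(index, "fix123", main_files, fix_files)
# index now holds 3 documents; bar.txt (456) is searchable in both branches
```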

    The script

    Since we only index files with differing content, we can use the SHA1 fingerprint as the document ID. If we do so, performing the initial load with scripted upserts becomes very simple.

    Let's store the script containing our logic in the cluster state, so we don't have to provide it in each bulk element later:

    POST _scripts/git_load
    {
      "script": {
        "lang": "painless",
        "source": """
          if ( ctx.op == 'create' ) {
            ctx._source.putAll(params);
            ctx._source.search_branches = [params.branch];
          } else {
            ctx._source.search_branches.add(params.branch);
          }
        """
      }
    }
    

    The logic is pretty simple:

    1. if a document doesn't exist, index it
    2. if it does exist, update its search_branches field

    Then we can use the script, as follows, by providing the document content in the script params:

    POST git/_update/<sha1_id>
    {
      "upsert": {},
      "scripted_upsert": true,
      "script": {
        "id": "git_load",
        "params": {
          "branch": "main",
          "file": "foo.txt",
          "content": "main foo content",
          "sha1": "sha1_id"
        }
      }
    }
    

    Initial load

    For the initial bulk load, you simply have to iterate over all branches and files of your repository and provide one such upsert element per document:

    POST git/_bulk
    {"update":{ "_id": "123"}}
    {"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"main","file":"foo.txt","content":"main foo content","sha1":"123"}},"upsert":{}}
    {"update":{ "_id": "456"}}
    {"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"main","file":"bar.txt","content":"main bar content","sha1":"456"}},"upsert":{}}
    {"update":{ "_id": "789"}}
    {"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"fix123","file":"foo.txt","content":"fix foo content","sha1":"789"}},"upsert":{}}
    {"update":{ "_id": "456"}}
    {"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"fix123","file":"bar.txt","content":"main bar content","sha1":"456"}},"upsert":{}}
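
    Generating such a bulk body programmatically is straightforward. A minimal Python sketch (the (branch, file, content, sha1) tuples are assumed to come from walking the repository):

```python
import json

def bulk_upsert_body(docs):
    """Build an NDJSON _bulk body of scripted upserts, one per
    (branch, file, content, sha1) tuple."""
    lines = []
    for branch, file, content, sha1 in docs:
        lines.append(json.dumps({"update": {"_id": sha1}}))
        lines.append(json.dumps({
            "scripted_upsert": True,
            "script": {
                "id": "git_load",
                "params": {"branch": branch, "file": file,
                           "content": content, "sha1": sha1},
            },
            "upsert": {},
        }))
    # the _bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"

body = bulk_upsert_body([
    ("main", "foo.txt", "main foo content", "123"),
    ("fix123", "bar.txt", "main bar content", "456"),
])
```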
    

    What about updates?

    No file lives forever without being modified, so let's talk about updates. Whenever a file changes, we need to update the document from the parent branch and create a new one for the branch in which the file has been modified.

    Let's take the example of bar.txt, which we modify in branch fix123 so that its content differs from main. We need to issue two upserts: one to modify the parent document (i.e. to remove fix123 from its search_branches) and another one to create the child document.

    For that, we need both the old SHA1 and the new one, and we modify the script a little bit: we add a case that checks whether the remove_branch parameter is present and, if so, removes that branch from the search_branches field of the existing document. The rest stays the same:

    POST _scripts/git_load
    {
      "script": {
        "lang": "painless",
        "source": """
          if ( ctx.op == 'create' ) {
            ctx._source.putAll(params);
            ctx._source.search_branches = [params.branch];
          } else if (params.remove_branch != null) {
            ctx._source.search_branches.removeIf(elem -> elem.equals(params.remove_branch));
          } else {
            ctx._source.search_branches.add(params.branch);
          }
        """
      }
    }
    

    We can now issue the two following upserts to create the new bar.txt document in the fix123 branch. Also note that in order to update the old document, you don't need to send anything other than the remove_branch parameter.

    POST git/_bulk
    {"update":{ "_id": "456"}}
    {"scripted_upsert":true,"script":{"id":"git_load","params":{"remove_branch": "fix123"}},"upsert":{}}
    {"update":{ "_id": "xyz"}}
    {"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"fix123","file":"bar.txt","content":"fix bar content","sha1":"xyz"}},"upsert":{}}
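
    This pair of upserts can likewise be generated in one place. A hypothetical Python helper (old_sha1 and new_sha1 are assumed to come from comparing the two file versions in git):

```python
import json

def modified_file_upserts(branch, file, content, old_sha1, new_sha1):
    """Build the two _bulk entries for a file modified in `branch`:
    first remove `branch` from the parent document's search_branches,
    then create the new child document via the git_load script."""
    return "\n".join([
        json.dumps({"update": {"_id": old_sha1}}),
        json.dumps({"scripted_upsert": True,
                    "script": {"id": "git_load",
                               "params": {"remove_branch": branch}},
                    "upsert": {}}),
        json.dumps({"update": {"_id": new_sha1}}),
        json.dumps({"scripted_upsert": True,
                    "script": {"id": "git_load",
                               "params": {"branch": branch, "file": file,
                                          "content": content,
                                          "sha1": new_sha1}},
                    "upsert": {}}),
    ]) + "\n"

body = modified_file_upserts("fix123", "bar.txt", "fix bar content",
                             "456", "xyz")
```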
    

    That's pretty much it. If your repository is not that big, I would simply recreate the index from scratch on every update; otherwise, the initial load + bulk upserts approach works fine too, since you can always decide to recreate your index from scratch at any time.