I would like to use Elasticsearch to search a git repository. That seems relatively easy if you look only at the latest main branch - you index all files as documents and then you can search through those documents to find relevant files.
But I would also like to be able to search the branches a repository might have. One easy approach is to treat each branch as its own repository and index its documents separately (each branch into its own index, or with a field storing the branch name to filter on).
But that duplicates many documents: in most cases a branch has most files in common with its parent branch, and only a few files are changed. So I am asking if there is a better way to search over documents which can exist in multiple versions. There is a main branch and then there are other branches which have a path to the main branch (e.g., branch feature123 has been forked from branch fix321, which has been forked from branch main). When I search inside branch feature123, I would like it to find files which have been modified in that branch; if a file has not been modified in that branch, the search goes one branch up to fix321 to find the file, and if it is not there either, it goes up to main. So the feature123 branch shadows files from fix321 if they have been modified there.
I do not want search results to contain duplicates (the same file found multiple times because it "exists" in multiple branches - only the latest changed version should be found). Also, I would like aggregations and counts to work as well and to ignore documents which have been shadowed.
(The git repository does not contain code but primarily regular text files, so other issues with code searching do not necessarily apply here.)
In order to make it easier and faster at search time, we need to spend some time thinking about how to index the documents in a smart way.
The document structure I'm suggesting looks like this:
{
  "branch": "main",
  "search_branches": [
    "main",
    "fix123"
  ],
  "file": "bar.txt",
  "content": "main bar content",
  "sha1": "123"
}
A quick explanation of the fields:

- branch is the branch where the file is located
- search_branches are all the branches that this file can be searched in, i.e. all the branches where this file has the same content
- file is the file name
- content is the searchable file content
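The mapping isn't part of the approach itself, but for the filters and aggregations below to work as expected, search_branches and branch should be keyword fields. Here is a minimal mapping sketch along those lines (my assumption; adapt the field types to your needs):

PUT git
{
  "mappings": {
    "properties": {
      "branch": { "type": "keyword" },
      "search_branches": { "type": "keyword" },
      "file": { "type": "keyword" },
      "content": { "type": "text" },
      "sha1": { "type": "keyword" }
    }
  }
}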
Now let's take the example you mentioned in your comment, i.e.:

main branch has
- foo.txt
- bar.txt

fix123 branch has
- foo.txt (modified)
- bar.txt (unmodified)

And index them using the proposed structure:
POST git/_bulk
{"index":{}}
{"branch": "main", "search_branches": ["main"], "file": "foo.txt", "content": "main foo content"}
{"index":{}}
{"branch": "main", "search_branches": ["main", "fix123"], "file": "bar.txt", "content": "main bar content"}
{"index":{}}
{"branch": "fix123", "search_branch": "fix123", "file": "foo.txt", "content": "fix foo content"}
This structure makes all queries very simple, without the need to resort to aggregations or field collapsing. For instance, using the sample searches you mentioned in your comment, let's search for the word content in both branches in turn:
# if I search the main branch, I want to ignore foo.txt from fix123
POST git/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "content"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "search_branches": "main"
          }
        }
      ]
    }
  }
}
The results would only contain documents from the main branch, i.e. foo.txt and bar.txt from the main branch.
# if I search fix123 branch, I want to search both main's bar.txt and fix123's foo.txt
POST git/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "content"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "search_branches": "fix123"
          }
        }
      ]
    }
  }
}
The results would only contain documents visible from the fix123 branch, i.e. foo.txt from the fix123 branch and bar.txt from the main branch.
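This also explains why aggregations and counts behave correctly: shadowed documents are excluded by the filter before any aggregation runs, so they are never counted. As a sketch, here is how you could count the files visible in fix123 by the branch they physically live in (assuming branch is mapped as a keyword field, as in the mapping sketch above):

POST git/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "search_branches": "fix123"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_origin_branch": {
      "terms": {
        "field": "branch"
      }
    }
  }
}

With the sample data above, this returns a total hit count of 2 and one bucket each for fix123 (foo.txt) and main (bar.txt), with no duplicates.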
So, how do we get to that structure? It's pretty easy, actually. The main concept here is that each file in the tree that contains final content should be indexed in a document of its own, with search_branches containing the name of the current branch and of all other branches created from the current one and its descendants. By "final content", I mean content that hasn't changed in the child branches.
This requires some indexing logic to build up the search_branches field in each document, but that's really the only difficult part. After that, you can search any branch in isolation, or even any set of branches.
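For instance, searching a set of branches is just a matter of turning the term filter into a terms filter, as in this sketch (note that a file whose content differs between the selected branches will then legitimately show up once per distinct version):

POST git/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "content"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "search_branches": ["main", "fix123"]
          }
        }
      ]
    }
  }
}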
First, I would index the main branch, so that each file becomes a document of its own, using its SHA1 as the document ID. Each file would now have search_branches: ["main"].
Then I would visit one branch after another, and for each file, I would apply this logic:

- if the file is unmodified compared to the parent branch (i.e. same SHA1), add the current branch name to the search_branches field in the document from the parent branch
- if the file has been modified, index it as a new document with search_branches: [branch_name]
At this point, you have the guarantee that all documents, in whichever branches, have some sort of differing content that you might want to search. You also have the guarantee that all files that have the same content in different branches are contained in a document with search_branches
having the name of all branches where this same content is present.
Since we're only indexing files with different content, we can use the SHA1 fingerprint as ID. If we do so, it becomes very simple to make the initial load using scripted upserts.
Let's store the script containing our logic in the cluster state, so we don't have to provide it in each bulk element later:
POST _scripts/git_load
{
  "script": {
    "lang": "painless",
    "source": """
      if ( ctx.op == 'create' ) {
        // the document doesn't exist yet: store the file content and
        // initialize search_branches with the current branch
        ctx._source.putAll(params);
        ctx._source.search_branches = [params.branch];
      } else {
        // the document already exists, i.e. the file content is identical
        // in this branch: simply make it searchable in this branch too
        ctx._source.search_branches.add(params.branch);
      }
    """
  }
}
The logic is pretty simple:

- if the document doesn't exist yet, we store its content and initialize the search_branches field with the current branch
- if it already exists, we simply add the current branch to its search_branches field

Then we can use the script, as follows, by providing the document content in the script params:
:
POST git/_update/<sha1_id>
{
  "upsert": {},
  "scripted_upsert": true,
  "script": {
    "id": "git_load",
    "params": {
      "branch": "main",
      "file": "foo.txt",
      "content": "main foo content",
      "sha1": "sha1_id"
    }
  }
}
For the initial bulk load, you simply have to iterate over all branches and files of your repository and provide one such upsert element per document:
POST git/_bulk
{"update":{ "_id": "123"}}
{"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"main","file":"foo.txt","content":"main foo content","sha1":"123"}},"upsert":{}}
{"update":{ "_id": "456"}}
{"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"main","file":"bar.txt","content":"main bar content","sha1":"456"}},"upsert":{}}
{"update":{ "_id": "789"}}
{"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"fix123","file":"foo.txt","content":"fix foo content","sha1":"789"}},"upsert":{}}
{"update":{ "_id": "456"}}
{"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"fix123","file":"bar.txt","content":"main bar content","sha1":"456"}},"upsert":{}}
No file lives forever without being modified, so let's talk about updates. Whenever a file changes, we need to update the parent document and create a new one for the branch in which the file has been modified.
Let's take the example of bar.txt, which we modify in branch fix123 so that it becomes different from the content in main. We need to issue two upserts: one to modify the parent document (i.e. to remove fix123 from search_branches in the parent document) and another one to create the child document.
For that we need the old SHA1 and the new one, and we need to modify the script a little bit. We add a case that checks whether the remove_branch parameter is present; if it is, we remove that branch from the search_branches field of the existing document. The rest stays the same:
POST _scripts/git_load
{
  "script": {
    "lang": "painless",
    "source": """
      if ( ctx.op == 'create' ) {
        // new document: store the file content and initialize search_branches
        ctx._source.putAll(params);
        ctx._source.search_branches = [params.branch];
      } else if (params.remove_branch != null) {
        // the file has been modified in that branch: it must no longer be
        // searchable through this (parent) document
        ctx._source.search_branches.removeIf(elem -> elem.equals(params.remove_branch));
      } else {
        // unmodified file: make this document searchable in this branch too
        ctx._source.search_branches.add(params.branch);
      }
    """
  }
}
We can now issue the two following upserts to create the new bar.txt document in the fix123 branch. Also note that in order to update the older document, you don't need to send anything other than the remove_branch parameter.
POST git/_bulk
{"update":{ "_id": "456"}}
{"scripted_upsert":true,"script":{"id":"git_load","params":{"remove_branch": "fix123"}},"upsert":{}}
{"update":{ "_id": "xyz"}}
{"scripted_upsert":true,"script":{"id":"git_load","params":{"branch":"fix123","file":"bar.txt","content":"fix bar content","sha1":"xyz"}},"upsert":{}}
That's pretty much it. If your repository is not that big, I would simply recreate the index from scratch on every update; otherwise the initial load + bulk upserts approach works fine, since you can always decide to recreate the index from scratch later.