Tags: c#, amazon-web-services, elasticsearch, nest

Updating a document in Elasticsearch with an index strategy takes three calls, which is not efficient


I have an AWS Elasticsearch server that uses a mapping template and an index strategy.

{
  "index_patterns": "users*",
  "order": 6,
  "version": 6,
  "aliases": {
    "users": {}
  },
  "settings": {
    "number_of_shards": 5
  },
  "mappings": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "id": { "type": "keyword" },
        "emailAddress": { "type": "keyword" }
      }
    }
  }
}

The index strategy is {index_patterns}-{yyyy}-{MM}-{order}-{version}
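
To illustrate why the concrete index is not known up front, here is a minimal sketch of how such a name could be composed. BuildIndexName is a hypothetical helper, not part of the code in this question, and it assumes that the prefix is the template pattern without its trailing `*` and that the date is the one on which the document was first indexed.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not from the original post): composes a concrete index
// name from the strategy {index_patterns}-{yyyy}-{MM}-{order}-{version}.
// Assumption: the prefix is the template pattern without its trailing '*'.
static string BuildIndexName(string prefix, DateTime date, int order, int version) =>
    string.Format(
        CultureInfo.InvariantCulture,
        "{0}-{1:yyyy}-{1:MM}-{2}-{3}",
        prefix, date, order, version);

// Two users first indexed in different months end up in different indices.
Console.WriteLine(BuildIndexName("users", new DateTime(2020, 3, 15), 6, 6)); // users-2020-03-6-6
Console.WriteLine(BuildIndexName("users", new DateTime(2020, 4, 2), 6, 6));  // users-2020-04-6-6
```

Because the name depends on a date that the incoming event does not carry, the handler cannot derive it from the document id alone.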

public async Task<Result> HandleEventAsync(UserChanged @event, CancellationToken cancellationToken)
{
    // 1. Get the user. I could skip this call if the index were known and no strategy were used
    var userMaybe =
        await _usersRepository.GetByIdAsync(@event.AggregateId.ToString(), cancellationToken);

    if (userMaybe.HasValue)
    {
        var user = userMaybe.Value.User;

        var partialUpdate = new
        {
            name = @event.Profile.Name,
            birthDate = @event.Profile.BirthDate?.ToString("yyyy-MM-dd"),
            gender = @event.Profile.Gender.ToString(),
            updatedDate = DateTime.UtcNow,
            updatedTimestampEpochInMilliseconds = EpochGenerator.EpochTimestampInMilliseconds(),
        };

        // 2. Remove fields with NULL values (if found any)
        // 3. Partial or Full update of the document, in this case partial
        var result = await _usersRepository.UpdateAsync(user.Id, partialUpdate, userMaybe.Value.Index, cancellationToken: cancellationToken);

        return result.IsSuccess ? Result.Ok() : Result.Fail($"Failed to update User {user.Id}");
    }

    return Result.Fail("User doesn't exist");
}

So in this method I consume an SQS message, retrieve the document from Elasticsearch only to find out its index (I don't know it explicitly), remove any NULL fields using the methods below (because the serializer would otherwise include NULL values in the update), and then partially update the document.

That is three Elasticsearch operations for one update. I understand that the UpdateByQuery call for NULL values could be dropped if we decided to just tolerate null values in the document, but then we might not be able to query these fields with Exists/NotExists if we ever needed to.

private async Task<Result> RemoveNullFieldsFromDocumentAsync(
    object document,
    string documentId,
    string indexName = null,
    string typeName = null,
    CancellationToken cancellationToken = default)
{
    var result = Result.Ok();
    var allNullProperties = GetNullPropertyValueNames(document);
    if (allNullProperties.AnyAndNotNull())
    {
        // Join with a single separator so three or more fields do not produce doubled semicolons.
        var script = string.Join("; ", allNullProperties.Select(p => $"ctx._source.remove('{p}')"));
        result = await UpdateByQueryIdAsync(
            documentId,
            script,
            indexName,
            typeName,
            cancellationToken: cancellationToken);
    }

    return result;
}

private static IReadOnlyList<string> GetNullPropertyValueNames(object document)
{
    var allPublicProperties =  document.GetType().GetProperties().ToList();

    var allObjects = allPublicProperties.Where(pi => pi.PropertyType.IsClass).ToList();

    var allNames = new List<string>();

    foreach (var propertyInfo in allObjects)
    {
        if (propertyInfo.PropertyType == typeof(string))
        {
            var isNullOrEmpty = ((string) propertyInfo.GetValue(document)).IsNullOrEmpty();
            if (isNullOrEmpty)
            {
                allNames.Add(propertyInfo.Name.ToCamelCase());
            }
        }
        else if (propertyInfo.PropertyType.IsClass)
        {
            if (propertyInfo.GetValue(document).IsNull())
            {
                allNames.Add(propertyInfo.Name.ToCamelCase());
            }
            else
            {
                // Prefix nested names with the property name (not the type name) to form the field path.
                var namesWithObjectName = GetNullPropertyValueNames(propertyInfo.GetValue(document))
                    .Select(p => $"{propertyInfo.Name.ToCamelCase()}.{p}");
                allNames.AddRange(namesWithObjectName);
            }
        }
    }

    return allNames;
}

public async Task<Result> UpdateByQueryIdAsync(
    string documentId,
    string script,
    string indexName = null,
    string typeName = null,
    bool waitForCompletion = false,
    CancellationToken cancellationToken = default)
{
    Guard.Argument(documentId, nameof(documentId)).NotNull().NotEmpty().NotWhiteSpace();
    Guard.Argument(script, nameof(script)).NotNull().NotEmpty().NotWhiteSpace();

    var response = await Client.UpdateByQueryAsync<T>(
        u => u.Query(q => q.Ids(i => i.Values(documentId)))
                .Conflicts(Conflicts.Proceed)
                .Script(s => s.Source(script))
                .Refresh()
                .WaitForCompletion(waitForCompletion)
                .Index(indexName ?? DocumentMappings.IndexStrategy)
                .Type(typeName ?? DocumentMappings.TypeName), 
        cancellationToken);

    var errorMessage = response.LogResponseIfError(_logger);

    return errorMessage.IsNullOrEmpty() ? Result.Ok() : Result.Fail(errorMessage);
}
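
For reference, a minimal sketch of what the removal script looks like for a given set of null fields. BuildRemoveScript is a hypothetical stand-alone helper used only for illustration; it joins one Painless statement per field with a single separator, so the result has the same shape no matter how many fields there are.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not from the original post): builds one Painless
// statement per field name and joins them with a single "; " separator.
static string BuildRemoveScript(IEnumerable<string> fieldNames) =>
    string.Join("; ", fieldNames.Select(name => $"ctx._source.remove('{name}')"));

Console.WriteLine(BuildRemoveScript(new[] { "name", "birthDate", "gender" }));
// ctx._source.remove('name'); ctx._source.remove('birthDate'); ctx._source.remove('gender')
```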

My question is: if I change the strategy to use a single constant index for all user documents, which are not significant in number and will not grow into the billions any time soon, will I take a performance hit in Elasticsearch (sharding, indexing, etc.)?


Solution

  • Yes, you can use a single index. A single index can handle a lot of data, so you don't need to split yours as finely as you do now. In fact, small indices with small shards are worse for performance, because they lead to many shards per node, and each shard eats up heap space as overhead.

    Date-based indices make sense if you have a lot of data coming in regularly; in that case a simpler pattern such as index_name-yyyyMMdd might be enough.

    Lastly, you can always search across all your indices using wildcards. With the pattern above you would query index_name-*. With your existing pattern you can do the same: index_patterns-*, index_patterns-yyyy-*, and so on.

    Some info around shard sizing: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
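
To put rough numbers on the shard overhead described above, here is a small arithmetic sketch. The value number_of_shards = 5 comes from the template in the question; one replica per primary is an assumption (it is Elasticsearch's default when the template does not set number_of_replicas), and TotalShards is a hypothetical helper.

```csharp
using System;

// Hypothetical helper: total shards (primaries plus replica copies) created
// by a set of indices that share the same settings.
static int TotalShards(int indices, int primariesPerIndex, int replicasPerPrimary) =>
    indices * primariesPerIndex * (1 + replicasPerPrimary);

// One index per month for a year, 5 primaries each, 1 replica per primary (assumed).
Console.WriteLine(TotalShards(indices: 12, primariesPerIndex: 5, replicasPerPrimary: 1)); // 120

// A single constant index with the same settings.
Console.WriteLine(TotalShards(indices: 1, primariesPerIndex: 5, replicasPerPrimary: 1));  // 10
```

Under these assumptions, a year of monthly indices costs twelve times as many shards as a single index for the same small data set, and new order or version values in the index name would add further indices on top.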