Tags: c#, .net, large-language-model, onnx, onnxruntime

Microsoft.ML.OnnxRuntimeGenAI parallelism performance


First of all, I want to say that I am a newbie in ML in general and in .NET ML specifically. I have a local Phi-3 model in ONNX format and a .NET 8 API application:

// Program.cs

using Microsoft.ML.OnnxRuntimeGenAI;

var builder = WebApplication.CreateBuilder(args);
var modelDirectory = @"path-to-local-model-directory";

builder
  .Services
  .AddSingleton(new Model(modelDirectory)) // single Model instance shared by all requests
  .AddSingleton<Tokenizer>();              // Tokenizer(Model) is resolved from the container

builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();

var app = builder.Build();
app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();
app.Run();

And a controller that returns the response from the model:

// TestController.cs

using System.Text;
using Microsoft.AspNetCore.Mvc;
using Microsoft.ML.OnnxRuntimeGenAI;

[ApiController]
[Route("[controller]")]
public class TestController(Model model, Tokenizer tokenizer) : ControllerBase
{
  [HttpPost("generate")]
  public string Generate([FromBody] Dto dto)
  {
    using var tokens = tokenizer.Encode(dto.test);

    using var generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("max_length", 2048);
    generatorParams.SetInputSequences(tokens);

    var result = new StringBuilder();
    using var generator = new Generator(model, generatorParams);

    while (!generator.IsDone())
    {
      // compute logits for the current sequence, then sample the next token
      generator.ComputeLogits();
      generator.GenerateNextToken();

      // decode only the token appended in this iteration
      var outputTokens = generator.GetSequence(0);
      var newToken = outputTokens.Slice(outputTokens.Length - 1, 1);
      result.Append(tokenizer.Decode(newToken));
    }

    return result.ToString();
  }
}

public class Dto
{
  public string test { get; set; }
}

This code works well, but what interested me was how this application handles parallel requests, and I got these results:

| Parallel requests | Total time (all requests) | Average request time | RAM usage | CPU usage |
|---|---|---|---|---|
| 1 | 00:01:50.30 | 00:01:50.18 | 4.3 GB | < 20% |
| 2 | 00:03:07.82 | 00:03:07.64 | 5.4 GB | ~25% |
| 3 | 00:04:19.06 | 00:04:18.73 | 6.6 GB | ~30% |
| 4 | 00:05:36.74 | 00:05:35.96 | 7.8 GB | ~50% |

What we see is that the requests really are processed in parallel, because the total time to complete all of them equals the average per-request time. I also understand the growth in RAM and CPU usage as the number of parallel requests increases. But I can't understand why the duration of each individual request grows.
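
For reference, measurements like these can be gathered with a small console client that fires the requests concurrently. Below is a minimal sketch (not necessarily the exact harness used for the table above); the base URL, prompt, and request count are assumptions:

// LoadTest.cs (sketch; base URL, prompt and request count are assumptions)

using System.Diagnostics;
using System.Net.Http.Json;

const int parallelRequests = 4;

using var client = new HttpClient
{
  BaseAddress = new Uri("https://localhost:5001"),
  Timeout = Timeout.InfiniteTimeSpan // generation can take minutes
};

var total = Stopwatch.StartNew();
var tasks = Enumerable.Range(0, parallelRequests).Select(async i =>
{
  var sw = Stopwatch.StartNew();
  using var response = await client.PostAsJsonAsync("Test/generate", new { test = "Tell me about ONNX." });
  response.EnsureSuccessStatusCode();
  await response.Content.ReadAsStringAsync();
  Console.WriteLine($"request {i}: {sw.Elapsed}");
  return sw.Elapsed.Ticks;
}).ToArray();

var ticks = await Task.WhenAll(tasks);
Console.WriteLine($"total: {total.Elapsed}, average: {TimeSpan.FromTicks((long)ticks.Average())}");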

Also, is it maybe possible to use Microsoft.ML.OnnxRuntimeGenAI in a different way to improve performance for parallel requests?


Solution

  • I think I found the answer here: https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

    In this article I found that for LLM inference memory bandwidth is the key resource, which explains why parallel requests increase the duration of each request without the CPU being saturated. In this case memory bandwidth is the bottleneck, and the parallel generation loops are competing for access to RAM.
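
    A quick back-of-envelope check of that explanation: during decoding, each generated token has to stream essentially all of the model weights from RAM once, so the per-request token rate is roughly memory bandwidth divided by model size, divided again by the number of concurrent generators. All numbers in the sketch below are illustrative assumptions, not measurements:

// BandwidthEstimate.cs (illustrative assumptions only)

// Each decoded token reads ~all weights from RAM once, so:
// tokens/s per request ≈ bandwidth / model size / concurrent requests.
const double modelSizeGb = 2.2;         // assumption: phi-3-mini with 4-bit weights
const double memoryBandwidthGbS = 45.0; // assumption: effective dual-channel DDR4

for (var parallel = 1; parallel <= 4; parallel++)
{
  var tokensPerSecond = memoryBandwidthGbS / modelSizeGb / parallel;
  Console.WriteLine($"{parallel} parallel request(s): ~{tokensPerSecond:F1} tokens/s each");
}

    With these assumed numbers a single request tops out around 20 tokens/s, and each additional concurrent request roughly divides that rate, which matches the near-linear growth of per-request time in the table above.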
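
    Given a bandwidth-bound workload, one mitigation that does not depend on the library at all is to cap how many generation loops run at once, so requests queue instead of competing for the memory bus. This is a hypothetical sketch; GenerationGate and the limit of 1 are assumptions to tune for your hardware:

// GenerationGate.cs (hypothetical helper, not part of Microsoft.ML.OnnxRuntimeGenAI)

using System;
using System.Threading;
using System.Threading.Tasks;

public static class GenerationGate
{
  // assumption: one generation loop at a time; tune for your hardware
  private static readonly SemaphoreSlim Semaphore = new(initialCount: 1);

  public static async Task<string> RunAsync(Func<string> generate)
  {
    await Semaphore.WaitAsync();
    try
    {
      // run the CPU-bound generation loop off the request thread
      return await Task.Run(generate);
    }
    finally
    {
      Semaphore.Release();
    }
  }
}

    The controller action would then become async and wrap its body in GenerationGate.RunAsync(() => ...), trading queueing delay for stable per-request latency.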