Tags: c#, .net, large-language-model, onnx, onnxruntime

Microsoft.ML.OnnxRuntimeGenAI parallelism performance


First of all, I want to say that I am a newbie in ML in general and in .NET ML specifically. I have a local Phi-3 model in ONNX format and a .NET 8 API application:

// Program.cs

using Microsoft.ML.OnnxRuntimeGenAI;

var builder = WebApplication.CreateBuilder(args);
var modelDirectory = @"path-to-local-model-directory";

builder
  .Services
  .AddSingleton(new Model(modelDirectory)) // single Model instance shared by all requests
  .AddSingleton<Tokenizer>();              // Tokenizer(Model) is resolved from the container

builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();

var app = builder.Build();
app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();
app.Run();

And a controller that returns the response from the model:

// TestController.cs

using System.Text;
using Microsoft.AspNetCore.Mvc;
using Microsoft.ML.OnnxRuntimeGenAI;

[ApiController]
[Route("[controller]")]
public class TestController(Model model, Tokenizer tokenizer) : ControllerBase
{
  [HttpPost("generate")]
  public string Generate([FromBody] Dto dto)
  {
    using var tokens = tokenizer.Encode(dto.test);

    using var generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("max_length", 2048);
    generatorParams.SetInputSequences(tokens);

    var result = new StringBuilder();
    using var generator = new Generator(model, generatorParams);

    while (!generator.IsDone())
    {
      // compute logits for the current sequence, then sample the next token
      generator.ComputeLogits();
      generator.GenerateNextToken();

      // decode only the token appended in this iteration
      var outputTokens = generator.GetSequence(0);
      var newToken = outputTokens.Slice(outputTokens.Length - 1, 1);
      result.Append(tokenizer.Decode(newToken));
    }

    return result.ToString();
  }
}

public class Dto
{
  public string test { get; set; }
}

This code works well, but what interested me was how this application handles parallel requests, and I got these results:

| Parallel requests | Total time (all requests) | Average request time | RAM usage | CPU usage |
|---|---|---|---|---|
| 1 | 00:01:50.30 | 00:01:50.18 | 4.3 GB | < 20% |
| 2 | 00:03:07.82 | 00:03:07.64 | 5.4 GB | ~25% |
| 3 | 00:04:19.06 | 00:04:18.73 | 6.6 GB | ~30% |
| 4 | 00:05:36.74 | 00:05:35.96 | 7.8 GB | ~50% |

What we see is that the requests really are processed in parallel, because the total time to complete all of them equals the average per-request time. I also understand the growth in RAM and CPU usage as the number of parallel requests increases. But I can't understand why the duration of each individual request grows.
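
For reference, measurements like these can be gathered with a small console client that fires the requests concurrently. Below is a minimal sketch (not necessarily the exact harness used for the table above); the base URL, prompt, and request count are assumptions:

// LoadTest.cs (sketch; base URL, prompt and request count are assumptions)

using System.Diagnostics;
using System.Net.Http.Json;

const int parallelRequests = 4;

using var client = new HttpClient
{
  BaseAddress = new Uri("https://localhost:5001"),
  Timeout = Timeout.InfiniteTimeSpan // generation can take minutes
};

var total = Stopwatch.StartNew();
var tasks = Enumerable.Range(0, parallelRequests).Select(async i =>
{
  var sw = Stopwatch.StartNew();
  using var response = await client.PostAsJsonAsync("Test/generate", new { test = "Tell me about ONNX." });
  response.EnsureSuccessStatusCode();
  await response.Content.ReadAsStringAsync();
  Console.WriteLine($"request {i}: {sw.Elapsed}");
  return sw.Elapsed.Ticks;
}).ToArray();

var ticks = await Task.WhenAll(tasks);
Console.WriteLine($"total: {total.Elapsed}, average: {TimeSpan.FromTicks((long)ticks.Average())}");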

Also, is it maybe possible to use Microsoft.ML.OnnxRuntimeGenAI in a different way to improve performance for parallel requests?


Solution

  • I think I found the answer here: https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

    In this article I found that for LLM inference memory bandwidth is the key resource, which explains why parallel requests increase the duration of each request without the CPU being saturated. In this case memory bandwidth is the bottleneck, and the parallel generation loops are competing for access to RAM.
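
    A quick back-of-envelope check of that explanation: during decoding, each generated token has to stream essentially all of the model weights from RAM once, so the per-request token rate is roughly memory bandwidth divided by model size, divided again by the number of concurrent generators. All numbers in the sketch below are illustrative assumptions, not measurements:

// BandwidthEstimate.cs (illustrative assumptions only)

// Each decoded token reads ~all weights from RAM once, so:
// tokens/s per request ≈ bandwidth / model size / concurrent requests.
const double modelSizeGb = 2.2;         // assumption: phi-3-mini with 4-bit weights
const double memoryBandwidthGbS = 45.0; // assumption: effective dual-channel DDR4

for (var parallel = 1; parallel <= 4; parallel++)
{
  var tokensPerSecond = memoryBandwidthGbS / modelSizeGb / parallel;
  Console.WriteLine($"{parallel} parallel request(s): ~{tokensPerSecond:F1} tokens/s each");
}

    With these assumed numbers a single request tops out around 20 tokens/s, and each additional concurrent request roughly divides that rate, which matches the near-linear growth of per-request time in the table above.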
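
    Given a bandwidth-bound workload, one mitigation that does not depend on the library at all is to cap how many generation loops run at once, so requests queue instead of competing for the memory bus. This is a hypothetical sketch; GenerationGate and the limit of 1 are assumptions to tune for your hardware:

// GenerationGate.cs (hypothetical helper, not part of Microsoft.ML.OnnxRuntimeGenAI)

using System;
using System.Threading;
using System.Threading.Tasks;

public static class GenerationGate
{
  // assumption: one generation loop at a time; tune for your hardware
  private static readonly SemaphoreSlim Semaphore = new(initialCount: 1);

  public static async Task<string> RunAsync(Func<string> generate)
  {
    await Semaphore.WaitAsync();
    try
    {
      // run the CPU-bound generation loop off the request thread
      return await Task.Run(generate);
    }
    finally
    {
      Semaphore.Release();
    }
  }
}

    The controller action would then become async and wrap its body in GenerationGate.RunAsync(() => ...), trading queueing delay for stable per-request latency.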