Tags: c#, asp.net-core, large-language-model, semantic-kernel, lm-studio

SemanticKernel GetStreamingChatMessageContentsAsync empty but GetChatMessageContentAsync works fine


I just got started with Semantic Kernel on a local LLM (LM Studio).

I got it working with the following code:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

var chat = app.Services.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddUserMessage("Hi");

ChatMessageContent response = await chat.GetChatMessageContentAsync(chatHistory);
// The reply text is in the first TextContent item of the response
var textContent = response.Items.FirstOrDefault() as TextContent;
Console.WriteLine(textContent?.Text);

This works as expected and produces a "Hello! How can I assist you today? 😊" reply (the emoji renders as "??" in my console).

However, I want to do this like everybody else, with streaming:

await foreach (StreamingChatMessageContent chunk in chat.GetStreamingChatMessageContentsAsync("Hi"))
{
    // await Task.Yield(); // tried this to see if it would help, but it didn't
    Console.WriteLine(chunk.Content);
}

But this returns 12 "empty" results which, serialised, look like this:

{"Content":null,"Role":{"Label":"Assistant"},"ChoiceIndex":0,"ModelId":"deepseek-r1-distill-llama-8b","Metadata":{"CompletionId":"chatcmpl-m086eaeve495763ls6arwj","CreatedAt":"2025-02-13T10:22:51+00:00","SystemFingerprint":"deepseek-r1-distill-llama-8b","RefusalUpdate":null,"Usage":null,"FinishReason":null}}

followed by a final chunk with a "Stop" finish reason:

{"Content":null,"Role":null,"ChoiceIndex":0,"ModelId":"deepseek-r1-distill-llama-8b","Metadata":{"CompletionId":"chatcmpl-m086eaeve495763ls6arwj","CreatedAt":"2025-02-13T10:22:51+00:00","SystemFingerprint":"deepseek-r1-distill-llama-8b","RefusalUpdate":null,"Usage":null,"FinishReason":"Stop"}}

So I know the server is running, since the direct approach works fine, but I cannot get streaming to work properly.
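
To rule out Semantic Kernel itself, the endpoint can also be hit directly and the raw server-sent chunks inspected; a minimal sketch (localhost:1234 is LM Studio's default port and an assumption here, as my setup isn't shown):

using System.Text;

// POST a streaming chat completion straight to LM Studio, bypassing Semantic Kernel
using var http = new HttpClient();
var request = new HttpRequestMessage(HttpMethod.Post, "http://localhost:1234/v1/chat/completions")
{
    Content = new StringContent(
        """{"model":"deepseek-r1-distill-llama-8b","stream":true,"messages":[{"role":"user","content":"Hi"}]}""",
        Encoding.UTF8, "application/json")
};
using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
using var reader = new StreamReader(await response.Content.ReadAsStreamAsync());
string? line;
while ((line = await reader.ReadLineAsync()) is not null)
{
    // Each "data: {...}" line is one raw SSE chunk; empty deltas here would point at the server
    if (line.StartsWith("data: "))
        Console.WriteLine(line);
}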

For the direct message without streaming, here is the server log for the request:

2025-02-13 12:28:42 [DEBUG] 
Received request: POST to /v1/chat/completions with body  {
  "messages": [
    {
      "role": "user",
      "content": "Hi"
    }
  ],
  "model": "deepseek-r1-distill-llama-8b"
}
2025-02-13 12:28:42  [INFO] 
[LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
2025-02-13 12:28:42 [DEBUG] 
Sampling params:    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
2025-02-13 12:28:42 [DEBUG] 
sampling: 
logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 12
BeginProcessingPrompt
2025-02-13 12:28:42 [DEBUG] 
FinishedProcessingPrompt. Progress: 100
2025-02-13 12:28:42  [INFO] 
[LM STUDIO SERVER] Accumulating tokens ... (stream = false)
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 1 tokens <think>
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 2 tokens <think>\n\n
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 3 tokens <think>\n\n</think>
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 4 tokens <think>\n\n</think>\n\n
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 5 tokens <think>\n\n</think>\n\nHello
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 6 tokens <think>\n\n</think>\n\nHello!
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 7 tokens <think>\n\n</think>\n\nHello! How
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 8 tokens <think>\n\n</think>\n\nHello! How can
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 9 tokens <think>\n\n</think>\n\nHello! How can I
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 10 tokens <think>\n\n</think>\n\nHello! How can I assist
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 11 tokens <think>\n\n</think>\n\nHello! How can I assist you
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 12 tokens <think>\n\n</think>\n\nHello! How can I assist you today
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 13 tokens <think>\n\n</think>\n\nHello! How can I assist you today?
2025-02-13 12:28:42 [DEBUG] 
Incomplete UTF-8 character. Waiting for next token (skip)
2025-02-13 12:28:42 [DEBUG] 
[deepseek-r1-distill-llama-8b] Accumulated 14 tokens <think>\n\n</think>\n\nHello! How can I assist you today? 😊
2025-02-13 12:28:42 [DEBUG] 
target model llama_perf stats:
llama_perf_context_print:        load time =    4657.69 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     350.76 ms /    16 runs   (   21.92 ms per token,    45.62 tokens per second)
llama_perf_context_print:       total time =     361.22 ms /    17 tokens
2025-02-13 12:28:42  [INFO] 
[LM STUDIO SERVER] [deepseek-r1-distill-llama-8b] Generated prediction:  {
  "id": "chatcmpl-qv5tc01fntgw2bsm3091wk",
  "object": "chat.completion",
  "created": 1739442522,
  "model": "deepseek-r1-distill-llama-8b",
  "choices": [
    {
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "<think>\n\n</think>\n\nHello! How can I assist you today? 😊"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 14,
    "total_tokens": 18
  },
  "system_fingerprint": "deepseek-r1-distill-llama-8b"
}
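
As an aside, the generated content above is prefixed with a <think>...</think> block (the DeepSeek R1 distill's reasoning tags). If those tags show up in your output, a small helper can strip them before display; this is my own sketch, not anything Semantic Kernel provides:

using System.Text.RegularExpressions;

// Strip a leading <think>...</think> reasoning block from DeepSeek R1 output
static string StripThink(string text) =>
    Regex.Replace(text, @"<think>.*?</think>\s*", string.Empty, RegexOptions.Singleline);

Console.WriteLine(StripThink("<think>\n\n</think>\n\nHello! How can I assist you today? 😊"));
// -> Hello! How can I assist you today? 😊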

And here is the server log for the streaming request, which doesn't return any content:

2025-02-13 12:30:18 [DEBUG] 
Received request: POST to /v1/chat/completions with body  {
  "messages": [
    {
      "role": "user",
      "content": "Hi"
    }
  ],
  "model": "deepseek-r1-distill-llama-8b",
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}
2025-02-13 12:30:18  [INFO] 
[LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
2025-02-13 12:30:18  [INFO] 
[LM STUDIO SERVER] Streaming response...
2025-02-13 12:30:18 [DEBUG] 
Sampling params:    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling: 
logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 12
BeginProcessingPrompt
2025-02-13 12:30:18 [DEBUG] 
FinishedProcessingPrompt. Progress: 100
2025-02-13 12:30:18  [INFO] 
[LM STUDIO SERVER] First token generated. Continuing to stream response..
2025-02-13 12:30:18  [INFO] 
[LM STUDIO SERVER] Received <think> - START
2025-02-13 12:30:18 [DEBUG] 
Incomplete UTF-8 character. Waiting for next token (skip)
2025-02-13 12:30:18 [DEBUG] 
target model llama_perf stats:
llama_perf_context_print:        load time =    4657.69 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     354.33 ms /    16 runs   (   22.15 ms per token,    45.16 tokens per second)
llama_perf_context_print:       total time =     365.10 ms /    17 tokens
2025-02-13 12:30:18  [INFO] 
Finished streaming response

Solution

  • This issue has been resolved. It appears there was a bug in LM Studio: I updated the LM Runtime to v1.15.0 and now it works.

    So if anyone else encounters this problem, maybe that could steer you in the right direction.
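
    For completeness, the streaming setup that works for me after the update looks roughly like this (the endpoint, port, and registration are assumptions based on LM Studio defaults, since my DI wiring isn't shown above):

    using Microsoft.SemanticKernel;
    using Microsoft.SemanticKernel.ChatCompletion;

    // Point the OpenAI connector at the local LM Studio server (OpenAI-compatible API).
    // LM Studio ignores the API key, but the connector requires one.
    var builder = Kernel.CreateBuilder();
    builder.AddOpenAIChatCompletion(
        modelId: "deepseek-r1-distill-llama-8b",
        apiKey: "lm-studio",
        endpoint: new Uri("http://localhost:1234/v1"));
    var kernel = builder.Build();

    var chat = kernel.GetRequiredService<IChatCompletionService>();
    await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync("Hi"))
    {
        Console.Write(chunk.Content); // with LM Runtime v1.15.0+ the tokens now stream through
    }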