I just got started with SemanticKernel against a local LLM (LM Studio).
I got the non-streaming version working with the following code:
var chat = app.Services.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddUserMessage("Hi"); // the conversation is a single user message
ChatMessageContent response = await chat.GetChatMessageContentAsync(chatHistory);
var firstItem = response.Items.FirstOrDefault();
var textContent = firstItem as TextContent;
Console.WriteLine(textContent?.Text);
This works as expected and produces the reply "Hello! How can I assist you today? 😊".
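(For context, the chat completion service is wired up against LM Studio's OpenAI-compatible endpoint, roughly like this; the port and API key are placeholders for my local setup, and the exact AddOpenAIChatCompletion overload may differ between Semantic Kernel versions:)
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddOpenAIChatCompletion(
    modelId: "deepseek-r1-distill-llama-8b",       // the model loaded in LM Studio
    endpoint: new Uri("http://localhost:1234/v1"), // LM Studio's local server
    apiKey: "lm-studio");                          // LM Studio doesn't validate the key
var app = builder.Build();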
However, I want to do this with streaming, like everybody else:
await foreach (StreamingChatMessageContent stream in chat.GetStreamingChatMessageContentsAsync("Hi"))
{
// await Task.Yield(); // tried this to see if it would help but it didn't
Console.WriteLine(stream.Content);
}
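What I'm ultimately after is collecting the streamed chunks into the full reply, something like this (just a sketch, reusing the chatHistory from the first snippet and a System.Text StringBuilder):
var fullReply = new StringBuilder();
await foreach (StreamingChatMessageContent chunk in chat.GetStreamingChatMessageContentsAsync(chatHistory))
{
    Console.Write(chunk.Content);    // print each delta as it arrives
    fullReply.Append(chunk.Content); // and keep it for the complete message
}
Console.WriteLine();
Console.WriteLine(fullReply);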
But the streaming call returns 12 "empty" results which, when serialised, look like this:
{"Content":null,"Role":{"Label":"Assistant"},"ChoiceIndex":0,"ModelId":"deepseek-r1-distill-llama-8b","Metadata":{"CompletionId":"chatcmpl-m086eaeve495763ls6arwj","CreatedAt":"2025-02-13T10:22:51+00:00","SystemFingerprint":"deepseek-r1-distill-llama-8b","RefusalUpdate":null,"Usage":null,"FinishReason":null}}
followed by a final "stop" chunk:
{"Content":null,"Role":null,"ChoiceIndex":0,"ModelId":"deepseek-r1-distill-llama-8b","Metadata":{"CompletionId":"chatcmpl-m086eaeve495763ls6arwj","CreatedAt":"2025-02-13T10:22:51+00:00","SystemFingerprint":"deepseek-r1-distill-llama-8b","RefusalUpdate":null,"Usage":null,"FinishReason":"Stop"}}
So I know the server is running, since the direct (non-streaming) approach works fine, but I cannot get streaming to work properly.
For the direct message without streaming, here is the server log for the request:
2025-02-13 12:28:42 [DEBUG]
Received request: POST to /v1/chat/completions with body {
"messages": [
{
"role": "user",
"content": "Hi"
}
],
"model": "deepseek-r1-distill-llama-8b"
}
2025-02-13 12:28:42 [INFO]
[LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
2025-02-13 12:28:42 [DEBUG]
Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
2025-02-13 12:28:42 [DEBUG]
sampling:
logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 12
BeginProcessingPrompt
2025-02-13 12:28:42 [DEBUG]
FinishedProcessingPrompt. Progress: 100
2025-02-13 12:28:42 [INFO]
[LM STUDIO SERVER] Accumulating tokens ... (stream = false)
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 1 tokens <think>
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 2 tokens <think>\n\n
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 3 tokens <think>\n\n</think>
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 4 tokens <think>\n\n</think>\n\n
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 5 tokens <think>\n\n</think>\n\nHello
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 6 tokens <think>\n\n</think>\n\nHello!
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 7 tokens <think>\n\n</think>\n\nHello! How
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 8 tokens <think>\n\n</think>\n\nHello! How can
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 9 tokens <think>\n\n</think>\n\nHello! How can I
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 10 tokens <think>\n\n</think>\n\nHello! How can I assist
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 11 tokens <think>\n\n</think>\n\nHello! How can I assist you
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 12 tokens <think>\n\n</think>\n\nHello! How can I assist you today
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 13 tokens <think>\n\n</think>\n\nHello! How can I assist you today?
2025-02-13 12:28:42 [DEBUG]
Incomplete UTF-8 character. Waiting for next token (skip)
2025-02-13 12:28:42 [DEBUG]
[deepseek-r1-distill-llama-8b] Accumulated 14 tokens <think>\n\n</think>\n\nHello! How can I assist you today? 😊
2025-02-13 12:28:42 [DEBUG]
target model llama_perf stats:
llama_perf_context_print: load time = 4657.69 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 350.76 ms / 16 runs ( 21.92 ms per token, 45.62 tokens per second)
llama_perf_context_print: total time = 361.22 ms / 17 tokens
2025-02-13 12:28:42 [INFO]
[LM STUDIO SERVER] [deepseek-r1-distill-llama-8b] Generated prediction: {
"id": "chatcmpl-qv5tc01fntgw2bsm3091wk",
"object": "chat.completion",
"created": 1739442522,
"model": "deepseek-r1-distill-llama-8b",
"choices": [
{
"index": 0,
"logprobs": null,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "<think>\n\n</think>\n\nHello! How can I assist you today? 😊"
}
}
],
"usage": {
"prompt_tokens": 4,
"completion_tokens": 14,
"total_tokens": 18
},
"system_fingerprint": "deepseek-r1-distill-llama-8b"
}
And here is the server log for the streaming request, which doesn't return any content:
2025-02-13 12:30:18 [DEBUG]
Received request: POST to /v1/chat/completions with body {
"messages": [
{
"role": "user",
"content": "Hi"
}
],
"model": "deepseek-r1-distill-llama-8b",
"stream": true,
"stream_options": {
"include_usage": true
}
}
2025-02-13 12:30:18 [INFO]
[LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
2025-02-13 12:30:18 [INFO]
[LM STUDIO SERVER] Streaming response...
2025-02-13 12:30:18 [DEBUG]
Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling:
logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 12
BeginProcessingPrompt
2025-02-13 12:30:18 [DEBUG]
FinishedProcessingPrompt. Progress: 100
2025-02-13 12:30:18 [INFO]
[LM STUDIO SERVER] First token generated. Continuing to stream response..
2025-02-13 12:30:18 [INFO]
[LM STUDIO SERVER] Received <think> - START
2025-02-13 12:30:18 [DEBUG]
Incomplete UTF-8 character. Waiting for next token (skip)
2025-02-13 12:30:18 [DEBUG]
target model llama_perf stats:
llama_perf_context_print: load time = 4657.69 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 354.33 ms / 16 runs ( 22.15 ms per token, 45.16 tokens per second)
llama_perf_context_print: total time = 365.10 ms / 17 tokens
2025-02-13 12:30:18 [INFO]
Finished streaming response
This issue has been resolved. It appears there was a bug in LM Studio: I updated the LM Runtime to v.1.15.0 and now streaming works.
So if anyone else encounters this problem, maybe that will steer you in the right direction.