We're using MLX to fine-tune a model fetched from Hugging Face:
from transformers import AutoModel
model = AutoModel.from_pretrained('deepseek-ai/deepseek-coder-6.7b-instruct')
We fine-tuned the model with a command like python -m mlx_lm.lora --config lora_config.yaml,
and the config file looks like:
# The path to the local model directory or Hugging Face repo.
model: "deepseek-ai/deepseek-coder-6.7b-instruct"
# Save/load path for the trained adapter weights.
adapter_path: "adapters"
Once the adapter files were generated after fine-tuning, we evaluated the model with a script like
from mlx_lm import load, generate

model, tokenizer = load(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    adapter_path="adapters",  # path to the newly trained adapter
)
text = "Tell sth about New York"
response = generate(model, tokenizer, prompt=text, verbose=True, temp=0.01, max_tokens=100)
and it worked as expected.
However, after we fused the model, saved it, and evaluated it with mlx_lm.generate, it performed poorly; the behavior was completely different from calling generate(model, tokenizer, prompt=text, verbose=True, temp=0.01, max_tokens=100) in Python. The commands we used were:
mlx_lm.fuse --model "deepseek-ai/deepseek-coder-6.7b-instruct" --adapter-path "adapters" --save-path new_model
mlx_lm.generate --model new_model --prompt "Tell sth about New York" --adapter-path "adapters" --temp 0.01
Once you fuse the model, you don't want to specify the adapter path; otherwise it will try to apply the adapters to an already fused model (which is a bug).
Try using:
mlx_lm.generate --model new_model --prompt "Tell sth about New York" --temp 0.01
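Equivalently, in Python, load the fused directory without an adapter_path. A minimal sketch using the same load/generate calls as your evaluation script (and assuming the same mlx_lm version, where generate still accepts temp):
from mlx_lm import load, generate

# The fused model already has the LoRA weights merged in, so no adapter_path here.
model, tokenizer = load("new_model")
response = generate(model, tokenizer, prompt="Tell sth about New York",
                    verbose=True, temp=0.01, max_tokens=100)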
Also, fusing can cause some degradation. The adapted weights are W = W + scale * b^T a. When you fuse b^T a into W, it can be destructive if the adapter (b^T a) has a very different magnitude than the base weights (W), particularly when using quantized or low-precision base weights. Tuning the scale parameter can improve the model's performance after fusion.
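One quick sanity check is to compare the magnitude of the fused update against the base weight it gets merged into. A rough stand-alone NumPy sketch (the shapes, initialization, and scale value below are illustrative stand-ins, not mlx_lm internals):
import numpy as np

d_out, d_in, rank = 4096, 4096, 8
scale = 20.0                                   # LoRA scale from the training config

W = (np.random.randn(d_out, d_in) * 0.02).astype(np.float16)   # stand-in base weight
a = (np.random.randn(rank, d_in) * 0.01).astype(np.float32)    # LoRA matrix a
b = (np.random.randn(rank, d_out) * 0.01).astype(np.float32)   # LoRA matrix b

update = scale * (b.T @ a)                     # the scale * b^T a term fused into W

print("||W||             =", np.linalg.norm(W.astype(np.float32)))
print("||scale * b^T a|| =", np.linalg.norm(update))
# If the update's norm is comparable to (or larger than) ||W||, merging it into a
# low-precision or quantized W can clobber information; a smaller scale reduces that risk.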