Tags: c#, large-language-model, ollama

Analysing documents with LLaVA on Ollama not working


I am currently testing LLaVA for use in document-understanding tasks. I found some promising results in scientific papers and on some websites. I installed the model in Ollama (on Windows) and tried accessing it with the C# code below.

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public class Program
{
    // Reuse a single HttpClient for all requests to the local Ollama server.
    private static readonly HttpClient client = new HttpClient();
    private static string? imageBase64;

    static async Task Main(string[] args)
    {
        Console.WriteLine("Welcome to the Document Analysis Application!");

        while (true)
        {
            Console.Write("Enter the path to the image file (or 'exit' to quit): ");
            string imagePath;
            do
            {
                imagePath = Console.ReadLine() ?? "";
            } while (String.IsNullOrEmpty(imagePath));

            if (imagePath.ToLower() == "exit")
                break;

            if (!File.Exists(imagePath))
            {
                Console.WriteLine("File not found. Please try again.");
                continue;
            }

            imageBase64 = Convert.ToBase64String(File.ReadAllBytes(imagePath));
            Console.WriteLine("Image loaded successfully.");

            while (true)
            {
                Console.Write("Enter your question about the document (or 'new' for a new image, 'exit' to quit): ");
                string question;
                do
                {
                    question = Console.ReadLine() ?? "";
                } while (String.IsNullOrEmpty(question));

                if (question.ToLower() == "new")
                    break;
                if (question.ToLower() == "exit")
                    return;

                Console.WriteLine("Response:");
                _ = await AnalyzeDocument(question);
                Console.WriteLine("\nEnd of response.");
            }
        }
    }

    static async Task<string> AnalyzeDocument(string question)
    {
        // Ollama's /api/generate takes a prompt plus base64-encoded images;
        // stream = true returns the answer as newline-delimited JSON chunks.
        var requestBody = new
        {
            model = "llava:13b-v1.6",
            prompt = $"Analyze this invoice image carefully. Pay close attention to all numerical values, especially totals and subtotals. If the question is about a total or sum, make sure to double-check your calculation. After your analysis, provide a clear, concise answer to this specific question: {question}",
            images = new[] { imageBase64 },
            stream = true
        };

        var content = new StringContent(JsonSerializer.Serialize(requestBody), Encoding.UTF8, "application/json");

        try
        {
            HttpResponseMessage response = await client.PostAsync("http://localhost:11434/api/generate", content);
            response.EnsureSuccessStatusCode();

            using (var reader = new StreamReader(await response.Content.ReadAsStreamAsync()))
            {
                StringBuilder fullResponse = new StringBuilder();
                string? line;
                while ((line = await reader.ReadLineAsync()) != null)
                {
                    if (string.IsNullOrWhiteSpace(line)) continue;

                    try
                    {
                        using (JsonDocument doc = JsonDocument.Parse(line))
                        {
                            JsonElement root = doc.RootElement;
                            if (root.TryGetProperty("response", out JsonElement responseElement))
                            {
                                string responsePart = responseElement.GetString() ?? "";
                                fullResponse.Append(responsePart);
                                Console.Write(responsePart); // Print each part as it's received
                            }
                            if (root.TryGetProperty("done", out JsonElement doneElement) && doneElement.GetBoolean())
                            {
                                break;
                            }
                        }
                    }
                    catch (JsonException)
                    {
                        Console.WriteLine($"Failed to parse JSON: {line}");
                    }
                }
                return fullResponse.ToString();
            }
        }
        catch (HttpRequestException e)
        {
            return $"Error: {e.Message}";
        }
    }
}

Unfortunately, the results are really bad and mostly hallucinations. Sometimes the LLM complains that it would need a clearer view of the document to answer the questions. I tried downscaling the image (along the lines of the sketch below), but it still did not work. Is there maybe a way to process the image in multiple chunks?
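
For context, the downscaling before base64-encoding can be done along these lines (a sketch assuming the Windows-only System.Drawing.Common package; the 1024 px cap is an arbitrary choice):

using System.Drawing;          // System.Drawing.Common NuGet package, Windows-only on modern .NET
using System.Drawing.Imaging;

// Downscale so the longest side is at most maxSide pixels, then re-encode as PNG.
static string LoadDownscaledBase64(string path, int maxSide = 1024)
{
    using var original = new Bitmap(path);
    int longest = Math.Max(original.Width, original.Height);
    if (longest <= maxSide)
        return Convert.ToBase64String(File.ReadAllBytes(path)); // already small enough

    double scale = (double)maxSide / longest;
    using var resized = new Bitmap(original, new Size((int)(original.Width * scale), (int)(original.Height * scale)));
    using var ms = new MemoryStream();
    resized.Save(ms, ImageFormat.Png);
    return Convert.ToBase64String(ms.ToArray());
}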


Solution

  • I cannot comment yet, so I will answer instead.

    I have tried using LLaVA on Policies.pdf and similar documents for a RAG FAQ chatbot, and that multimodal model said the same thing to me: either the image is not good enough, or it just takes a guess. It is able to understand images in general, like dogs or graphs (in my case it would tell me the image is a graph but could not give details about it). Even after converting the PDF to an image and running it through an upscaler, it still said the same thing, although it could then identify more text in the image.

    Ultimately, to meet some weird deadline, I threw out that model and implemented pre-processing instead: PDF to image to OCR. Yes, this makes LLaVA redundant here, but you can use other models to process the text output of the OCR and achieve the same "Document Understanding" task (see the sketches at the end of this answer).

    Another thing: PDF documents are a pain. Some of them are all text and can be scraped easily with the PDF packages that are already available, but some contain images, or are entirely scanned images of the actual printed document. That is where image-to-OCR will be of use to you. You can try other vision models or OCR packages, but of everything I have tested so far, winocr works best for image text extraction (yes, it uses the same OCR engine as the Windows Snipping Tool).
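
    winocr is a Python package, but as far as I know it wraps the same built-in engine that C# can reach directly through the Windows.Media.Ocr WinRT API. A minimal extraction sketch (assuming a net8.0-windows10.0.19041.0 target or the Microsoft.Windows.SDK.Contracts package):

    using Windows.Graphics.Imaging;
    using Windows.Media.Ocr;
    using Windows.Storage;
    using Windows.Storage.Streams;

    // Extract text from an image file with the built-in Windows OCR engine.
    static async Task<string> ExtractTextAsync(string imagePath)
    {
        StorageFile file = await StorageFile.GetFileFromPathAsync(imagePath);
        using IRandomAccessStream stream = await file.OpenAsync(FileAccessMode.Read);
        BitmapDecoder decoder = await BitmapDecoder.CreateAsync(stream);
        using SoftwareBitmap bitmap = await decoder.GetSoftwareBitmapAsync();

        // Uses whichever OCR language packs are installed in the user's profile.
        OcrEngine engine = OcrEngine.TryCreateFromUserProfileLanguages()
            ?? throw new InvalidOperationException("No OCR language pack installed.");

        OcrResult result = await engine.RecognizeAsync(bitmap);
        return result.Text; // result.Lines keeps the per-line layout if you need it
    }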
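
    From there, the OCR text can be fed to a plain text model over the same /api/generate endpoint the question already uses (a sketch reusing the question's usings; "llama3" is just a placeholder for whatever text model you have pulled):

    // Ask a text-only model about the OCR output, reusing Ollama's /api/generate endpoint.
    static async Task<string> AskAboutTextAsync(HttpClient client, string ocrText, string question)
    {
        var requestBody = new
        {
            model = "llama3", // placeholder: any text model available in your Ollama install
            prompt = $"Here is the OCR text of a document:\n{ocrText}\n\nQuestion: {question}",
            stream = false    // one JSON object instead of a stream, to keep the sketch short
        };

        var content = new StringContent(JsonSerializer.Serialize(requestBody), Encoding.UTF8, "application/json");
        HttpResponseMessage response = await client.PostAsync("http://localhost:11434/api/generate", content);
        response.EnsureSuccessStatusCode();

        using JsonDocument doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("response").GetString() ?? "";
    }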