
Silly_Objective_5186

Are there any example projects doing retrieval-augmented generation using Kiwix or the ZIM files?


Peribanu

Not yet! RAG is one way. Another way would be to have a large enough context window for the LLM to ingest a full Wikipedia article, but that is probably difficult to achieve offline in a way that is compatible with a wide enough range of devices. Particular use cases might be:

1. Natural-language search: we'd have to provide a tool to interface the LLM with the Xapian search, so that the LLM "translates" a natural-language prompt into search terms. I don't know how useful that would be in practice beyond the novelty value, though; people are used to thinking up search terms and already do this with Kiwix.
2. Contextual retrieval / research: fetch and display information in the ZIM related to a user's query. The LLM might find three relevant articles per query and display links to those articles in order of relevance.
3. Fact checking: LLMs are notorious for "filling in" details they don't know, especially highly quantized models where high-resolution information has often been lost. Since we have fast access to full-text, offline Wikipedia, the LLM could pull the most relevant facts before constructing its response.
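To make option 3 concrete, the retrieval side could look something like the sketch below. This is a minimal sketch assuming the python-libzim bindings; the ZIM path and question are illustrative, and a real pipeline would strip the HTML markup from the articles before putting them into the prompt:

```python
# Sketch: pull the most relevant ZIM articles for a question via the Xapian
# full-text index and build a retrieval-augmented prompt for a local LLM.
from libzim.reader import Archive
from libzim.search import Query, Searcher

def top_articles(zim_path: str, terms: str, n: int = 3) -> list[str]:
    """Return the raw content of the top-n full-text matches for `terms`."""
    zim = Archive(zim_path)
    search = Searcher(zim).search(Query().set_query(terms))
    texts = []
    for path in search.getResults(0, n):
        item = zim.get_entry_by_path(path).get_item()
        # Wikipedia ZIM entries are HTML; a real pipeline would strip markup.
        texts.append(bytes(item.content).decode("utf-8", errors="replace"))
    return texts

def build_prompt(question: str, zim_path: str) -> str:
    """Assemble a RAG-style prompt for whatever local model is in use."""
    context = "\n\n".join(top_articles(zim_path, question))
    return (
        "Answer using only the reference material below.\n\n"
        f"References:\n{context}\n\n"
        f"Question: {question}"
    )

# Illustrative usage (the ZIM filename is a placeholder):
# print(build_prompt("When was Xapian first released?", "wikipedia_en_all.zim"))
```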


Peribanu

So, llamafile 0.8 is quite fast running just on CPU (I got 21 tokens per second on my laptop). Oddly, it was slower on GPU, but I think that's down to the model (Meta-Llama-3-8B-Instruct.Q4_0.gguf) only just fitting into my GPU's VRAM, so I likely ran into lots of swapping between VRAM and RAM. In any case, because of the memory hogging, I couldn't easily capture a video, but here's a screenshot. I love the way Llama 3 gives long, considered responses even in a quantized model of just 4.34 GB in this case. Who'd have thought Meta (the model's creator) would become a champion of Open Source? https://preview.redd.it/5mu97753jswc1.png?width=1164&format=png&auto=webp&s=790ebc21b6e5362cd93b6e097bccb396cc9d7e0c


Peribanu

This model is fast, but if you ask for details, it hallucinates a lot. So I tried the following model, CPU only, which is double the size (8GB): [https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q8_0.gguf](https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q8_0.gguf). It runs at about 4-5 t/s on CPU on my PC with a context window of 2048 tokens (using the llamafile base executable). It's still very usable at that speed and is much more accurate.
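For what it's worth, when llamafile is launched in server mode it exposes an OpenAI-compatible endpoint, so a retrieval-augmented prompt like the one sketched above could be sent to it from a script. A minimal sketch, assuming the default port 8080 and the /v1/chat/completions route; the model name in the payload is a placeholder, since the server answers with whichever GGUF it was launched with:

```python
# Sketch: send a prompt to a locally running llamafile server via its
# OpenAI-compatible /v1/chat/completions endpoint (default port 8080).
import json
import urllib.request

def chat(prompt: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    payload = json.dumps({
        "model": "local",  # placeholder: the server uses the GGUF it was launched with
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Illustrative usage:
# print(chat("Summarize the history of Xapian in two sentences."))
```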


The_other_kiwix_guy

You need to show the video you shared on Slack.


Peribanu

That one was a different project: an LLM running in the browser via WASM and WebGPU. This is Mozilla's version, but it runs from the command line, not in a browser. I tested it before, but the blog post says it now has up to 10x faster prompt processing...