Improved Web Search with Local LLM, SearxNG and Deno 2.0
Many are pointing out that Google Search (with ~90% market share) has declined in quality and that it is much harder to find accurate information. Some are finding alternatives in more privacy-oriented (as in less tracking and ad targeting) search engines like DuckDuckGo, Brave, Qwant, Startpage… with each of them having potential downsides of their own.
On the other hand, GenAI solutions came with the ability to give a lot of information about any topic, but of questionable accuracy. Those LLMs only have knowledge up to the date of the data they were trained on, which is normally quite outdated. To combat this limitation, companies have built functionality to fetch data from online sources and then let the LLM answer the original query based on what it found in that data. This is better known as Retrieval-Augmented Generation, or RAG for short, and can be seen in search engines these days, which also provide a summary of search results.
This should provide the best results, as we can fetch relevant, fresh web content which the LLM can then ‘reason’ about.
And this is exactly what we are going to do, but instead of using fully closed solutions, we are going to build one ourselves so we can decide what we think is most accurate and relevant for us, while also keeping our data as private as possible.
For a list of similar solutions, with UIs, see Awesome AI Web Search.
Web search using SearxNG
To quote their Github: SearXNG is a free internet metasearch engine which aggregates results from various search services and databases. Users are neither tracked nor profiled.
What this also means is that we don’t have to integrate with a specific search provider using their specific API (even if free — e.g. Brave API for AI) but can use a unified JSON API instead, while enabling any of the 210 supported engines.
Note: many search providers have their own APIs and requirements when it comes to AI usage of data, so make sure to understand those.
Additionally, SearxNG will rank the results based on how many search engines return the same entries. This means we don’t have to depend only on Google or Bing but can rather use a personal selection, including engines that don’t rely on the Google or Bing index.
To run it we are going to utilize the official Docker image with docker compose and a dedicated Caddy SSL proxy behind Cloudflare, as shown in previous articles, so that we can use it on our own domain.
Now we can start searching the web using multiple search engines of our choice through a single web interface.
However, this still requires one to open the underlying pages, check if there is an answer or information on the wanted topic, close it, then check the next result, and so on. Wouldn’t it be nice if we could leverage someone else to do this instead?
The AI Brain using Local LLM
Running a local LLM has never been easier, no matter the platform you are on or the hardware. Obviously, the better the hardware, the better the performance and the quality of the model that can be used, but new small models are released very frequently and they can be enough when all we need is reasoning over a specific topic.
Based on good experience with an RX 6900 XT GPU, we are using LM Studio in this case. It’s great for beginners due to the GUI, but it can also run in headless mode with an OpenAI-compatible server, which is what we are going to utilize.
For the model we are using Llama 3.2, as it’s very fast with a long context (128k tokens possible, 100k configured in our setup), which is exactly what we need to feed it a lot of information from internet pages.
For the actual setup, check the official docs; it’s very straightforward.
Make sure to install ROCm extension for best performance on AMD GPUs.
Search Agent with Deno 2.0
After we have LLM running and Search working, we are ready to combine the two.
To do that we are going to utilize Deno 2.0, which was just made available. For those who are new to it, Deno is a JavaScript/TypeScript runtime written in Rust by the creator of Node.js, and it allows deploying secure backend applications, technically improving on and replacing Node.js.
To get things going we can install it on Linux following the official install docs.
One of the additions coming with the 2.x release is much better compatibility with the existing Node/NPM ecosystem, which means access to millions of NPM packages. This helps, as there is no longer a need to always search for Deno-specific libraries.
So today we are going to use regular NPM packages: to connect to the local LLM using the OpenAI library — deno install npm:openai
and parse HTML content easily with Cheerio — deno install npm:cheerio
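To sketch how the two connect, here is a minimal client pointed at LM Studio’s OpenAI-compatible server. The base URL and placeholder API key below assume LM Studio’s default local server settings, so adjust them to your setup:

import OpenAI from "npm:openai"

// LM Studio's headless mode exposes an OpenAI-compatible server,
// commonly reachable at http://localhost:1234/v1; the key is not checked locally
const llm = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
})

The same client is reused in the later snippets whenever we call the model.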
Running as a single binary
This project is intentionally made as a simple CLI application, as there are enough web projects available. Adding a REST API to Deno is simple and is left as an exercise for the reader. For us, we are just going to compile it into a single binary that can be used easily anywhere.
Deno makes this very simple.
deno compile --allow-net --allow-env -o WebSearchAgent app.ts
This results in a WebSearchAgent binary which has the expected permissions built in, preventing access to unexpected APIs.
After we are finished with the project, getting answers from the CLI will be as simple as WebSearchAgent "Who won the 2024 US presidential election?".
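As a minimal sketch, the entry point of app.ts can read that question straight from the command-line arguments:

// app.ts: take the whole command line as the question
const question = Deno.args.join(" ")
if (!question) {
  console.error("Usage: WebSearchAgent <question>")
  Deno.exit(1)
}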
And you will be able to get a correct answer!
RAG
This is where different RAG approaches come into play: storing and later retrieving the data using some form of semantic search and result ranking through the usage of embedding models. However, that is usually done because there is no existing search over the wanted data (think of your internal documents), while in our case we already have that through the best search engines.
There isn’t a globally agreed standard on the names of these approaches, but some that can be found across the community are: Naive RAG, Standard RAG, Simple RAG, Complex RAG, Advanced RAG, Hybrid RAG, Modular RAG, Contextual RAG, Speculative RAG,… to name a few.
One thing that could be beneficial for us is additionally processing search results before storing them. We could use specific models to extract the most relevant data for later retrieval. After all, articles published on the internet are full of human language that isn’t needed when we just want accurate data. Additionally, we could instruct the LLM to extract text directly from the HTML and skip using Cheerio altogether.
We can also refine the user input so it better fits the search engine. Trading speed for accuracy, we can leverage the LLM to transform the user query into just the parts needed for the web search. If we do that, a user query like “Who won 2024 US election?” turns into a search for just “US 2024 election winner”. This is commonly seen in commercial solutions. In simple cases it might not mean much, but in more complicated ones, where we want to use natural language and don’t want to deal with the nuances of search engines, this can help a lot.
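A minimal sketch of such a refinement step could look like this, reusing the llm client from earlier; the model name and prompt wording are assumptions, not the exact ones from the project:

// Ask the LLM to compress a natural-language question into search keywords
const refined = await llm.chat.completions.create({
  model: "llama-3.2-3b-instruct", // whichever model is loaded in LM Studio
  temperature: 0.1,
  messages: [
    { role: "system", content: "You rewrite questions into short web search queries." },
    { role: "user", content: `Rewrite as search keywords only: '${question}'` },
  ],
})
const searchQuery = refined.choices[0].message.content ?? question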
We can also change how we process the context data from the web searches. Currently we try to fit as many search results as possible into the context length we know our model can process. A different approach would be to process each result one by one and check whether that result answers our input query. This way the processing will be much faster and we will get results sooner. This is perfect in cases where we know that the results are going to give facts. However, given that less context is provided to the LLM overall, on some topics this may mean less accurate results, as they will come from just one point of view.
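As a rough sketch of that one-by-one approach, iterating over the results array we get back from SearxNG in the retrieval section below (extractText is a hypothetical helper wrapping the fetch and Cheerio steps shown later, and the ‘NO ANSWER’ convention is just an illustration):

// Check each search result individually and stop at the first one that answers the question
for (const r of results) {
  const text = await extractText(r.url)
  const check = await llm.chat.completions.create({
    model: "llama-3.2-3b-instruct",
    temperature: 0.1,
    messages: [
      { role: "system", content: "You are a helpful search assistant" },
      { role: "user", content: `Answer '${question}' based only on this text, or reply NO ANSWER: '${text}'` },
    ],
  })
  const answer = check.choices[0].message.content ?? ""
  if (!answer.includes("NO ANSWER")) {
    console.log(answer)
    break
  }
}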
One of the many open-source solutions that performs a web search, stores the results as documents in a vector database, and later searches them (optionally re-ranking) is OpenWebUI with SearxNG, so feel free to explore that approach further.
Retrieval
In our case data retrieval comes almost for free. We are able to pass the input query directly to SearxNG and get the relevant (and ranked) websites back in structured JSON.
const result = await fetch(`https://search.xyz/search?q=${encodeURIComponent(q)}&format=json`)
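The JSON response carries a results array where each entry includes fields like url, title and content, so extracting the URLs is straightforward:

// Each SearxNG result entry contains url, title and content (snippet) fields
const { results } = await result.json()
const urls = results.map((r: { url: string }) => r.url)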
Then comes the step of actually getting the content from those websites based on result.url.
Note: make sure to follow robots.txt guidelines for each website
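A very simplified sketch of such a check could look like the following; it only handles Disallow rules under ‘User-agent: *’ and ignores most of the robots.txt spec, so a dedicated parser is preferable for real use:

// Naive robots.txt check: respects only Disallow entries under 'User-agent: *'
async function isAllowed(url: string): Promise<boolean> {
  const target = new URL(url)
  const res = await fetch(`${target.origin}/robots.txt`)
  if (!res.ok) return true
  let applies = false
  for (const line of (await res.text()).split("\n")) {
    const [key, value] = line.split(":").map((part) => part.trim())
    if (key?.toLowerCase() === "user-agent") applies = value === "*"
    if (applies && key?.toLowerCase() === "disallow" && value && target.pathname.startsWith(value)) {
      return false
    }
  }
  return true
}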
The simplest and fastest approach is to use one of the built-in HTTP capabilities of Deno, like the web-standard Fetch API. Passing the response to Cheerio, we are able to extract the relevant HTML content as text.
import * as cheerio from "npm:cheerio"

// 'headers' can carry e.g. a custom User-Agent for the request
const page = await fetch(url, { headers })
const $ = cheerio.load(await page.text())
const text = $("html").find("p").text()
It’s fast, but the disadvantage of this approach is that it doesn’t work with JavaScript-dependent websites like Single Page Applications. There are some possible solutions to that, like JSDOM, with disadvantages of their own.
Better, but slower, might be to use a full headless browser that can render the pages as intended. Puppeteer or Playwright are common solutions to this, but specifically for Deno we have an even simpler one with Astral.
import { launch } from "https://deno.land/x/astral/mod.ts"

const browser = await launch()
const page = await browser.newPage(url)
const html = await page.evaluate(() => {
  return document.body.innerHTML
})
const $ = cheerio.load(html)
const text = $("html").find("p").text()
Now, to actually process the results, we have to make some decisions. The reason is that the LLM context is not unlimited. With our setup we are supporting 100k tokens, which means we can process around 80k words. This should be enough to get content from multiple pages and send it to the LLM to check if the answer is in there.
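As a rough sketch of keeping the combined page text under that budget (pageTexts is a hypothetical array of the extracted texts, and the four-characters-per-token ratio is only an approximation):

// Roughly budget the context: ~100k tokens at ~4 characters per token
const MAX_CHARS = 100_000 * 4
let input = ""
for (const text of pageTexts) {
  if (input.length + text.length > MAX_CHARS) break
  input += text + "\n\n"
}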
Generation
If you follow the LLM space, there is a high chance you have heard about prompt engineering. It’s a whole area on its own, and it is where most of the work happens after the initial setup is done.
For us, we will use this fairly simple prompt:
{ role: 'system', content: "You are a helpful search assistant" },
{ role: 'user', content: `Answer '${question}' based on following text: '${input}'`},
Our prompt instructs the LLM to respond to the user input from the information of all search results. With this, our LLM isn’t dependent only on its training data but can access and ‘reason’ about any recently published information.
There is much more we could do here to instruct the LLM to give us the best results, but that is beyond this article. It is something that should be evaluated and adapted continuously as the LLM is being used.
The last thing worth mentioning is the temperature parameter, which we set to 0.1, as we don’t really want a lot of randomness and creativity here.
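Putting it all together, the generation call against the local server looks roughly like this, again reusing the llm client from earlier; the model name is whatever is loaded in LM Studio:

const completion = await llm.chat.completions.create({
  model: "llama-3.2-3b-instruct",
  temperature: 0.1, // keep answers factual rather than creative
  messages: [
    { role: "system", content: "You are a helpful search assistant" },
    { role: "user", content: `Answer '${question}' based on following text: '${input}'` },
  ],
})
console.log(completion.choices[0].message.content)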
The future
When it comes to search, there are a few good exercises left for the reader.
Language is one of those, so set it up based on how you prefer searches to work. SearxNG does have an ‘auto’ option that detects which language to search in, but maybe that is not always wanted.
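If you want to pin it instead, SearxNG’s search API accepts a language parameter, so a sketch of the search request could be:

// Force English results instead of relying on automatic language detection
const result = await fetch(
  `https://search.xyz/search?q=${encodeURIComponent(q)}&format=json&language=en`,
)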
Secondly, a very interesting piece of information to give users is the actual list of sources used in the answer. In the context data we can include the source URL before each related paragraph, and the prompt should then be changed so it also returns the sources of the paragraphs it used.
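A sketch of that could prefix every paragraph with its source before it reaches the model (pages here is a hypothetical array of { url, text } objects built from the retrieval step):

// Prefix each extracted text with its source URL so the model can cite it back
const input = pages
  .map((page) => `Source: ${page.url}\n${page.text}`)
  .join("\n\n")

const messages = [
  { role: "system", content: "You are a helpful search assistant. List the 'Source:' URLs you used." },
  { role: "user", content: `Answer '${question}' based on following text: '${input}'` },
]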
Finally, as we mentioned, there are many steps available that can improve these results, trading speed for accuracy. We can improve the search input, check that page results are valid, or repeat the search multiple times. And as the LLM space evolves rapidly, there are many more possible improvements waiting around the corner.
Equally, it’s important to note that this is all experimental. Should you be relying on an LLM to provide you factual information? Not really. There are still a lot of unknowns when it comes to output generation, especially with smaller LLM models like the one we are using here. You cannot be sure it really used the original sources, or whether it still made something up on its own. When in doubt, it’s best to verify manually yourself.
Check out the full source here!
Note: not a single word of this article has been written using AI ;)