Broadly speaking, to generate an answer, an LLM uses “memorization”, performs “research” using the web, or “thinks” by breaking a task into steps.

“Memorization” is a compressed approximation of the training dataset, similar to a person's vague memories of a subject. It suffices for common or generic inquiries. But if a problem is niche, the training data is outdated, or the subject has high variance, the model, like a human, must do “research” by searching and scanning web pages.

Humans determine their own stopping criteria for research; LLMs rely on a probabilistic one (an internal implementation detail) while being constrained by cost and scale. These bounds include limits on query counts, the number of pages scanned, page size, context window size, and frequency caps (e.g., daily or session-based). They vary with pricing plan, model, current load, and other factors, often independently of the complexity of the underlying problem.
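As a rough illustration, these bounds can be thought of as a fixed research budget. The sketch below is hypothetical; the field names and numbers are placeholders, not any provider's actual limits.

```python
from dataclasses import dataclass

@dataclass
class ResearchBudget:
    # All values are illustrative placeholders, not real provider limits.
    max_queries: int = 5                  # search queries per task
    max_pages: int = 10                   # pages fetched and scanned
    max_page_chars: int = 20_000          # each page truncated to this size
    context_window_tokens: int = 128_000  # total context available to the model
    sessions_per_day: int = 50            # frequency cap tied to the pricing plan

def within_budget(queries_used: int, pages_used: int, budget: ResearchBudget) -> bool:
    """Research stops once any hard limit is hit, regardless of task complexity."""
    return queries_used < budget.max_queries and pages_used < budget.max_pages
```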

In “research” mode, an LLM resembles an undergrad: it fires off a few search queries and compiles an answer from the top results. If those results look similar enough, the model stops. This can be misleading. A flood of historical content may hide recent findings, or the information may come from non-experts. The proposed solution, while plausible based on a limited search, may be wrong.
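A minimal sketch of that “undergrad” loop, assuming a generic search(query) callable that returns result URLs; the overlap-based stopping rule is my guess at the “results look similar enough” heuristic, not a documented algorithm.

```python
from typing import Callable

def shallow_research(queries: list[str],
                     search: Callable[[str], list[str]],
                     overlap_threshold: float = 0.8) -> list[str]:
    """Fire off a few queries and stop early once new results mostly repeat
    what has already been seen ("results look similar enough")."""
    seen: set[str] = set()
    collected: list[str] = []
    for query in queries:
        results = search(query)
        if not results:
            continue
        new = [r for r in results if r not in seen]
        collected.extend(new)
        seen.update(results)
        overlap = 1.0 - len(new) / len(results)
        if overlap >= overlap_threshold:
            break  # top results converge, so the model stops -- possibly too early
    return collected

# Toy stand-in for a real search backend; returns canned URLs.
def fake_search(query: str) -> list[str]:
    return [f"https://example.com/{abs(hash(query)) % 100}/{i}" for i in range(5)]

print(shallow_research(["llm web search limits", "llm research budget"], fake_search))
```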

In contrast, a PhD student performs a comprehensive search, exhausts sources, deduces solutions from partial data, asks clarifying questions, and runs experiments.

The LLM’s “thinking” (aka reasoning) mode is intended to mimic a PhD student. The problem-solving process is split into steps, and each step typically requires its own “research” pass. This brings us back to the limits of the LLM’s web search.
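A sketch of that step-wise pattern, with plan, research, and synthesize supplied as placeholder callables; the real orchestration inside commercial products is not public, so this is only an illustration of the structure.

```python
from typing import Callable

def think_and_research(task: str,
                       plan: Callable[[str], list[str]],       # break the task into steps
                       research: Callable[[str], str],         # one web-search pass per step
                       synthesize: Callable[[list[str]], str]  # combine the findings
                       ) -> str:
    """Each planned step triggers its own research pass, so every step is also
    subject to the web-search limits described above."""
    steps = plan(task)
    findings = [research(step) for step in steps]
    return synthesize(findings)

# Toy usage with stubbed-out components.
answer = think_and_research(
    "compare two database engines",
    plan=lambda task: [f"{task}: step {i}" for i in range(3)],
    research=lambda step: f"findings for {step}",
    synthesize=lambda findings: " | ".join(findings),
)
print(answer)
```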

Anecdotally, while working on a problem that required extensive web search, independent sessions (thinking + research) run within the same LLM did not always converge to the same conclusion. Moreover, by exploring more pages from the sites cited in the reference links, I was often able to catch an incorrect conclusion. I observed this with ChatGPT 5.0 and Gemini 3 Pro Preview.

The quality of an LLM’s “research” is a major determinant of its overall performance. It depends on meaningful query construction (LLM quality), information discoverability (search engine quality), result ranking (search engine quality), availability of page snippets (search engine quality), the ability to extract text chunks from a page (LLM agent/orchestration quality), and the ability to create a page summary (LLM quality).
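Sketched as a pipeline, with each stage mapped to the component whose quality it depends on; the function names are illustrative, not a real orchestration API.

```python
from typing import Callable

def research_pass(question: str,
                  build_query: Callable[[str], str],     # query construction (LLM)
                  search: Callable[[str], list[str]],    # discoverability, ranking, snippets (search engine)
                  extract: Callable[[str], str],         # text-chunk extraction (agent/orchestration)
                  summarize: Callable[[list[str]], str]  # page summarization (LLM)
                  ) -> str:
    """A weak stage anywhere in this chain degrades the final answer."""
    query = build_query(question)
    pages = search(query)
    chunks = [extract(page) for page in pages]
    return summarize(chunks)
```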

Thus, advanced LLM performance depends heavily on the strength of the web search engine. And while there are several top-tier LLM creators, high-quality search providers are scarce. Google Search is the global leader, followed by Microsoft Bing, with Baidu and Yandex as major regional players.

The critical question is: Do these search engines have private data, functionality, or expertise that could give an edge to their own, or closely partnered, LLMs? The edge would come not only from better search quality, but also from lower operational costs.

Examples might include search-engine-aided query construction, direct communication between the LLM and the search engine without natural language translation, access to ranking internals, pre-tokenized snippets/summaries, or more effective page processing and text chunk generation.
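Purely as speculation, such a privileged interface might look like the sketch below, contrasted with the plain text-in/text-out contract available to outside LLMs; none of these fields correspond to a real product API.

```python
from dataclasses import dataclass

@dataclass
class PrivilegedSearchResult:
    # Hypothetical fields a closely partnered LLM might receive directly.
    url: str
    rank_score: float             # raw ranking internals instead of an ordered list
    snippet_token_ids: list[int]  # snippet pre-tokenized in the model's vocabulary
    summary_token_ids: list[int]  # pre-computed page summary, skipping fetch and chunking

@dataclass
class PublicSearchResult:
    # What an outside LLM typically gets: text it must fetch, chunk, and tokenize itself.
    url: str
    snippet_text: str
```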

As models shift from “memorization” to “research,” the quality of search and the related costs will directly shape the next generation of LLMs.