PPC – Quality rater and algorithmic evaluation systems: Are major changes coming?


Crowd-sourced human quality raters have been the mainstay of the algorithmic evaluation process for search engines for decades. Still, a potential sea-change in research and production implementation could be on the horizon. 

Recent groundbreaking research by Bing (with some purported commercial implementation already) and a sharp uptick in closely related information retrieval research by others indicate some big shake-ups are coming.

These shake-ups may have far-reaching consequences for both the armies of quality raters and potentially the frequency of algorithmic updates we see go live, too. 

The importance of evaluation

Alongside crawling, indexing, ranking and result serving, evaluation is an important process for search engines.

How well does a current or proposed search result set or experimental design align with the notoriously subjective notion of relevance to a given query, at a given time, for a given search engine user’s contextual information needs?

Since we know relevance and intent for many queries are always changing, and how users prefer to consume information evolves, search result pages also need to change to meet both the searcher’s intent and preferred user interface. 

Some query intent shifts are predictable, temporal and periodic. For example, in the period approaching Black Friday, many queries typically considered informational might take sweeping commercial intent shifts. Similarly, a transport query like [Liverpool Manchester] might shift to a sports query on local derby match days.

In these instances, an ever-expanding legacy of historical data supports a high probability of what users consider more meaningful results, albeit temporarily. These levels of confidence likely make seasonal or other predictable periodic results and temporary UI design shifting relatively straightforward adjustments for search engines to implement.

However, when it comes to broader notions of evolving “relevance” and “quality,” and for experimental design changes too, search engines must know that a change in rankings proposed by search engineers is truly better, and more precisely matched to information needs, than the results currently generated.

Evaluation is an important stage in search results evolution and vital to providing confidence in proposed changes – and substantial data for any adjustments (algorithmic tuning) to the proposed “systems,” if required. 

Evaluation is where humans “enter the loop” (offline and online) to provide feedback in various ways before roll-outs to production environments.

This is not to say evaluation is not a continuous part of production search. It is. However, an ongoing judgment of existing results and user activity will likely evaluate how well an implemented change continues to fare in production against an acceptable relevance (or satisfaction) based metric range, one based on the initial human judge-submitted relevance evaluations.

In a 2022 paper titled, “The crowd is made of people: Observations from large-scale crowd labelling,” Thomas et al., who are researchers from Bing, allude to the ongoing use of such metric ranges in a production environment when referencing a monitored component of web search “evaluated in part by RBP-based scores, calculated daily over tens of thousands of judge-submitted labels.” (RBP stands for Rank-Biased Precision).
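
As a rough illustration of the metric named in that quote, rank-biased precision can be sketched in a few lines of Python. The formula follows the standard definition, RBP = (1 - p) * sum of r_i * p^(i - 1) over ranks i, where r_i is the relevance at rank i and p is a "persistence" parameter. The persistence value of 0.8 and the example labels are assumptions chosen for illustration, not values taken from the Bing paper.

```python
def rank_biased_precision(relevance, persistence=0.8):
    """Rank-biased precision for one ranked list.

    relevance: relevance values (0 or 1 here) in ranked order, top result first.
    persistence: assumed probability that a user moves on to the next result.
    """
    # enumerate() starts at 0, so the exponent is already (rank - 1).
    return (1 - persistence) * sum(
        rel * persistence ** position for position, rel in enumerate(relevance)
    )


# Hypothetical judge labels for the top five results of one query.
labels = [1, 0, 1, 0, 0]
score = rank_biased_precision(labels, persistence=0.8)
print(round(score, 3))
```

A lower persistence value concentrates the score on the top results, while a higher one gives more credit to relevant results further down the page.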

Human-in-the-loop (HITL)

Data labels and labeling

An important point before we continue. I will mention labels and labeling a lot throughout this piece, and a clarification about what is meant by labels and labeling will make the rest of this article much easier to understand.

Here are a couple of real-world examples most people will be familiar with:

  • Have you ever checked a Gmail account and marked something as spam?
  • Have you ever marked a film on Netflix as “Not for me,” “I like this,” or “love this”?

All of these submitted actions by you create data labels used by search engines or in information retrieval systems. Yes, even Netflix has a huge foundation in information retrieval and a great information retrieval research team too. (Note that Netflix's work spans both information retrieval and a strong subset of that field called “recommender systems.”)

By marking “Not for me” on a Netflix film, you submitted a data label. You became a data labeler to help the “system” understand more about what you like (and also what people similar to you like) and to help Netflix train and tune their recommender systems further.

Data labels are all around us. Labels mark up data so it can be transformed into mathematical forms for measurement at scale.

Enormous amounts of these labels and “labeling” in the information retrieval and machine learning space are used as training data for machine learning. 

“This image has been labeled as a cat.” 

“This image has been labeled as a dog… cat… dog… dog… dog… cat,” and so on. 

All of the labels help machines learn what a dog or a cat looks like with enough examples of images marked as cats or dogs.

Labeling is not new; it’s been around for centuries, since the first classification of items took place. A label was assigned when something was marked as being in a “subset” or “set of things.” 

Anything “classified” has effectively had a label attached to it, and the person who marked the item as belonging to that particular classification is considered the labeler.

But moving forward to recent times, probably the best-known data labeling example is that of reCAPTCHA. Every time we select the little squares on the image grid, we add labels, and we are labelers. 

We, as humans, “enter the loop” and provide feedback and data.

With that explanation out of the way, let us move on to the different ways data labels and feedback are acquired, and in particular, feedback for “relevance” to queries to tune algorithms or evaluate experimental design by search engines.

Implicit and explicit evaluation feedback

While Google refers to their evaluation systems in documents meant for the non-technical audience overall as “rigorous testing,” human-in-the-loop evaluations in information retrieval widely happen through implicit or explicit feedback.

Implicit feedback

With implicit feedback, users aren’t actively aware they are providing feedback. The many live search traffic experiments (i.e., tests in the wild) search engines carry out on tiny segments of real users (as small as 0.1%), and subsequent analysis of click data, user scrolling, dwell time and result skipping, fall into the category of implicit feedback.

In addition to live experiments, the ongoing general click, scroll and browse behavior of real search engine users can also constitute implicit feedback and likely feed into “Learning to Rank (LTR) machine learning” click models. 

This, in turn, feeds into rationales for proposed algorithmic relevance changes, as non-temporal searcher behavior shifts and world changes lead to unseen queries and new meanings for queries. 

There is the age-old SEO debate around whether rankings change immediately before further evaluation from implicit click data. I will not cover that here other than to say there is considerable awareness of the huge bias and noise that comes with raw click data in the information retrieval research space and the huge challenges in its continuous use in live environments. This explains the many pieces of research work around proposed click models for unbiased learning to rank and learning to rank with bias.

Regardless, it is no secret overall in information retrieval how important click data is for evaluation purposes. There are countless papers and even IR books co-authored by Google research team members, such as “Click Models for Web Search” (Chuklin, Markov and de Rijke, 2015).

Google also openly states in their “rigorous testing” article:

“We look at a very long list of metrics, such as what people click on, how many queries were done, whether queries were abandoned, how long it took for people to click on a result and so on.”
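
To make the kinds of metrics in that quote more concrete, here is a minimal sketch of how they might be computed from a log of query events. The `QueryEvent` record, its field names and the sample data are all hypothetical; nothing here reflects how Google actually stores or processes its experiment data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class QueryEvent:
    """Hypothetical log record for one query issued by one user."""
    query: str
    clicked_rank: Optional[int]        # 1-based rank of the first click, None if abandoned
    seconds_to_click: Optional[float]  # time until the first click, None if abandoned


def summarize(events):
    """Roll a list of query events up into a few implicit-feedback metrics."""
    if not events:
        raise ValueError("need at least one event")
    clicked = [e for e in events if e.clicked_rank is not None]
    return {
        "queries": len(events),
        "abandonment_rate": 1 - len(clicked) / len(events),
        # The click-based averages are undefined when nobody clicked.
        "mean_seconds_to_click": mean(e.seconds_to_click for e in clicked) if clicked else None,
        "mean_clicked_rank": mean(e.clicked_rank for e in clicked) if clicked else None,
    }


# Made-up sample: three queries with clicks and one abandoned query.
events = [
    QueryEvent("liverpool manchester", 1, 4.0),
    QueryEvent("black friday deals", 3, 6.0),
    QueryEvent("train times", 2, 8.0),
    QueryEvent("derby kick off", None, None),
]
summary = summarize(events)
print(summary)
```

In a live experiment, summaries like this would typically be compared between the small test segment and a control group rather than read in isolation.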

And so a cycle continues. Detected change needed from Learning to Rank, click model application, engineering, evaluation, detected change needed, click model application, engineering, evaluation, and so forth.

Explicit feedback

In contrast to implicit feedback from unaware search engine users (in live experiments or in general use), explicit feedback is derived from actively aware participants or relevance labelers. 

The purpose of this relevance data collection is to mathematically roll it up and adjust overall proposed systems.

A gold standard of relevance labeling – considered near to a ground truth (i.e., the reality of the real world) of intent to query matching – is ultimately sought. 

There are various ways in which a gold standard of relevance labeling is gathered. However, a silver standard (less precise than gold but more widely available data) is often acquired (and accepted) and likely used to assist in further tuning.

Explicit feedback takes four main formats. Each has its advantages and disadvantages, largely about relevance labeling quality (compared with gold standard or ground truth) and how scalable the approach is.

Real users in feedback sessions with user feedback teams

Search engine user research teams and real users provided with different contexts in different countries collaborate in user feedback sessions to provide relevance data labels for queries and their intents. 

This format likely provides near to a gold standard of relevance. However, the method is not scalable due to its time-consuming nature, and the number of participants could never be anywhere near representative of the wider search population at large.

True subject matter experts / topic experts / professional annotators

True subject matter experts and professional relevance assessors provide relevance labels for queries mapped to their annotated intents, including many nuanced cases.

Since these are the authors of the query to intent mappings, they know the exact intent, and this type of labeling is likely considered near to a gold standard. However, this method, similar to the user feedback research teams format, is not scalable due to the sparsity of relevance labels and, again, the time-consuming nature of this process. 

This method was more widely used before introducing the more scalable approach of crowd-sourced human quality raters (to follow) in recent times.

Search engines simply ask real users whether something is relevant or helpful

Search engines actively ask real users whether a search result is helpful (or relevant), and those users consciously provide explicit binary feedback in the form of yes or no responses. Recent “thumbs up” design changes have been spotted in the wild.

Crowd-sourced human quality raters

The main source of explicit feedback comes from “the crowd.” Major search engines have huge numbers of crowd-sourced human quality raters provided with some training and handbooks and hired through external contractors working remotely worldwide. 

Google alone has a purported 16,000 such quality raters. These crowd-sourced relevance labelers and the programs they are part of are referred to differently by each search engine. 

Google refers to its participants as “quality raters” in the Quality Raters Program, with the third-party contractor referring to Google’s web search relevance program as “Project Yukon.” 

Bing refers to their participants as simply “judges” in the Human Relevance System (HRS), with third-party contractors referring to Bing’s project as simply “Web Content Assessor.” 

Despite these differences, participants’ purposes are primarily the same. The role of the crowd-sourced human quality rater is to provide synthetic relevance labels emulating search engine users across the world as part of explicit algorithmic feedback. Feedback often takes the form of a side-by-side (pairwise) comparison of proposed changes versus either existing systems or alongside other proposed system changes. 

Since much of this is considered offline evaluation, it isn’t always live search results that are being compared but also images of results. And it isn’t always a pairwise comparison, either. 

These are just some of the many different types of tasks that human quality raters carry out for evaluation, and data labeling, via third-party contractors. The relevance judges likely continue monitoring after a proposed change rolls out to production search, too (as the aforementioned Bing research paper alludes to, for example).

Whatever the method of feedback acquisition, human-in-the-loop relevance evaluations (either implicit or explicit) play a significant role before the many algorithmic updates (Google launched over 4,700 changes in 2022 alone, for example), including the now increasingly frequent broad core updates, which ultimately appear to be an overall evaluation of fundamental relevance revisited.



Relevance labeling at a query level and a system level

Despite blog posts alerting us to the scary prospect of human quality raters visiting our sites (spotted via referral traffic analysis), in systems built for scale, the results of quality rater evaluations at an individual page level, or even at an individual rater level, naturally have no significance on their own.

Human quality raters do not judge websites or webpages in isolation 

Evaluation is a measurement of systems, not web pages – with “systems” meaning the algorithms generating the proposed changes. All of the relevance labels (i.e., “relevant,” “not relevant,” “highly relevant”) provided by labelers roll up to a system level. 

“We use responses from raters to evaluate changes, but they don’t directly impact how our search results are ranked.”

– “How our Quality Raters make Search results better,” Google Search Help

In other words, while relevance labeling doesn’t directly impact rankings, aggregated data labeling does provide a means to take an overall (average) measurement of how precisely relevant the ranked results of a proposed algorithmic change (system) might be, with lots of reliance on various types of algorithmic averages.

Query-level scores are combined to determine system-level scores. Data from relevance labels is turned into numerical values and then into “average” precision metrics to “tune” the proposed system further before any roll-out to search engine users more broadly. 

When “humans enter the loop,” how far is the reality from the average precision metrics engineers hoped to achieve with the proposed change?

While we cannot be entirely sure of the metrics used on aggregated data labels when everything is turned into numerical values for relevance measurement, there are universally recognized information retrieval ranking evaluation metrics in many research papers. 

Most authors of such papers are search engine engineers, academics, or both. Production follows research in the information retrieval field, of which all web search is a part.

Such metrics are order-aware evaluation metrics, where the ranked order of relevance matters and the evaluation is weighted, or “punished,” if the ranked order is incorrect. These metrics include:

  • Mean reciprocal rank (MRR).
  • Rank-biased precision (RBP).
  • Mean average precision (MAP).
  • Normalized and un-normalized discounted cumulative gain (NDCG and DCG respectively).
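
To show how relevance labels become order-aware numbers, here is a minimal sketch of reciprocal rank (averaged into MRR) and of DCG and NDCG. The mapping from labels to numeric gains is an assumption for illustration, and the DCG shown uses the simple linear-gain form (gain divided by log2 of rank + 1); other variants, such as an exponential gain, are also common in the literature.

```python
import math

# Assumed mapping from judge labels to numeric gains (illustrative only).
GAIN = {"not relevant": 0, "relevant": 1, "highly relevant": 2}


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average the reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


def dcg(gains):
    """Discounted cumulative gain: results lower down the page count for less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical labels for the top three results of one query.
labels = ["relevant", "not relevant", "highly relevant"]
gains = [GAIN[label] for label in labels]
print(dcg(gains), round(ndcg(gains), 3))
print(mean_reciprocal_rank([[0, 1], [1, 0]]))
```

Because the highly relevant result sits in third place rather than first, the NDCG for this example falls below 1.0, which is the “punishing” of an incorrect order described above.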

In a 2022 research paper co-authored by a Google research engineer, NDCG and AP (average precision) are referred to as a norm in the evaluation of pairwise ranking results:

“A fundamental step in the offline evaluation of search and recommendation systems is to determine whether a ranking from one system tends to be better than the ranking of a second system. This often involves, given item-level relevance judgments, distilling each ranking into a scalar evaluation metric, such as average precision (AP) or normalized discounted cumulative gain (NDCG). We can then say that one system is preferred to another if its metric values tend to be higher.”

– “Offline Retrieval Evaluation Without Evaluation Metrics,” Diaz and Ferraro, 2022
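
The conventional approach described in that quote, where one system is preferred if its metric values tend to be higher, can be sketched as follows. The per-query scores are made-up numbers standing in for any scalar metric such as AP or NDCG, and the function name is hypothetical. A real evaluation would normally add a statistical significance test, which is omitted here.

```python
from statistics import mean


def compare_systems(scores_a, scores_b):
    """Compare two systems using one metric value per query for each system."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "wins_a": sum(1 for d in diffs if d < 0),  # queries where A scored higher
        "wins_b": sum(1 for d in diffs if d > 0),  # queries where B scored higher
        "ties": sum(1 for d in diffs if d == 0),
    }


# Hypothetical per-query metric values: existing system (A) vs. proposed change (B).
existing = [0.5, 0.7, 0.6, 0.4]
proposed = [0.6, 0.7, 0.8, 0.5]
result = compare_systems(existing, proposed)
print(result)
```

Here the proposed system scores higher on three of the four queries and ties on the other, so it would be the preferred system under this simple comparison.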

Information on DCG, NDCG, MAP, MRR and their commonality of use in web search evaluation and ranking tuning is widely available.

Victor Lavrenko, a former assistant professor at the University of Edinburgh, also describes one of the more common evaluation metrics, mean average precision, well:

“Mean Average Precision (MAP) is the standard single-number measure for comparing search algorithms. Average precision (AP) is the average of … precision values at all ranks where relevant documents are found. AP values are then averaged over a large set of queries…”
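
Lavrenko's description translates fairly directly into code. The sketch below uses binary relevance labels and assumes every relevant document for a query appears somewhere in the ranked list; when that does not hold, average precision is usually divided by the total number of relevant documents instead. The example rankings are invented.

```python
def average_precision(relevance):
    """Average of the precision values at each rank where a relevant result is found."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Average the per-query AP values over a set of queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical queries with binary labels in ranked order.
query_1 = [1, 0, 1, 0]   # AP = (1/1 + 2/3) / 2
query_2 = [0, 1]         # AP = (1/2) / 1
map_score = mean_average_precision([query_1, query_2])
print(round(map_score, 3))
```

This is the query-level to system-level roll-up in miniature: each query gets its own score, and the scores are then averaged into a single number for the system.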

So it’s literally all about averages. The curated data labels judges submit are distilled into a consumable numerical metric and compared with the predicted averages engineers hoped for, and the ranking algorithms are then tuned further.

Quality raters are simply relevance labelers

Quality raters are simply relevance labelers, classifying and feeding a huge pipeline of data, rolled up and turned into numerical scores for:

  • Aggregation on whether a proposed change is near an acceptable average level of relevance precision or user satisfaction.
  • Or determining whether the proposed change needs further tuning (or total abandonment).

The sparsity of relevance labeling causes a bottleneck

Regardless of the evaluation metrics used, the initial data (the relevance labels) is the most important part of the process since, without labels, no measurement via evaluation can take place.

A ranking algorithm or proposed change is all very well, but unless “humans enter the loop” and determine whether it is relevant in evaluation, the change likely won’t happen.

For the past couple of decades, in information retrieval widely, the main pipeline of this HITL-labeled relevance data has come from crowd-sourced human quality raters, which replaced the use of the professional (but fewer in numbers) expert annotators as search engines (and their need for speedy iteration) grew. 

They feed in yays and nays, which are in turn converted into numbers and averages in order to tune search systems.

But scale (and the need for more and more relevance labeled data) is increasingly problematic, and not just for search engines (even despite these armies of human quality raters). 

The scalability and sparsity issue of data labeling presents a global bottleneck and the classic “demand outstrips supply” challenge.

Widespread demand for data labeling has grown phenomenally due to the explosion in machine learning in many industries and markets. Everyone needs lots and lots of data labeling. 

Recent research by consulting firm Grand View Research illustrates the huge growth in market demand, reporting:

“The global data collection and labeling market size was valued at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion.”

This is very problematic, particularly in increasingly competitive arenas such as AI-driven generative search, where the effective training of large language models requires huge amounts of labeling and annotations of many types.

Authors at DeepMind, in a 2022 paper, state:

“We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled.”

– “Training Compute-Optimal Large Language Models,” Hoffmann et al.

When the number of labels needed grows more quickly than the crowd can reliably produce them, a bottleneck in scalability for relevance and quality via rapid evaluation on production roll-outs can occur.

Lack of scalability and sparsity do not fit well with speedy iterative progress

Lack of scalability was an issue when search engines moved away from the industry norm of professional, expert annotators and toward the crowd-sourced human quality raters providing relevance labels, and scale and data sparsity are once again major issues with the status quo of using the crowd.

Some problems with crowd-sourced human quality raters

In addition to the lack of scale, other issues come with using the crowd. Some of these relate to human nature, human error, ethical considerations and reputational concerns.

While relevance remains largely subjective, crowd-sourced human quality raters are provided with, and tested on, lengthy handbooks, in order to determine relevance. 

Google’s publicly available Quality Raters Guide is over 160 pages long, and Bing’s Human Relevance Guidelines is “reported to be over 70 pages long,” per Thomas et al.

Bing is much more coy with their relevance training handbooks. Still, if you root around in the depths online, as I did when researching this piece, you can find documentation that looks like one of their judging guidelines, with incredible detail on what relevance means (in this instance for local search).

Efforts are made in this training to instill a mindset appreciative of the evaluator’s role as a “pseudo” search engine user in their natural locale. 

The synthetic user mindset needs to consider many factors when emulating real users with different information needs and expectations. 

These needs and expectations depend on several factors beyond simply their locale, including age, race, religion, gender, personal opinion and political affiliation. 

The crowd is made of people

Unsurprisingly, humans are not without their failings as relevance data labelers.

Human error needs no explanation at all and bias on the web is a known concern, not just for search engines but more generally in search, machine learning, and AI overall. Hence, the dedicated “responsible AI” field has emerged, in part, to combat baked-in biases in machine learning and algorithms.

However, findings in the 2022 large-scale study by Thomas et al., Bing researchers, highlight factors leading to reduced precision relevance labeling going beyond simple human error and traditional conscious or unconscious bias.

Even with the training and handbooks, Bing’s findings, derived from “hundreds of millions of labels, collected from hundreds of thousands of workers as a routine part of search engine development,” underscore some less obvious factors, more akin to physiological and cognitive factors, that contribute to reduced precision in relevance labeling tasks. They can be summarized as follows:

  • Task-switching: Corresponded directly with a decline in quality of relevance labeling, which was significant as only 28% of participants worked on a single task in a session with all others moving between tasks. 
  • Left side bias: In a side-by-side comparison, a result displayed on the left side was more likely to be selected as relevant when compared with results on the right side. Since pair-wise analysis by search engines is widespread, this is concerning.
  • Anchoring: Played a part in relevance labeling choices, whereby the relevance label assigned on the first result by a labeler is also much more likely to be the relevance label assigned for the second result. This same label selection appeared to have a descending probability of selection in the first 10 evaluated queries in a session. After 10 evaluated queries, the researchers found that the anchoring issue seemed to disappear. In this instance the labeler hooks (anchors) onto the first choice they make and since they have no real notion of relevance or context at that time, the probability of them choosing the same relevance label with the next option is high. This phenomenon disappears as the labeler gathers more information from subsequent pairwise sets to consider.
  • Fatigue: General fatigue of crowd workers played a part in reduced precision labeling.
  • Disagreement: Judges generally disagreed on which of the two results in a pair was relevant, reflecting simply differing opinions and perhaps a lack of true understanding of the intended search engine user’s context.
  • Time of day and day of week: When the labeling was carried out by evaluators also played a role. The researchers noted some related findings that appeared to correlate spikes in reduced relevance labeling accuracy with regional celebrations being underway, which might easily have been considered simple human error, or noise, if not explored more fully.
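
To illustrate how an effect like left side bias might be detected, here is a small sketch. It is not taken from the Bing paper: it assumes a hypothetical side-by-side task where the placement of each system's result is randomized, so that with no position bias, judges would be expected to pick the left side roughly half the time. All function names and data are invented for the example.

```python
import random


def present_pair(result_a, result_b, rng):
    """Randomly decide which system's result is shown on the left."""
    if rng.random() < 0.5:
        return {"left": ("A", result_a), "right": ("B", result_b)}
    return {"left": ("B", result_b), "right": ("A", result_a)}


def left_pick_rate(picks):
    """Share of judgments in which the left-hand result was chosen."""
    if not picks:
        raise ValueError("need at least one judgment")
    return sum(1 for side in picks if side == "left") / len(picks)


rng = random.Random(42)  # seeded so the example is repeatable
layout = present_pair("existing results", "proposed results", rng)

# Made-up judge choices; a rate well above 0.5 would hint at a left side bias.
picks = ["left", "left", "right", "left"]
rate = left_pick_rate(picks)
print(layout["left"][0], rate)
```

With only four judgments the rate is meaningless, of course; the point is that randomized placement gives a baseline against which a skew toward one side can be measured at scale.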

The crowd is not perfect at all.

A dark side of the data labeling industry

Then there is the other side of the use of human crowd-sourced labelers, which concerns society as a whole: that of low-paid “ghost workers” in emerging economies employed to label data for search engines and others in the tech and AI industry.

Major online publications increasingly draw attention to this issue with headlines like:

  • “Millions of Workers Are Training AI Models for Pennies” (Wired, October 2023)
  • MIT Technology Review’s 2022 headline claims the AI industry benefits from economic catastrophe in non-Western economies to scale data label collection.
  • Time Magazine’s January 2023 piece reports on OpenAI’s usage of Kenyan workers on less than $2 per hour to make ChatGPT less toxic using labeling services.

And, we have Google’s own third-party quality raters protesting for higher pay as recently as February 2023, with claims of “poverty wages and no benefits.”

Add all of this to the potential for human error, bias, scalability concerns with the status quo, the subjectivity of “relevance,” the lack of true searcher context at the time of query and the inability to truly determine whether a query has a navigational intent.

And we have not even touched upon the potential minefield of regulations and privacy concerns around implicit feedback.

How to deal with lack of scale and “human issues”?

Enter large language models (LLMs), ChatGPT and increasing use of machine-generated synthetic data.

Is the time right to look at replacing ‘the crowd’?

A 2022 research piece from “Frontiers of Information Access Experimentation for Research and Education” involving several respected information retrieval researchers explores the feasibility of replacing the crowd, illustrating the conversation is well underway.

Clarke et al. state: 

“The recent availability of LLMs has opened the possibility to use them to automatically generate relevance assessments in the form of preference judgements. While the idea of automatically generated judgements has been looked at before, new-generation LLMs drive us to re-ask the question of whether human assessors are still necessary.”

However, when considering the current situation, Clarke et al. raise specific concerns around a possible degradation in the quality of relevance labeling in exchange for huge scale potentials:

Concerns about reduced quality in exchange for scale?

“It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available.” 

The researchers draw parallels with the previous major shift in the information retrieval space, some years before, away from professional annotators and toward “the crowd,” continuing:

“Nevertheless, a similar change in terms of data collection paradigm was observed with the increased use of crowd assessor…such annotation tasks were delegated to crowd workers, with a substantial decrease in terms of quality of the annotation, compensated by a huge increase in annotated data.”

They surmise that, over time, a spectrum of balanced machine and human collaboration, or a hybrid approach to relevance labeling for evaluations, may be a feasible way forward.

A wide range of options from 0% machine and 100% human right across to 100% machine and 0% human is explored.

The researchers consider options whereby the human is at the beginning of the workflow providing more detailed query annotations to assist the machine in relevance evaluation, or at the end of the process to check the annotations provided by the machines.

In this paper, the researchers draw attention to the unknown risks that may emerge through the use of LLMs in relevance annotation over human crowd usage, but do concede that, at some point, there will likely be an industry move toward the replacement of human annotators in favor of LLMs:

“It is yet to be understood what the risks associated with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators.”

Things move fast in the world of LLMs

But much progress can take place in a year, and despite these concerns, other researchers are already rolling with the idea of using machines as relevance labelers.

Despite the concerns raised in the Clarke et al. paper around reduced annotation quality should a large-scale move toward machine usage occur, in less than a year, there has been a significant development that impacts production search.

Very recently, Mark Sanderson, a well-respected and established information retrieval researcher, shared a slide from a presentation by Paul Thomas, one of four Bing research engineers presenting their work on the implementation of GPT-4 as relevance labelers rather than humans from the crowd. 

Researchers from Bing have made a breakthrough in using LLMs to replace “the crowd” annotators (in whole or in part) in the 2023 paper, “Large language models can accurately predict searcher preferences.” 

The enormity of this recent work by Bing (in terms of the potential change for search research) was emphasized in a tweet by Sanderson. Sanderson described the talk as “incredible,” noting, “Synthetic labels have been a holy grail of retrieval research for decades.”

While sharing the paper and subsequent case study, Thomas also shared that Bing is now using GPT-4 for its relevance judgments. So, this is not just research, but (to an unknown extent) production search too.

So what has Bing done?

The use of GPT-4 at Bing for relevance labeling

The traditional approach of relevance evaluation typically produces a varied mixture of gold and silver labels when “the crowd” provides judgments from explicit feedback after reading “the guidelines” (Bing’s equivalent of Google’s Quality Raters Guide). 

In addition, live tests in the wild utilizing implicit feedback typically generate gold labels (the reality of the real world “human in the loop”), but with a lack of scale and high relative costs. 

Bing’s approach utilized GPT-4 LLM machine-learned pseudo-relevance annotators created and trained via prompt engineering. The purpose of these instances is to emulate quality raters to detect relevance based on a carefully selected set of gold standard labels.

This was then rolled out to provide bulk “gold label” annotations more widely via machine learning, reportedly for a fraction of the relative cost of traditional approaches. 

The prompt included telling the system that it is a search quality rater whose purpose is to assess whether documents in a set of results are relevant to a query using a label reduced to a binary relevant / not relevant judgment for consistency and to minimize complexity in the research work.
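As a rough illustration of the kind of prompt described above, here is a minimal Python sketch. The template wording, the function names `build_prompt` and `parse_label`, and the expected answer strings are all assumptions made for illustration; this is not Bing's actual prompt, and no model is called here.

```python
# Hypothetical prompt template: the wording is an assumption, not Bing's prompt.
PROMPT_TEMPLATE = (
    "You are a search quality rater evaluating the relevance of web pages.\n"
    "Query: {query}\n"
    "Document: {document}\n"
    "Is the document relevant to the query? "
    "Answer with exactly 'relevant' or 'not relevant'."
)


def build_prompt(query: str, document: str) -> str:
    """Fill the template for one query/document pair."""
    return PROMPT_TEMPLATE.format(query=query, document=document)


def parse_label(response: str) -> bool:
    """Map the model's text answer to a binary label (True = relevant)."""
    text = response.strip().lower().rstrip(".")
    if text == "relevant":
        return True
    if text == "not relevant":
        return False
    # Anything else is treated as unusable rather than guessed at.
    raise ValueError(f"Unexpected label: {response!r}")
```

Reducing the answer to two fixed strings keeps parsing trivial, which mirrors the binary relevant / not relevant simplification the researchers describe.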

To aggregate evaluations more broadly, Bing sometimes utilized up to five pseudo-relevance labelers via machine learning per prompt.
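The article does not say how the judgments from several labelers were combined. One plausible rule is a simple majority vote, sketched below; the function name `aggregate_labels` and the tie-breaking choice are assumptions for illustration only.

```python
def aggregate_labels(labels: list[bool]) -> bool:
    """Majority vote over binary relevance labels (True = relevant).

    Ties resolve to "not relevant"; that tie-break is an assumption.
    """
    if not labels:
        raise ValueError("need at least one label")
    relevant_votes = sum(labels)  # True counts as 1, False as 0
    return relevant_votes * 2 > len(labels)
```

With an odd number of labelers, such as five, ties cannot occur, which is one practical reason to prefer odd panel sizes.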

The approach and impacts for cost, scale and purported accuracy are illustrated below and compared with other traditional explicit feedback approaches, plus implicit online evaluation.

Interestingly, two of the paper’s co-authors also co-authored Bing’s research piece, “The Crowd is Made of People,” and are undoubtedly well aware of the challenges of using the crowd.

Source: “Large language models can accurately predict searcher preferences,” Thomas et al., 2023

With these findings, Bing researchers claim:

“To measure agreement with real searchers needs high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.” 

Scale and low-cost combined

These findings illustrate machine learning and large language models have the potential to reduce or eliminate bottlenecks in data labeling and, therefore, the evaluation process.

This is a sea-change pointing the way to an enormous step forward in how evaluation before algorithmic updates is undertaken, since the potential for scale at a fraction of the cost of “the crowd” is considerable.

It’s not just Bing reporting on the success of machines over humans in relevance labeling tasks, and it’s not just ChatGPT either. Research into whether human assessors can be replaced in part or wholly by machines certainly picked up pace across 2022 and 2023.

Others are reporting some success in utilizing machines over humans for relevance labeling, too

In a July 2023 paper, researchers at the University of Zurich found open source large language models (FLAN and HuggingChat) outperform human crowd workers (including trained relevance annotators and consistently high-scoring crowd-sourced MTurk human relevance annotators). 

Although this work was carried out on tweet analysis rather than search results, their findings were that other open-source large language models were not only better than humans but were almost as good in their relevance labeling as ChatGPT (Alizadeh et al, 2023).

This opens the door to even more potential going forward for large-scale relevance annotations without the need for “the crowd” in its current format.

But what might come next, and what will become of ‘the crowd’ of human quality raters?

Responsible AI importance 

Caution is likely overwhelmingly front of mind for search engines. There are other highly important considerations.

Responsible AI, as-yet-unknown risks with these approaches, and the detection and removal of baked-in bias (or at least an awareness of, and adjustment for, bias) are just a few. LLMs tend to “hallucinate,” and “overfitting” could present problems as well, so monitoring might well consider factors such as these, with guardrails built as necessary. 

Explainable AI also calls for models to provide an explanation as to why a label or other type of output was deemed relevant, so this is another area where there will likely be further development. Researchers are also exploring ways to create bias awareness in LLM relevance judgments. 

Human relevance assessors are monitored continuously anyway, so continual monitoring is already a part of the evaluation process. However, one can presume Bing, and others, would tread much more cautiously with this machine-led approach than with “the crowd” approach. Careful monitoring will also be required to avoid drops in quality in exchange for scalability.

In outlining their approach (illustrated in the image above), Bing shared this process: 

  • Select via gold labels
  • Generate labels in bulk
  • Monitor with several methods

“Monitor with several methods” would certainly fit with a clear note of caution.
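The three steps can be sketched in Python as follows. This is a toy outline under stated assumptions, not Bing's pipeline: the labelers are stand-in keyword functions rather than LLM calls, agreement is measured as a plain match rate, and the function names and the 0.8 threshold are hypothetical.

```python
from typing import Callable

Labeler = Callable[[str, str], bool]  # (query, document) -> relevant?


def agreement(predicted: list[bool], gold: list[bool]) -> float:
    """Fraction of items where predicted labels match gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def select_labeler(labelers: dict[str, Labeler],
                   gold_set: list[tuple[str, str, bool]]) -> str:
    """Step 1: pick the candidate that agrees best with the gold labels."""
    gold = [label for _, _, label in gold_set]
    scores = {
        name: agreement([fn(q, d) for q, d, _ in gold_set], gold)
        for name, fn in labelers.items()
    }
    return max(scores, key=scores.get)


def label_in_bulk(labeler: Labeler, pairs: list[tuple[str, str]]) -> list[bool]:
    """Step 2: apply the selected labeler to unlabeled query/document pairs."""
    return [labeler(q, d) for q, d in pairs]


def monitor(labeler: Labeler, fresh_gold: list[tuple[str, str, bool]],
            threshold: float = 0.8) -> bool:
    """Step 3: check agreement on a fresh gold sample against a threshold."""
    gold = [label for _, _, label in fresh_gold]
    predicted = [labeler(q, d) for q, d, _ in fresh_gold]
    return agreement(predicted, gold) >= threshold


# Stand-in labelers for demonstration; a real system would call an LLM here.
def keyword_labeler(query: str, document: str) -> bool:
    return any(term in document.lower() for term in query.lower().split())


def always_relevant(query: str, document: str) -> bool:
    return True


gold_set = [
    ("train times", "Liverpool to Manchester train times", True),
    ("train times", "Football derby results", False),
    ("black friday deals", "Best Black Friday deals on laptops", True),
]
candidates = {"keyword": keyword_labeler, "always": always_relevant}
best = select_labeler(candidates, gold_set)
```

In practice the monitoring step would combine several checks, as the slide suggests; a single agreement threshold is used here only to keep the sketch short.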

Next steps?

Bing, and others, will no doubt look to improve upon these new means of gathering annotations and relevance feedback at scale. The door is unlocked to a new agility.

A low-cost, hugely scalable relevance judgment process undoubtedly gives a strong competitive advantage when adjusting search results to meet changing information needs.

As the saying goes, the cat is out of the bag, and one could presume the research will continue to heat up to a frenzy in the information retrieval space (including other search engines) in the short to medium term.

A spectrum of human and machine assessors?

In their 2023 paper “HMC: A Spectrum of Human–Machine-Collaborative Relevance Judgement Frameworks,” Clarke et al. alluded to a feasible approach in which the later stages of any move to replace the crowd with machines might well take a hybrid or spectrum form.

While a spectrum of human-machine collaboration might increase in favor of machine-learned methods as confidence grows and after careful monitoring, none of this means “the crowd” will leave entirely. The crowd may become much smaller, though, over time.

It seems unlikely that search engines (or IR research at large) would move completely away from using human relevance judges as a guardrail and a sobering sense-check or even to act as judges of the relevance labels generated by machines. Human quality raters also present a more robust means of combating “overfitting.”

Not all search areas are considered equal in terms of their potential impact on the life of searchers. Clarke et al., 2023, stress the importance of a more trusted human judgment in areas such as journalism, and this would fit well with our understanding as SEOs of Your Money or Your Life (YMYL).

The crowd might well just take on other roles depending upon the weighting in a spectrum, possibly moving into more of a supervisory role, or as an exam marker of machine-learned assessors, with exams provided for large language models requiring explanations as to how judgments were made.

Clarke et al. ask: “What weighting between human and LLMs and AI-assisted annotations is ideal?” 

What weighting of human to machine is implemented in any spectrum or hybrid approach might depend on how quickly the pace of research picks up. While not entirely comparable, if we look at the herd movement in the research space after the introduction of BERT and transformers, one can presume things will move very quickly indeed. 

Furthermore, there is also a massive move toward synthetic data already, so this “direction of travel” fits with that. 

According to Gartner:

  • “Solutions such as AI-specific data management, synthetic data and data labeling technologies, aim to solve many data challenges, including accessibility, volume, privacy, security, complexity and scope.” 
  • “By 2024, Gartner predicts 60% of data for AI will be synthetic to simulate reality, future scenarios and de-risk AI, up from 1% in 2021.” 

Will Google adopt these machine-led evaluation processes?

Given the sea-change to decades-old practices in the evaluation processes widely used by search engines, it would seem unlikely Google would not at least be looking into this very closely or even be striving towards this already. 

If the evaluation process has a bottleneck removed via the use of large language models, leading to massively reduced data sparsity for relevance labeling and algorithmic update feedback at lower costs for the same, and the potential for higher quality levels of evaluation too, there is a certain sense in “going there.”

Bing has a significant commercial advantage with this breakthrough, and Google has to stay in, and lead, the AI game.

Removals of bottlenecks have the potential to massively increase scale, particularly in non-English languages and into additional markets where labeling might have been more difficult to obtain (for example, the subject matter expert areas or the nuanced queries around more technical topics). 

While we know that Google’s Search Generative Experience beta, despite expanding to 120 countries, is still considered an experiment to learn how people might interact with, or find useful, generative AI search experiences, Google has already stepped over the “AI line.”

Greg Gifford on X - SGE is an experiment

However, Google is still incredibly cautious about using AI in production search.

Who can blame them, given all the antitrust and legal cases, plus the prospect of reputational damage and increasing legislation on user privacy and data protection?

James Manyika, Google’s senior vice president of technology and society, speaking at Fortune’s Brainstorm AI conference in December 2022, explained:

“These technologies come with an extraordinary range of risks and challenges.” 

However, Google is not shy about undertaking research into the use of large language models. Heck, BERT came from Google in the first place. 

Certainly, Google is exploring the potential use of synthetic query generation for relevance prediction, too, as illustrated in this 2023 paper by Google researchers, presented at the SIGIR information retrieval conference.

Google paper 2023 on relevance prediction

Since synthetic data in AI/ML reduces other risks that might relate to privacy, security, and the use of user data, simply generating data out of thin air for relevance prediction evaluations may actually be less risky than some of the current practices.

Add to this the other factors that could build a case for Google jumping on board with these new machine-driven evaluation processes (to any extent, even if the spectrum is mostly human to begin with):

  • The research in this space is heating up. 
  • Bing is running with some commercial implementation of machine over people labeling. 
  • SGE needs loads of labels.
  • There are scale challenges with the status quo.
  • The increasing spotlight on the use of low-paid workers in the data-labeling industry overall. 
  • Respected information retrieval researchers are asking whether now is the time to revisit the use of machines over humans in labeling.

Openly discussing evaluation as part of the update process

Google also seems to be talking much more openly of late about “evaluation” too, and how experiments and updates are undertaken following “rigorous testing.” There does seem to be a shift toward opening up the conversation with the wider community.

Here’s Danny Sullivan just last week giving an update on updates and “rigorous testing.”

Martin Splitt on X - Search Central Live

And again, explaining why Google does updates.

Greg Bernhardt on X

Search Off the Record recently discussed “Steve,” an imaginary search engine, and how updates to Steve might be implemented based on the judgments of human evaluators, with potential for bias, amongst other points discussed. There was a good amount of discussion around how changes to Steve’s features were tested and so forth. 

This all seems to indicate a shift around evaluation unless I am simply imagining this.

In any event, there are already elements of machine learning in the relevance evaluation process, albeit implicit feedback. Indeed, Google recently updated its documentation on “how search works” around detecting relevant content via aggregated and anonymized user interactions.

“We transform that data into signals that help our machine-learned systems better estimate relevance.”

So perhaps following Bing’s lead is not that far a leap to take after all?

What if Google takes this approach?

What might we expect to see if Google embraces a more scalable approach to the evaluation process (huge access to more labels, potentially with higher quality, at lower cost)?

Scale, more scale, agility, and updates

Scale in the evaluation process and speedy iteration of relevance feedback and evaluations pave the way for a much greater frequency of updates, and into many languages and markets.

An evolving, iterative alignment with true relevance, and algorithmic updates to meet this, could be ahead of us, with less broad sweeping impacts. A more agile approach overall. 

Bing takes a much more agile approach in its evaluation process already, and the breakthrough with LLMs as relevance labelers makes it even more so. 

Fabrice Canel of Bing, in a recent interview, reminded us of the search engine’s constantly evolving evaluation approach where the push out of changes is not as broad sweeping and disruptive as Google’s broad core update or “big” updates. Apparently, at Bing, engineers can ideate, gain feedback quickly, and sometimes roll out changes in as little as a day or so.

All search engines will have compliance and strict review processes, which cannot be conducive to agility and will no doubt build up to a form of process debt over time as organizations age and grow. However, if the relevance evaluation process can be shortened dramatically while largely maintaining quality, this takes away at least one big blocker to algorithmic change management.

We have already seen a big increase in the number of updates this year, with three broad core updates (relevance re-evaluations at scale) between August and November and many other changes concerning spam, helpful content, and reviews in between.

Coincidentally (or probably not), we’re told “to buckle up” because major changes are coming to search. Changes designed to improve relevance and user satisfaction. All the things the crowd traditionally provides relevance feedback on.

Kenichi Suzuki on X

So, buckle up. It’s going to be an interesting ride.

rustybrick on X - Google buckle up

If Google takes this route (using machine labeling in favor of the less agile “crowd” approach), expect a lot more updates overall, and likely, many of these updates will be unannounced, too. 

We could potentially see an increased broad core update cadence with reduced impacts as agile rolling feedback helps to continually tune “relevance” and “quality” in a faster cycle of Learning to Rank, adjustment, evaluation and rollout.

Gianluca Fiorelli on X - endless updates

The post Quality rater and algorithmic evaluation systems: Are major changes coming? appeared first on Search Engine Land.

