Humans in the AI loop: the data labelers behind some of the most powerful LLMs' training datasets

Who are the workers behind the training datasets powering the biggest LLMs on the market? In this explainer, we delve into data labeling as part of the AI supply chain, the labourers behind this data labeling, and how this exploitative labour ecosystem functions, aided by algorithms and larger systemic governance issues that exploit microworkers in the gig economy.

Explainer
Key points
  • High-quality training data is the crucial ingredient for producing a well-performing LLM, and for supervised training that data takes the form of labeled datasets.
  • Several digital labour platforms have risen to the task of supplying data labeling for LLM training. However, a lack of transparency and the use of algorithmic decision-making models undergird their exploitative business models.
  • Workers are often not informed about who or what they are labeling raw datasets for, and they are subjected to algorithmic surveillance and decision-making systems that make their jobs unstable and their wages unpredictable.

Behind every machine is a human person who makes the cogs in that machine turn - there's the developer who builds (codes) the machine, the human evaluators who assess the machine's performance, even the people who build its physical parts. In the case of the large language models (LLMs) powering your AI systems, these 'human persons' are the invisible data labelers from all over the world who manually annotate the datasets that train the machine to recognise the colour 'blue,' identify the objects in a photograph, or judge whether a chatbot's response is adequate.

With the rapid expansion and development of LLMs, the response of AI developers has been to scale up their training methods to be faster and smarter. In this explainer, we delve into data labeling as part of the AI supply chain, the labourers behind data labeling, and how this exploitative labour ecosystem functions.

Background: the training stages of an LLM

Large language models (LLMs) are advanced machine learning models, such as GPT-4, designed to understand and generate content like a human being. LLMs are trained on massive amounts of data, allowing them to capture complex patterns in language and perform a wide variety of natural language tasks. They can be trained for specific purposes such as delivering chatbot responses, predicting text in email-writing, or summarising textual or verbal material.

At a high level, the three training stages behind such an LLM are: 1) self-supervised learning; 2) fine-tuning (supervised learning); 3) reinforcement learning.

Figure 1. Three training stages of a large language model.

Self-supervised learning is the stage in which the basic foundation model is built from raw, unlabeled data, typically a massive corpus obtained by crawling the web (e.g., Common Crawl) or a curated dataset like The Pile. The next stages of fine-tuning (supervised and reinforcement learning) are where the human labour comes in. Supervised learning entails training the AI model against a labeled dataset, whose elements are annotated by humans, so that the model learns to tell 'right' responses from wrong ones. This is supplemented by reinforcement learning from human feedback (RLHF), which involves, among other steps, human evaluators building a supervised training dataset from specified prompts, model responses and human-evaluated rankings against which the model is retrained. To create a supervised training dataset, data labelers mark raw datapoints (images, text, sensor data, etc.) with 'labels' that help the AI model make crucial decisions, such as allowing an autonomous vehicle to distinguish a pedestrian from a cyclist.
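To make the distinction between these stages concrete, below is a minimal, purely illustrative sketch (in Python) of the kind of records human annotators produce at the fine-tuning and RLHF stages. The field names and examples are invented for illustration; they are not any vendor's actual data schema.

```python
# Illustrative only: invented field names and examples, not a real platform's data format.

# 1) Supervised fine-tuning: each raw datapoint is paired with a human-written label.
sft_example = {
    "prompt": "Summarise: 'The meeting was moved from Tuesday to Thursday.'",
    "label": "The meeting now takes place on Thursday instead of Tuesday.",
}

# 2) RLHF preference data: annotators rank several model responses to the same prompt.
rlhf_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "responses": [
        "Plants use sunlight, water and air to make their own food.",
        "Photosynthesis is a biochemical process involving chlorophyll...",
    ],
    "human_ranking": [0, 1],  # annotators judged response 0 better than response 1
}

def to_preference_pairs(example):
    """Turn a ranked example into (better, worse) pairs used to train a reward model."""
    order = example["human_ranking"]
    responses = example["responses"]
    return [
        (responses[order[i]], responses[order[j]])
        for i in range(len(order))
        for j in range(i + 1, len(order))
    ]

print(to_preference_pairs(rlhf_example))
```

Each such record is small on its own; it is the sheer volume of them, produced by human labelers, that makes supervised fine-tuning and RLHF possible.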

Figure 2. Screenshot of Neptune AI's data labeling software as an example of what a labeling task dashboard might look like.

Models can be continually re-trained through iteration on human feedback to further fine-tune inadequate behaviours and to make the model more robust, for example enabling OpenAI to detect and prevent jailbreaks. High-quality datasets supported by humans-in-the-loop (labeling) are crucial for this training and re-training process to ensure accurate, consistent and complete outputs. Low-quality data can result in a model producing incorrect or unfavourable outcomes, such as biased or inconsistent responses.

We differentiate two categories of data labelers behind supervised training datasets: non-subject-matter-specific data labelers annotating generic, large-scale datasets, and 'expert' data labelers annotating subject-matter-specific datasets. The first type of data labeler has been the most common. It involves microworkers contracted from all over the world trawling through hours of menial, and sometimes heinous, content to annotate datasets that can be used to train most types of AI. These datasets might be labeled for chatbots like ChatGPT, autonomous driving algorithms or image-to-text generators. The second category of data labelers is a more niche type that requires subject matter experts, such as doctors labeling medical data or lawyers labeling legal responses, as a specialist form of 'expert fact-checking'. While both categories of data labelers are technically performing the same type of work (labeling images or text), they differ in the level of expert knowledge that may be required for domain-specific labeling demands.

Data labeling and the gen AI supply chain

Evidently, high-quality training data is the key element in creating a well-performing LLM. Some have even gone so far as to say that the algorithm itself matters much less than the training data, as robust training datasets are what set one LLM's performance apart from another's.

Numerous companies have risen to the task of meeting the data labeling demand in the AI supply chain. These data companies can be categorised into two types of labour marketplaces for data labeling work: 1) microworker platforms that advertise a range of jobs, such as Amazon's Mechanical Turk (AMT), and 2) recently emerging platforms dedicated specifically to data labeling jobs, such as Surge AI, Scale AI (Remotasks), iMerit and Karya.

Microworker platforms like AMT function like an internet marketplace of labour where employers can post a variety of tasks with an attached price for completion, such as translation services, labeling the colors in a photo or rewriting a sentence. Platforms dedicated specifically to data labeling jobs like Remotasks, a subsidiary of Scale AI, operate similarly, assigning workers paid labeling tasks that vary in scope and scale (one task might be to label 100 photographs, another might be to label objects in an hours-long video for an autonomous vehicle algorithm). 

Figure 3. A screenshot of Remotasks' data labeling job advertisement on its website.

These digital labour platforms are effectively an alternative to hiring and managing large numbers of employees: instead of a formal, contractual employer-employee relationship, tasks are outsourced to on-demand labourers in a humans-as-a-service model. Most of these platforms operate on this crowdsourcing model of labour, in which a constantly shifting network of 'pieceworkers', or microworkers, performs menial tasks for the larger organisation in place of contractual employees.

Due to the vast quantities of labeled data that AI companies like OpenAI, which contracts the data services of Scale AI, and Microsoft, which contracts Surge AI, require for supervised training, the AI supply chain has spread far and wide to countries like Kenya, India, the Philippines and Venezuela, where labour is cheaper and more plentiful. As we will discuss below, this has resulted in the blatant exploitation of 'humans-as-a-service', where workers are dispensable and companies can get away with paying them as little as $2 an hour for the labeling work that powers billion-dollar machines.

Behind the data labeling labour ecosystem: from lack of transparency to algorithmic surveillance

The ongoing AI race kickstarted by OpenAI has AI developers focused on getting their labeled, supervised training datasets as quickly and as cheaply as possible. Disguised behind the corporate rhetoric of booming job opportunities and pay above the minimum wage (a promise not so well kept), AI developers exploit an eager workforce to power their intensive demand for labeled training data. This demand translates into an exploitative labour market. While labourers are left with an unreliable source of work, underpaid wages and challenging working conditions, the major AI players face little to no accountability.

Lack of transparency around the work: workers don't know who, or what machine, they're labeling for

Exacerbating the power imbalance between exploited worker and exploiting AI developer is the lack of transparency around the supply chain relationship between data labeling labour platforms and their major clients demanding high-quality labeled datasets. A 2023 investigation by The Verge found that data labelers in Kenya who completed labeling tasks for Remotasks did not know that Remotasks was in fact a subsidiary of the better-known data company Scale AI, which boasts clients like OpenAI, Meta, Microsoft and even U.S. government agencies. Nor is Remotasks mentioned anywhere on Scale AI's website, or Scale AI on Remotasks', carefully obscuring their business relationship from the public eye. Such opaque business relationships between data companies and their subsidiaries are a common trend: Surge AI, a similar data company, is also speculated to be the parent company of the smaller platforms Taskup.ai and DataAnnotation.tech that it allegedly uses to hire data labelers. This obscures the supply chain relationship between the end client (OpenAI) and the microworkers supplying their datasets. To ensure that these business relationships are kept opaque, data labelers are even warned against revealing too much about their work.

In effect, data labelers are completely disconnected from the AI developers demanding their labour, and this opacity in the AI supply chain relationship, a trend we've seen in other industries like the semiconductor supply chain, enables a lack of oversight and accountability to microworkers. Labourers are not provided adequate information and disclosures regarding their employment and by whom their work will be used. This is made worse by The Verge's finding, in the same investigation, that some Kenyan workers did not even know what they were labeling the data for. Labeling items of clothing, picking out pedestrians in a video or categorising dialogue were each small pieces of a larger project, and workers did not know what AI they were ultimately training - the projects themselves were hidden behind indecipherable code names like 'Crab Generation' or 'Pillbox Bratwurst'.

Lack of transparency around Automated Decision Making (ADM)

There is also a lack of transparency around the algorithms that surveil workers' productivity and make important decisions about their job allocation and wages. Algorithmic management, loosely defined as a set of technological surveillance techniques used to manage workforces and make automated or semi-automated decisions about workers' behaviour, is increasingly deployed in the contemporary workplace. Companies are relying on algorithms to surveil and monitor employees' productivity and performance, such as in the form of 'bossware' like mouse trackers for remote workers, timers for Amazon warehouse workers and facial recognition software that monitors an employee's expression or mood. In the gig economy context, these algorithms have the power to make critical decisions affecting workers, such as suspending a delivery driver's account without explanation or deciding how much an Uber driver gets paid for a ride depending on the time of day.

Long Read

PI, Worker Info Exchange, and App Drivers and Couriers Union have teamed up to challenge the unprecedented surveillance that gig economy workers are facing from their employers.

Many data labeling labour platforms are powered by such algorithmic decision-making (ADM) models in lieu of a human managing every phase of the labour process, from opaque job allocation systems to dynamic surge-pricing models for setting wages.

As we have previously documented in our scrutiny of the gig economy, black box ADM models subject workers, who are provided no information on nor a clear understanding of the algorithm governing their employment, to unfair work conditions, unreliable job stability and unstable wages that all threaten their privacy and freedoms.

Worker surveillance

Data labeling labour platforms vary in their use of workplace surveillance, but workers in Venezuela and Colombia have reported being subjected to strict timers while working - timers that often didn't accommodate bathroom breaks - to monitor how efficiently they were completing their labeling tasks. If they failed to complete a task in the allotted time, the task was reallocated to the pool for other workers to take. The MIT Technology Review tested this itself by creating an account on Remotasks and noticed a timer on the top left of the screen, notably 'without a clear deadline or apparent way to pause it to go to the bathroom.' This has been interpreted as an 'inactivity timer' that pushes a task back into the platform's task pool for someone else to claim if a worker leaves it incomplete for too long. This type of granular surveillance with no apparent explanation of how it works puts workers under pressure to compete against others, enabling an exploitative system in which companies get their output as fast as possible at the expense of workers' mental health and physical wellbeing. Not to mention that workers have no insight into how exactly this surveillance algorithm works, so the knowledge gap puts them in a difficult position to challenge its use.

Furthermore, the complexity of certain labeling tasks, such as deciding whether the reflection of a shirt in a mirror should be labeled as an item of clothing, means different people take different amounts of time to complete them. Consequently, this kind of highly pressurised environment can exclude a significant section of the workforce, who are effectively punished by such intrusive workplace surveillance.

Unreliable job stability

Intensive deployment of algorithmic management tools like inactivity monitors, which put undue pressure on workers at the expense of their own wellbeing, also perpetuates an unstable jobs ecosystem. Allocation algorithms determine which workers are eligible to claim which tasks based on metrics like their performance. On some data labeling microworker platforms like Appen, there is no clear system for when tasks appear in the queue, so microworkers must monitor their screens uninterruptedly to claim a job the moment it unexpectedly appears. An annotator for Remotasks in Kenya even said he has gotten into the habit of waking up every few hours at night to check his queue, because many jobs pop up, without warning, late at night. While an investigation by Fairwork did credit Remotasks for 'managing job availability', based on evidence that a team is dedicated to suggesting new jobs to workers based on profile characteristics, recommending new jobs does not negate the fact that workers have resorted to demeaning and unhealthy habits: interrupting their sleep to check the queue for tasks, and racing against the clock so that tasks are not pushed back into the queue because they were too slow. The humans-as-a-service microworker labour model exacerbates the unsustainable working conditions data labelers must endure.

The lack of transparency and disclosure of information to workers about jobs allocation algorithms has been investigated by Fairwork in the same report above, which noted that none of the microwork platforms it analysed made information available to workers about how work was allocated and when algorithms were used. How are workers to challenge these conditions or otherwise take action if they do not know how the algorithms even work, and are only left to speculate as to which scenarios they are deployed in?

Key Resources

Companies are increasingly tracking their workers and deploying unaccountable algorithms to make major employment decisions over which workers have little or no control or understanding.

While gig economy workers, content creators and warehouse operatives are at the sharp end of the algorithmic black-box, opaque and intrusive surveillance practices are embedding themselves across many industries and workplaces.

We are monitoring and recording these developments across the world so that we can catalogue harms, identify trends, and help workers know what is happening with their information.

This exploitative allocation model is symptomatic of the larger governance model favoured by Western companies, which shift their labour force around countries and regions with weaker legal protections and benefits for workers. In fact, many companies do not even inform workers of their decision to move to a different market, and practically disappear overnight. The lack of accountability to workers about these governance decisions further destabilises the job market for data labelers who are already struggling with 1) competing for available tasks, and 2) completing enough tasks per day to make a liveable (less than minimum) wage. Another annotator in Kenya reported that tasks were drying up in the region, and it was clear that the AI supply chain, which has the advantage of not needing local infrastructure, was migrating to other countries with cheaper labour like Nepal and the Philippines (until the next cheaper market appears and they set up shop there). The fluid nature of the AI supply chain means it can function as 'an assembly line that can be endlessly and instantly reconfigured, moving to wherever there is the right combination of skills, bandwidth, and wages'. And as these parts of the assembly line (the data labelers) are shifted around, companies never inform labourers of the governance decision to move to a new market, just months after luring them in with that corporate refrain of promised job opportunities and good wages.

Surge-model wages

There have also been reports of algorithms setting dynamic wages on data labeling labour platforms. Remotasks workers, for instance, speculated that their pay may be algorithmically determined, and this was confirmed by former Scale AI employees who said that pay was determined through 'a surge-pricing-like mechanism that adjusts for how many annotators are available and how quickly the data is needed.' This is an exploitative pricing model we often see in gig economy platforms like Uber, which uses an opaque algorithm to calculate driver pay based on factors like 'the rider-to-driver demand' and whether it is a busy period. In both cases, workers have no insight into how the dynamic pricing algorithm calculates their wages; only the company knows what really goes on inside the black box. This causes significant detriment to the livelihoods of microworkers, as a ride that might pay £26 for one driver might pay £46 for another for the same journey. Workers are subjected to the whims of an unpredictable, algorithmically controlled environment set by the company, stripping them of agency and autonomy as their wages and job opportunities are determined by a dynamic algorithm they do not understand.
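To illustrate what a 'surge-pricing-like mechanism' could look like, here is a purely hypothetical sketch. The function name, inputs and weights are all invented; the real platforms' formulas are not public, which is precisely the problem for workers.

```python
# Hypothetical illustration only: the real pay-setting algorithms are not public.

def task_pay(base_rate, annotators_available, annotators_needed, urgency_multiplier):
    """Return a per-task payout that rises when annotators are scarce or the client is in a hurry,
    and falls when many workers are competing for the same tasks."""
    scarcity = annotators_needed / max(annotators_available, 1)
    return round(base_rate * min(scarcity, 3.0) * urgency_multiplier, 2)

# The same task can pay very differently depending on conditions the worker cannot see:
print(task_pay(base_rate=1.00, annotators_available=500, annotators_needed=100, urgency_multiplier=1.0))  # 0.2
print(task_pay(base_rate=1.00, annotators_available=50, annotators_needed=100, urgency_multiplier=1.5))   # 3.0
```

Because none of these inputs or weights are visible to workers, two labelers doing identical work can receive very different pay with no way of knowing why.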

It all comes back to the gig economy

Evidently, the issues we're seeing in the data labeling ecosystem are the same symptoms we've been seeing in platform work across the gig economy. For one, there are the intrusive and exploitative working conditions (e.g., strict timers inhibiting bathroom breaks) that we've similarly seen for Amazon warehouse workers and Facebook content moderators. This is exacerbated (and facilitated) by the lack of a clear contract and fair terms and conditions for microworkers, with some data labelers coming forward about how they were quietly ghosted by their managers without explanation. This exploitative employer-employee relationship creates an unstable and unreliable environment that workers, uninformed and without a formal employment contract, have little means to challenge.

Then there's also the matter of dynamic, unstandardised wages (also worsened by the lack of a clear contract). The dynamic pricing model subjects workers to an unreliable and unpredictable flow of income at the mercy of a black box algorithm whose inner workings they have no information about. What one task pays could be double or half of another's the next day, and workers can neither foresee nor understand this. It also cannot be ignored that microworker platforms typically take a percentage of the worker's task payment for themselves, and workers are not usually told what this percentage is or how it is calculated.

There is also the issue of inadequate pay in the case of unpaid training courses. Many data labeling platforms require new workers to complete unpaid training courses before they can qualify for paid tasks, but participants have noted that these courses take a significant amount of time, convoluted by lengthy tests and conflicting instructions (e.g., is the reflection of a shirt in a mirror considered a shirt in a photo?) symptomatic of the mechanical thinking machines require. In fact, Fairwork's investigation found that the 250 data labelers it surveyed spent over 6 hours on average on unpaid activities, including looking for jobs and taking unpaid qualifying tests.

At a broader level, beyond the algorithmic pricing model, there is simply the lack of a standardised wage mechanism, endorsed by the capitalist governance model. This can be seen in the gaping pay disparities between countries (e.g., U.S. wages for a task compared to Kenyan wages) and even within countries among different types of specialist labeling (e.g., $23 per hour for expert Finnish speakers versus $5.64 for expert Bulgarian writers). Additionally, data labelers in Kenya reported being suddenly dropped by Remotasks when the company left the Kenyan market, with hours of wages never paid out to the workers and no redress mechanisms to get those wages back.

This lack of standardised wages also raises larger concerns around the abuse and exploitation of the global microworker labour market. Where U.S.-based annotators might make $10-$25 an hour for the same type of work, Kenyan annotators might be making as little as $2 an hour - without even knowing that the company they are labeling data for is a corporation as big as OpenAI. In May 2024, 97 data labellers, content moderators and AI workers in Nairobi wrote an open letter to President Biden, on the 60th anniversary of US-Kenyan diplomatic relations, presenting a list of demands to address the exploitation and abuse of Kenyan workers in the US Big Tech supply chain. They detail the grueling nature of their work, including 'label[ing] images and text to train generative AI tools like ChatGPT for OpenAI. Our work involves watching murder and beheadings, child abuse and rape, pornography and bestiality, often for more than 8 hours a day' - for less than $2 an hour.

The future of data labeling in the microworker economy

There are many other concerns yet to be addressed around the exploitative and intensive nature of the data labeling labour marketplace. In addition to the issues we've discussed around workplace surveillance, algorithmic job allocation and unstandardised wages, there is also the wellbeing of workers labeling violent or graphic content, and the lack of a formalised contractual relationship with employers that would endow workers with rights like redress.

Automated data labeling

It is uncertain how the data labeling labour landscape might evolve. There are increasing efforts to automate data labeling to further scale up and expand its capacity and reach. One such automated method is assisted labeling, in which machine learning algorithms that can identify patterns and trends in the data pre-label datapoints, which are then passed on to human evaluators to approve or reject. Functionally, this is just a slightly simplified version of the existing human labeling scheme.

Figure 4. Screenshot of Amazon SageMaker Ground Truth's assisted labeling model.
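A minimal sketch of the assisted-labeling workflow described above is given below: a stand-in 'model' proposes labels and a human reviewer approves or corrects each one. The pre-labeling rule and labels here are invented placeholders, not SageMaker's actual interface.

```python
# Illustrative sketch of assisted labeling: the machine pre-labels, a human approves or corrects.

def pre_label(text):
    """Toy stand-in for a machine-learning pre-labeler."""
    return "positive" if "good" in text.lower() else "negative"

def human_review(text, proposed):
    """In a real pipeline this step is the labeler's dashboard; here we simply ask on stdin."""
    answer = input(f"Text: {text!r}\nProposed label: {proposed}. Accept? [y/n] ")
    return proposed if answer.strip().lower() == "y" else input("Corrected label: ")

unlabeled = ["The product is good value", "Arrived broken and late"]
labeled = [(text, human_review(text, pre_label(text))) for text in unlabeled]
print(labeled)
```

The human is still making the final judgment on every datapoint; the machine has only narrowed their choices.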

Assisted labeling is typically deployed in combination with active learning, another automated data labeling method offered by services such as Amazon SageMaker Ground Truth. In this approach, humans first annotate a portion of an unlabeled dataset, which is used to build a validated model for 'auto-labeling'; the model then labels the remaining data, and its confidence scores are compared against a threshold, with high-confidence labels accepted automatically and the rest routed back to human annotators, a process that repeats recursively as the model improves.
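The loop might be sketched as follows, with a trivial keyword-based 'model' standing in for the real one; the threshold value and function names are assumptions, chosen only to show the structure of auto-labeling with a confidence cut-off.

```python
# Schematic sketch of an active-learning / auto-labeling loop with a confidence threshold.

AUTO_LABEL_THRESHOLD = 0.9  # assumed value, not a documented default of any service

def train(labeled):
    """Toy 'training': remember which words appeared in positively labeled texts."""
    positive_words = {
        word for text, label in labeled if label == "positive" for word in text.lower().split()
    }
    def predict(text):
        words = text.lower().split()
        score = len(positive_words & set(words)) / max(len(words), 1)
        label = "positive" if score >= 0.5 else "negative"
        confidence = max(score, 1 - score)
        return label, confidence
    return predict

def active_learning_round(labeled, unlabeled):
    """One round: auto-label what the model is confident about, route the rest to humans."""
    predict = train(labeled)
    needs_human = []
    for text in unlabeled:
        label, confidence = predict(text)
        if confidence >= AUTO_LABEL_THRESHOLD:
            labeled.append((text, label))      # accepted automatically
        else:
            needs_human.append(text)           # sent to human annotators
    return labeled, needs_human
```

Each round shrinks the pool of data needing human attention, but a human labeler remains in the loop for everything the model is unsure about.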

Other companies are exploring programmatic labeling, which uses labeling functions to annotate datapoints: each function assigns a label when a datapoint matches a specified known category (e.g., a rule that spell-checks a response) and abstains otherwise, with abstained datapoints passed on to human labelers to annotate. This method aims to cut down on the amount of manual labeling required of human labelers.
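As a rough sketch of that idea (loosely in the style of open-source tools like Snorkel, though the functions and labels below are invented), labeling functions either assign a label to datapoints they recognise or abstain, and abstained datapoints fall through to human labelers.

```python
# Illustrative programmatic labeling: rule-based labeling functions that label or abstain.

ABSTAIN = None

def lf_mentions_refund(text):
    return "refund_request" if "refund" in text.lower() else ABSTAIN

def lf_mentions_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_thanks]

def programmatic_label(texts):
    auto_labeled, needs_human = [], []
    for text in texts:
        results = [lf(text) for lf in LABELING_FUNCTIONS]
        votes = [r for r in results if r is not ABSTAIN]
        if votes:
            auto_labeled.append((text, votes[0]))  # simplistic: first non-abstaining rule wins
        else:
            needs_human.append(text)               # no rule matched: a human must label it
    return auto_labeled, needs_human

print(programmatic_label(["I want a refund", "Thank you!", "The app crashes on login"]))
```

Real systems combine many such functions and resolve their disagreements statistically, but the division of labour is the same: rules handle the easy cases, humans handle the rest.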

Nonetheless, all of the above techniques to automate data labeling still require a human in the loop to some degree. It is difficult to ascertain how this might impact workers - it could improve working conditions by shielding labelers from certain types of harmful content and cutting down their labeling time, but it does not alleviate any of the larger hegemonic abuses of the gig economy exploiting microworkers. Human labelers will still be needed to some extent for as long as AI exists, and automating some labeling tasks will not improve workers' unpredictable dynamic wage structure, nor the job allocation ecosystem, nor the fact that they may continue to be surveilled by algorithms as they perform the work. Automating data labeling has, once again, only been a move by big companies to boost their own efficiency and productivity, not to address workers' concerns.

Looking ahead at the regulatory landscape

The future of the data labeling industry may well be a question of the AI supply chain as a whole and how the regulatory landscape responds to it. With the increasing demand for high-quality labeled datasets, it is hard to say how the labour ecosystem for data labeling might change to better protect the rights of workers exploited by a demanding supply chain. We see this throughout numerous industries, in which companies knowingly exploit workers to meet their supply chain needs and avoid responsibility and accountability for the working conditions of these on-the-ground workers, such as in the case of dangerous cobalt mining for producing batteries.

Advocacy

Privacy International sent a letter expressing its concerns and observations on the position of the Council in the current interinstitutional negotiations (trilogues) of the Platform Workers Directive (PWD). 

Data labelers are carrying the AI supply chain on their backs, and yet their status in the gig economy ecosystem means they can and will be exploited by black box algorithms and unreasonable working conditions. The fight for better treatment of microworkers in the gig economy has been a long one, riddled with more and more platforms emerging in different sectors faster than policy can keep up; it took years for the European Parliament to even consider enshrining the rights of platform workers in the form of the Platform Work Directive. The International Labour Organisation (ILO) has published a report on Realising Decent Work in the Platform Economy (to be discussed at the 2025 and 2026 International Labour Conferences) that could lead to a new international labour standard based on the way countries are approaching the prevalence and challenges of platform work. As the microworker model is increasingly used by big AI companies looking for cheap and quick dataset labeling, it is important for policy and advocacy to keep pace by exposing the harmful and exploitative practices of data labeling platforms and their parent companies.