Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large datasets. However, manually labelling massive amounts of data is time-consuming and laborious.

Using pre-labelled datasets can be problematic, as evidenced by MIT having to pull its 80 Million Tiny Images dataset. For those unaware, the popular dataset was found to contain thousands of racist and misogynistic labels that could have been used to train AI models.

AI News caught up with Devang Sachdev, VP of Marketing at Snorkel AI, to find out how the company is easing the laborious process of labelling data in a safe and effective way.

AI News: How is Snorkel helping to ease the laborious process of labelling data?

Devang Sachdev: Snorkel Flow changes the paradigm of training data labelling from the traditional manual process—which is slow, expensive, and unadaptable—to a programmatic process that we’ve proven accelerates training data creation 10x-100x.

Users are able to capture their knowledge and existing resources (both internal, e.g., ontologies, and external, e.g., foundation models) as labelling functions, which are applied to training data at scale.

Unlike a rules-based approach, these labelling functions can be imprecise, lack coverage, and conflict with each other. Snorkel Flow uses theoretically grounded weak supervision techniques to intelligently combine the labelling functions to auto-label your training data set en masse using an optimal Snorkel Flow label model.
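As a rough illustration of the idea described above, the sketch below applies three noisy labelling functions to a few messages and combines their votes. Everything in it is an assumption made for illustration: the function names, the spam/ham task, and the simple majority vote, which stands in for (and is much cruder than) the label model mentioned in the interview. It is not Snorkel Flow's API.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labelling functions: each encodes one noisy heuristic
# and abstains when it has no opinion about a datapoint.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELLING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def apply_lfs(text):
    """Return one vote per labelling function for a single datapoint."""
    return [lf(text) for lf in LABELLING_FUNCTIONS]

def majority_vote(votes):
    """Combine votes, ignoring abstentions; abstain on ties or when nobody votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

docs = [
    "Claim your prize at http://example.com now",  # two functions agree
    "See you soon",                                # only one function fires
    "Win a prize today",                           # two functions conflict
]
labels = [majority_vote(apply_lfs(d)) for d in docs]
print(labels)
```

The third message shows why a combiner is needed at all: one function votes spam, another votes ham, and a plain majority vote can only abstain. A weak supervision label model would instead weigh the functions by how reliable it estimates each one to be.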

Using this initial training data set, users train a larger machine learning model of their choice (with the click of a button from our ‘Model Zoo’) in order to:

  1. Generalise beyond the output of the label model.
  2. Generate model-guided error analysis to know exactly where the model is confused and how to iterate. This includes auto-generated suggestions, as well as analysis tools to explore and tag data to identify what labelling functions to edit or add. 

This rapid, iterative, and adaptable process becomes much more like software development than a tedious, manual process that cannot scale. And much like software development, it allows users to inspect and adapt the code that produced training data labels.

AN: Are there dangers to implementing too much automation in the labelling process?

DS: The labelling process can inherently introduce dangers simply because, as humans, we’re fallible. Human labellers can be fatigued, make mistakes, or have a conscious or unconscious bias which they encode into the model via their manual labels.

When mistakes or biases occur, and they will, the danger is that the model or downstream application essentially amplifies the isolated label. These amplifications can lead to consequential impacts at scale: for example, inequities in lending, discrimination in hiring, missed diagnoses for patients, and more. Automation can help.

In addition to these dangers—which have major downstream consequences—there are also more practical risks of attempting to automate too much or taking the human out of the loop of training data development.

Training data is how humans encode their expertise to machine learning models. While there are some cases where specialised expertise isn’t required to label data, in most enterprise settings, it is. For this training data to be effective, it needs to capture the fullness of subject matter experts’ knowledge and the diverse resources they rely on to make a decision on any given datapoint.

However, as we have all experienced, having highly in-demand experts label data manually one-by-one simply isn’t scalable. It also leaves an enormous amount of value on the table by losing the knowledge behind each manual label. We must take a programmatic approach to data labelling and engage in data-centric, rather than model-centric, AI development workflows. 

Here’s what this entails: 

  • Elevating how domain experts label training data from tediously labelling one-by-one to encoding their expertise—the rationale behind what would be their labelling decisions—in a way that can be applied at scale. 
  • Using weak supervision to intelligently auto-label at scale—this is not auto-magic, of course; it’s an inherently transparent, theoretically grounded approach. Every training data label that’s applied in this step can be inspected to understand why it was labelled as it was. 
  • Bringing experts into the core AI development loop to assist with iteration and troubleshooting. Using streamlined workflows within the Snorkel Flow platform, data scientists and subject matter experts are able to collaborate to identify the root cause of error modes and how to correct them by making simple labelling function updates, additions, or, at times, correcting ground truth or “gold standard” labels that error analysis reveals to be wrong.

AN: How easy is it to identify and update labels based on real-world changes?

DS: A fundamental value of Snorkel Flow’s data-centric approach to AI development is adaptability. We all know that real-world changes are inevitable, whether that’s production data drift or business goals that evolve. Because Snorkel Flow uses programmatic labelling, it’s extremely efficient to respond to these changes.

In the traditional paradigm, if the business comes to you with a change in objectives (say, they were classifying documents three ways but now need a 10-way schema), you’d effectively need to relabel your training data set (often thousands or hundreds of thousands of data points) from scratch. This would mean weeks or months of work before you could deliver on the new objective.

In contrast, with Snorkel Flow, updating the schema is as simple as writing a few additional labelling functions to cover the new classes and applying weak supervision to combine all of your labelling functions and retrain your model. 
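To make the schema-change point concrete, here is a minimal sketch in which a new class is introduced by adding one labelling function and relabelling the corpus programmatically. The helper names (`keyword_lf`, `label_corpus`), the document classes, and the "first non-abstaining vote wins" combiner are all hypothetical simplifications, not Snorkel Flow functionality.

```python
ABSTAIN = None

def keyword_lf(label, keywords):
    """Build a labelling function that votes `label` when any keyword appears."""
    def lf(text):
        lowered = text.lower()
        return label if any(k in lowered for k in keywords) else ABSTAIN
    return lf

def label_corpus(texts, lfs):
    """Label each text with the first non-abstaining vote (a deliberately naive combiner)."""
    labelled = []
    for text in texts:
        votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
        labelled.append(votes[0] if votes else ABSTAIN)
    return labelled

corpus = [
    "Invoice for services rendered in March",
    "Employment contract between the parties",
    "Quarterly report on revenue",
    "Non-disclosure agreement, mutual",
]

# Original three-way schema.
lfs = [
    keyword_lf("invoice", ["invoice"]),
    keyword_lf("contract", ["contract", "agreement"]),
    keyword_lf("report", ["report"]),
]
before = label_corpus(corpus, lfs)

# Schema change: split NDAs out as a new class by adding one function
# ahead of the broader "contract" rule, then relabel the whole corpus.
lfs.insert(0, keyword_lf("nda", ["non-disclosure"]))
after = label_corpus(corpus, lfs)

print(before)
print(after)
```

The documents themselves are never touched by hand: only the set of labelling functions changes, and the labels are regenerated from it.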

To identify data drift in production, you can rely on your monitoring system or use Snorkel Flow’s production APIs to bring live data back into the platform and see how your model performs against real-world data.

As you spot performance degradation, you’re able to follow the same workflow: using error analysis to understand patterns, apply auto-suggested actions, and iterate in collaboration with your subject matter experts to refine and add labelling functions. 

AN: MIT was forced to pull its ‘80 Million Tiny Images’ dataset after it was found to contain racist and misogynistic labels due to its use of an “automated data collection procedure” based on WordNet. How is Snorkel ensuring that it avoids this labelling problem that is leading to harmful biases in AI systems?

DS: Bias can start anywhere in the system (pre-processing, post-processing, task design, modelling choices, etc.), and in particular with issues in labelled training data.

To understand underlying bias, it is important to understand the rationale used by labellers. This is impractical when every datapoint is hand labelled and the logic behind labelling it one way or another is not captured. Moreover, information about label author and dataset versioning is rarely available. Often labelling is outsourced or in-house labellers have moved on to other projects or organisations.

Snorkel AI’s programmatic labelling approach helps discover, manage, and mitigate bias. Instead of discarding the rationale behind each manually labelled datapoint, Snorkel Flow, our data-centric AI platform, captures the labellers’ (subject matter experts, data scientists, and others) knowledge as a labelling function and generates probabilistic labels using theoretically grounded algorithms encoded in a novel label model.

With Snorkel Flow, users can understand exactly why a certain datapoint was labelled the way it is. This process, along with label function and label dataset versioning, allows users to audit, interpret, and even explain model behaviours. This shift from manual to programmatic labelling is key to managing bias.
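The auditability described here can be sketched in a few lines. The example below produces a probabilistic label for one datapoint and records how much each labelling function contributed to it. It is an assumption-heavy illustration: the function names and their accuracies are invented, the functions are treated as independent with a uniform class prior, and a real label model would estimate the accuracies from data rather than take them as given.

```python
import math

# Hypothetical labelling functions for a binary task. Each returns
# +1, -1, or 0 (abstain) and carries an ASSUMED accuracy.
LFS = {
    "mentions_refund": (lambda t: 1 if "refund" in t.lower() else 0, 0.9),
    "mentions_thanks": (lambda t: -1 if "thank" in t.lower() else 0, 0.7),
    "has_exclamation": (lambda t: 1 if "!" in t else 0, 0.6),
}

def explain(text):
    """Return P(label = +1) and each function's log-odds contribution."""
    contributions = {}
    for name, (lf, acc) in LFS.items():
        vote = lf(text)
        # More accurate functions move the log-odds further; abstentions add 0.
        contributions[name] = vote * math.log(acc / (1 - acc))
    log_odds = sum(contributions.values())
    prob = 1 / (1 + math.exp(-log_odds))
    return prob, contributions

prob, why = explain("Thanks, but I want a refund")
print(round(prob, 3))
for name, value in why.items():
    print(name, round(value, 3))
```

Because the contributions are stored per function, a reviewer can see which heuristic pushed a label in which direction, and a biased or outdated function can be edited or removed and the labels regenerated.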

AN: A group led by Snorkel researcher Stephen Bach recently had their paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG) published. I’d direct readers to the paper for the full details, but can you give us a brief overview of what it is and how it improves over existing WordNet-based methods?

DS: ZSL-KG improves graph-based zero-shot learning in two ways: richer models and richer data. On the modelling side, ZSL-KG is based on a new type of graph neural network called a transformer graph convolutional network (TrGCN).

Many graph neural networks learn to represent nodes in a graph through linear combinations of neighbouring representations, which is limiting. TrGCN uses small transformers at each node to combine neighbourhood representations in more complex ways.
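The contrast between a linear combination of neighbours and a learned, input-dependent one can be illustrated with a toy example. The sketch below is not the TrGCN architecture from the paper: it compares a plain mean over neighbour vectors with a single attention step (the core operation inside a transformer), using made-up two-dimensional vectors and no learned parameters.

```python
import math

def mean_aggregate(neighbours):
    """Linear aggregation: average the neighbour vectors element-wise."""
    n = len(neighbours)
    return [sum(col) / n for col in zip(*neighbours)]

def attention_aggregate(query, neighbours):
    """Weight each neighbour by the softmax of its scaled dot product with `query`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, nb)) / scale for nb in neighbours]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * nb[i] for w, nb in zip(weights, neighbours))
             for i in range(len(query))]
    return mixed, weights

node = [1.0, 0.0]
neighbours = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

mean_vec = mean_aggregate(neighbours)
attn_vec, attn_weights = attention_aggregate(node, neighbours)

print(mean_vec)
print(attn_vec, attn_weights)
```

With the mean, the first and third neighbours cancel each other out in the first dimension. With attention, the weights depend on the vectors themselves, so the neighbour most similar to the node dominates and that information survives.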

On the data side, ZSL-KG uses common sense knowledge graphs, which use natural language and graph structures to make explicit many types of relationships among concepts. They are much richer than the typical ImageNet subtype hierarchy.

AN: Gartner designated Snorkel a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you think makes you stand out from the competition?

DS: Data labelling is one of the biggest challenges for enterprise AI. Most organisations realise that current approaches are unscalable and often ridden with quality, explainability, and adaptability issues. Snorkel AI not only provides a solution for automating data labelling but also uniquely offers an AI development platform to adopt a data-centric approach and leverage knowledge resources including subject matter experts and existing systems.

In addition to the technology, Snorkel AI brings together 7+ years of R&D (which began at the Stanford AI Lab) and a highly-talented team of machine learning engineers, success managers, and researchers to successfully assist and advise customer development as well as bring new innovations to market.

Snorkel Flow unifies all the necessary components of a programmatic, data-centric AI development workflow—training data creation/management, model iteration, error analysis tooling, and data/application export or deployment—while also being completely interoperable at each stage via a Python SDK and a range of other connectors.

This unified platform also provides an intuitive interface and streamlined workflow for critical collaboration between SME annotators, data scientists, and other roles, to accelerate AI development. It allows data science and ML teams to iterate on both data and models within a single platform and use insights from one to guide the development of the other, leading to rapid development cycles.

The AI industry continued to thrive this year as companies sought ways to support business continuity through rapidly-changing situations. For those already invested, many are now doubling-down after reaping the benefits.

As we wrap up the year, it’s time to look ahead at what to expect from the AI industry in 2022.

Tackling bias

Our ‘Ethics & Society’ category got more use than most others this year, and with good reason. AI cannot thrive when it’s not trusted.

Biases are present in algorithms that are already causing harm. They’ve been the subject of many headlines, including a number of ours, and must be addressed for the public to have confidence in wider adoption.

Explainable AI (XAI) is a partial solution to the problem. XAI is artificial intelligence in which the results of the solution can be understood by humans.

Robert Penman, Associate Analyst at GlobalData, comments:

“2022 will see the further rollout of XAI, enabling companies to identify potential discrimination in their systems’ algorithms. It is essential that companies correct their models to mitigate bias in data. Organisations that drag their feet will face increasing scrutiny as AI continues to permeate our society, and people demand greater transparency. For example, in the Netherlands, the government’s use of AI to identify welfare fraud was found to violate European human rights.

Reducing human bias present in training datasets is a huge challenge in XAI implementation. Even tech giant Amazon had to scrap its in-development hiring tool because it was claimed to be biased against women.

Further, companies will be desperate to improve their XAI capabilities—the potential to avoid a PR disaster is reason enough.”

To that end, expect a large number of acquisitions of startups specialising in synthetic data training in 2022.

Smoother integration

Many companies don’t know how to get started on their AI journeys. Around 30 percent of enterprises plan to incorporate AI into their company within the next few years, but 91 percent foresee significant barriers and roadblocks.

If the confusion and anxiety that surrounds AI can be tackled, it will lead to much greater adoption.

Dr Max Versace, PhD, CEO and Co-Founder of Neurala, explains:

“Similar to what happened with the introduction of WordPress for websites in the early 2000s, platforms that resemble a ‘WordPress for AI’ will simplify building and maintaining AI models.

In manufacturing for example, AI platforms will provide integration hooks, hardware flexibility, ease of use by non-experts, the ability to work with little data, and, crucially, a low-cost entry point to make this technology viable for a broad set of customers.”

AutoML platforms will thrive in 2022 and beyond.

From the cloud to the edge

The migration of AI from the cloud to the edge will accelerate in 2022.

Edge processing has a plethora of benefits over relying on cloud servers, including speed, reliability, privacy, and lower costs.

Versace commented:

“Increasingly, companies are realising that the way to build a truly efficient AI algorithm is to train it on their own unique data, which might vary substantially over time. To do that effectively, the intelligence needs to directly interface with the sensors producing the data. 

From there, AI should run at a compute edge, and interface with cloud infrastructure only occasionally for backups and/or increased functionality. No critical process – for example, in a manufacturing plant – should exclusively rely on cloud AI, exposing the manufacturing floor to connectivity/latency issues that could disrupt production.”

Expect more companies to realise the benefits of migrating from cloud to edge AI in 2022.

Doing more with less

One of the early concerns about the AI industry was that it would be dominated by “big tech” due to the gargantuan amount of data they’ve collected.

However, innovative methods are now allowing algorithms to be trained with less information. Training using smaller but more unique datasets for each deployment could prove to be more effective.

We predict more startups will prove the world doesn’t have to rely on big tech in 2022.

Human-powered AI

While XAI systems will provide results which can be understood by humans, the decisions made by AIs will be more useful because they’ll be human-powered.

Varun Ganapathi, PhD, Co-Founder and CTO at AKASA, said:

“For AI to truly be useful and effective, a human has to be present to help push the work to the finish line. Without guidance, AI can’t be expected to succeed and achieve optimal productivity. This is a trend that will only continue to increase.

Ultimately, people will have machines report to them. In this world, humans will be the managers of staff – both other humans and AIs – that will need to be taught and trained to be able to do the tasks they’re needed to do.

Just like people, AI needs to constantly be learning to improve performance.”

Greater human input also helps to build wider trust in AI. Involving humans helps to counter narratives about AI replacing jobs and concerns that decisions about people’s lives could be made without human qualities such as empathy and compassion.

Expect human input to lead to more useful AI decisions in 2022.

Avoiding captivity

The telecoms industry is currently pursuing an innovation called Open RAN which aims to help operators avoid being locked to specific vendors and help smaller competitors disrupt the relative monopoly held by a small number of companies.

Enterprises are looking to avoid being held in captivity by any AI vendor.

Doug Gilbert, CIO and Chief Digital Officer at Sutherland, explains:

“Early adopters of rudimentary enterprise AI embedded in ERP / CRM platforms are starting to feel trapped. In 2022, we’ll see organisations take steps to avoid AI lock-in. And for good reason. AI is extraordinarily complex.

When embedded in, say, an ERP system, control, transparency, and innovation are handed over to the vendor, not the enterprise. AI shouldn’t be treated as a product or feature: it’s a set of capabilities. AI is also evolving rapidly, with new AI capabilities and continuously improved methods of training algorithms.

To get the most powerful results from AI, more enterprises will move toward a model of combining different AI capabilities to solve unique problems or achieve an outcome. That means they’ll be looking to spin up more advanced and customisable options and either deprioritise AI features in their enterprise platforms or wind down those expensive but basic AI features altogether.”

In 2022 and beyond, we predict enterprises will favour AI solutions that avoid lock-in.

Chatbots get smart

Hands up if you’ve ever screamed (internally or externally) that you just want to speak to a human when dealing with a chatbot—I certainly have, more often than I’d care to admit.

“Today’s chatbots have proven beneficial but have very limited capabilities. Natural language processing will start to be overtaken by neural voice software that provides near-real-time natural language understanding (NLU),” commented Gilbert.

“With the ability to achieve comprehensive understanding of more complex sentence structures, even emotional states, break down conversations into meaningful content, quickly perform keyword detection and named entity recognition, NLU will dramatically improve the accuracy and the experience of conversational AI.”

In theory, this will have two results:

  • Augmenting human assistance in real-time, such as suggesting responses based on behaviour or skill level.
  • Changing how customers or clients perceive their treatment, with NLU delivering a more natural and positive experience.

In 2022, chatbots will get much closer to offering a human-like experience.

It’s not about size, it’s about the quality

A robust AI system requires two things: a functioning model and underlying data to train that model. Collecting huge amounts of data is a waste of time if it’s not of high quality and labelled correctly.

Gabriel Straub, Chief Data Scientist at Ocado Technology, said:

“Andrew Ng has been speaking about data-centric AI, about how improving the quality of your data can often lead to better outcomes than improving your algorithms (at least for the same amount of effort).

So, how do you do this in practice? How do you make sure that you manage the quality of data at least as carefully as the quantity of data you collect?

There are two things that will make a big difference: 1) making sure that data consumers are always at the heart of your data thinking and 2) ensuring that data governance is a function that enables you to unlock the value in your data, safely, rather than one that focuses on locking down data.”

Expect the AI industry to make the quality of data a priority in 2022.


Nvidia and Microsoft have developed an incredible 530 billion parameter AI model, but it still suffers from bias.

The pair claim their Megatron-Turing Natural Language Generation (MT-NLG) model is the “most powerful monolithic transformer language model trained to date”.

For comparison, OpenAI’s much-lauded GPT-3 has 175 billion parameters.

The duo trained their impressive model on 15 datasets with a total of 339 billion tokens. Various sampling weights were given to each dataset to emphasise those of higher quality.

The OpenWebText2 dataset – consisting of 14.8 billion tokens – was given the highest sampling weight of 19.3 percent. This was followed by CC-2021-04 – consisting of 82.6 billion tokens, the largest amount of all the datasets – with a weight of 15.7 percent. Rounding out the top three is Books3 – a dataset with 25.7 billion tokens – that was given a weight of 14.3 percent.
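To put these sampling weights in perspective, the sketch below compares each dataset's weight with its natural share of the 339 billion tokens quoted above. The helper name `upsampling_factor` is hypothetical, and the calculation assumes that a dataset sampled purely in proportion to its size would receive a weight equal to its token share. Only the three datasets named here are included.

```python
TOTAL_TOKENS_B = 339.0  # total tokens across all 15 datasets, in billions

# (tokens in billions, sampling weight in percent), as quoted in the article
datasets = {
    "OpenWebText2": (14.8, 19.3),
    "CC-2021-04": (82.6, 15.7),
    "Books3": (25.7, 14.3),
}

def upsampling_factor(tokens_b, weight_pct, total_b=TOTAL_TOKENS_B):
    """Ratio of the sampling weight to the dataset's share of all tokens."""
    natural_share_pct = 100.0 * tokens_b / total_b
    return weight_pct / natural_share_pct

factors = {
    name: round(upsampling_factor(tokens, weight), 2)
    for name, (tokens, weight) in datasets.items()
}

for name, factor in factors.items():
    print(f"{name}: sampled at {factor}x its natural share")
```

Under that assumption, OpenWebText2 is sampled at roughly 4.4 times its natural share and Books3 at roughly 1.9 times, while CC-2021-04, despite being the largest dataset, is sampled at only about 0.64 times its share.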

However, despite the large increase in parameters, MT-NLG suffered from the same issues as its predecessors.

“While giant language models are advancing the state of the art on language generation, they also suffer from issues such as bias and toxicity,” the companies explained.

“Our observations with MT-NLG are that the model picks up stereotypes and biases from the data on which it is trained.”

Nvidia and Microsoft say they remain committed to addressing this problem.

