Adversarial Attacks: Why Is A Neural Network Easy To Trick? - Alternative View

In recent years, as deep learning systems have become more prevalent, researchers have demonstrated how adversarial examples can affect everything from simple image classifiers to cancer diagnosis systems, and can even create life-threatening situations. For all their danger, however, adversarial examples remain poorly understood, and researchers keep asking: can this problem be solved?

What is an adversarial attack? It is a way of tricking a neural network into producing an incorrect result. Adversarial examples are mainly used in research to test the robustness of models against unusual inputs. In practice, for example, you can change a few pixels in a photo of a panda so that the neural network becomes confident it is looking at a gibbon, even though all the attacker has added is a little carefully chosen "noise".

Adversarial attack: how to trick a neural network?

New work from the Massachusetts Institute of Technology points to a possible way out of this problem. Solving it could lead to far more reliable deep learning models that are much harder to manipulate maliciously. But first, let's look at the basics of adversarial examples.

As you know, the power of deep learning comes from its superior ability to recognize patterns (regularities) in data. Feed a neural network tens of thousands of labeled animal photos, and it learns which patterns are associated with a panda and which with a monkey. It can then use those patterns to recognize new images of animals it has never seen before.

But deep learning models are also very fragile. Because an image recognition system relies only on pixel patterns, not on any conceptual understanding of what it sees, it is easy to trick it into seeing something completely different simply by disturbing those patterns in the right way. The classic example: add a little noise to a panda image and the system classifies it as a gibbon with almost 100 percent confidence. That noise is the adversarial attack.
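
To make this concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one common way such noise is generated. The untrained torchvision model, the random tensor and the class index are placeholders for illustration, not the setup from the work discussed below.

```python
# A minimal FGSM sketch: nudge every pixel slightly in the direction that
# most increases the classifier's loss. Placeholders: an untrained ResNet
# and a random tensor stand in for a real classifier and a panda photo.
import torch
import torch.nn.functional as F
import torchvision.models as models

def fgsm_attack(model, image, label, epsilon=0.007):
    """Return `image` plus a barely visible adversarial perturbation."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()

model = models.resnet18(weights=None).eval()  # load pretrained weights for a real demo
image = torch.rand(1, 3, 224, 224)            # stand-in for a preprocessed panda photo
label = torch.tensor([388])                   # ImageNet index 388 = "giant panda"

adversarial = fgsm_attack(model, image, label)
print(model(adversarial).argmax(dim=1))       # with a trained model, often no longer a panda
```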

For several years, scientists have been observing this phenomenon, especially in computer vision systems, without really knowing how to get rid of such vulnerabilities. In fact, work presented last week at ICLR, a major artificial intelligence research conference, calls into question the inevitability of adversarial attacks. Until now it seemed that no matter how many panda images you feed an image classifier, there will always be some perturbation that breaks the system.

But the new work from MIT shows that we have been thinking about adversarial attacks the wrong way. Instead of devising ways to collect more and better data to feed the system, we need to fundamentally rethink how we train it.

The work demonstrates this by revealing a rather interesting property of adversarial examples that helps explain why they are effective. The trick is that the seemingly random noise or stickers that confuse a neural network actually exploit highly specific, subtle patterns that the vision system has learned to associate strongly with particular objects. In other words, the machine is not malfunctioning when it sees a gibbon where we see a panda. It is actually detecting an arrangement of pixels, invisible to humans, that appeared far more often in pictures of gibbons than in pictures of pandas during training.

The scientists demonstrated this with an experiment: they created a dataset of dog images that were all altered so that a standard image classifier mistakenly identified them as cats. They then labeled these images as "cats" and used them to train a new neural network from scratch. After training, they showed this network real images of cats, and it correctly identified them all as cats.
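
A rough sketch of how such an experiment might be set up, assuming a simple dog/cat classifier; `old_model`, `dog_images`, `new_model` and the class index are hypothetical names, and the multi-step perturbation is a generic stand-in rather than the authors' actual code.

```python
# Build a mislabeled training set: dog images perturbed until the old
# classifier calls them cats, then tagged as "cat" for a fresh model.
import torch
import torch.nn.functional as F

CAT = torch.tensor([1])  # hypothetical class index for "cat"

def nudge_towards_cat(old_model, image, steps=40, step_size=0.01):
    """Perturb a dog image until the old classifier sees a cat."""
    x = image.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = F.cross_entropy(old_model(x), CAT)
        loss.backward()
        # Descend the loss of the *target* label: the image still looks
        # like a dog to us, but picks up "cat" patterns the model relies on.
        x = (x - step_size * x.grad.sign()).clamp(0, 1).detach()
    return x

# relabeled = [(nudge_towards_cat(old_model, img), CAT) for img in dog_images]
# ... then train `new_model` from scratch on `relabeled` and test it on
# genuine cat photos; in the experiment it still classifies them correctly.
```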

The researchers hypothesized that every dataset contains two kinds of correlations: patterns that actually correlate with the meaning of the data, such as whiskers in cat images or fur coloring in panda images, and patterns that exist in the training data but do not generalize to other contexts. It is these latter "misleading" correlations, as we might call them, that adversarial attacks exploit. A recognition system trained to pick up "misleading" patterns finds them in the noise and decides it is looking at a monkey.

This tells us that if we want to eliminate the risk of adversarial attacks, we need to change how we train our models. Currently we let the neural network pick whatever correlations it wants to use to identify objects in an image. As a result, we have no control over which correlations it finds, real or misleading. If, instead, we trained our models to remember only the real patterns, those tied to meaningful pixels, then in theory we could produce deep learning systems that cannot be confused in this way.
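
One common way to push a model toward the robust, "real" correlations is adversarial training: at every step the network is trained on worst-case perturbed versions of its inputs rather than on the clean ones. Below is a simplified sketch under that assumption; `model`, `train_loader` and `optimizer` are placeholders, and the single-step perturbation stands in for the stronger multi-step attacks typically used.

```python
# Simplified adversarial-training epoch: craft a perturbed batch with the
# current model, then update the weights on that perturbed batch only.
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, epsilon=0.03):
    model.train()
    for images, labels in train_loader:
        # Craft perturbed inputs against the current model (single-step
        # FGSM here, as in the earlier sketch).
        images = images.clone().detach().requires_grad_(True)
        F.cross_entropy(model(images), labels).backward()
        adv = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()

        # Train on the adversarial images, so the model cannot lean on the
        # fragile, easily flipped correlations.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(adv), labels)
        loss.backward()
        optimizer.step()
```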

When the scientists tested this idea by training a model only on the real correlations, they did reduce its vulnerability: it could be manipulated only 50% of the time, whereas a model trained on both real and misleading correlations could be manipulated 95% of the time.

In short, you can defend against adversarial attacks. But we need more research to eliminate them completely.

Ilya Khel