Artificial intelligence seems to be better at language tests than humans

Successor models such as RoBERTa continue to optimize these parameters. In the case of RoBERTa, researchers from Facebook and the University of Washington gave the system more time for pre-training, fed it more data and longer text sequences, and dropped the next-sentence prediction task that was part of BERT's training but had turned out to be unhelpful. Finally, they made the fill-in-the-blank exercises harder. The result? First place on GLUE, at least for a short time.
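To make the fill-in-the-blank objective concrete, the sketch below asks a released RoBERTa checkpoint for the most likely words behind a masked position. It assumes the Hugging Face `transformers` library and the public "roberta-base" model; the example sentence is an invented illustration, not from the paper.

```python
# Minimal sketch of the masked-language-modeling ("fill in the blanks") task
# that BERT and RoBERTa are pre-trained on, using the Hugging Face pipeline API.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its placeholder token; BERT-style models use "[MASK]".
for prediction in unmasker("Scientific studies have shown that smoking causes <mask>."):
    print(f"{prediction['token_str']:>12}  (score: {prediction['score']:.3f})")
```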

Six weeks later, researchers from Microsoft and the University of Maryland continued the game and took first place with an improved version of RoBERTa. But that did not last either. In the meantime, two other approaches that iron out some of BERT's weaknesses have taken the top spots: XLNet, a collaboration between Carnegie Mellon University and the Google AI Brain team, and ERNIE 2.0, which originated in the research laboratories of the Chinese search engine giant Baidu.

"I don't even look at these new publications anymore, I find them extremely boring" (Tal Linzen)

But just as someone who tinkers with their baking recipes does not automatically understand the chemistry behind baking, the continuous optimization of pre-training does not inevitably lead to deeper insights into language processing. "I'll be completely honest with you: I don't even look at these new publications," says Johns Hopkins researcher Linzen. "I find them extremely boring." For him, the scientific puzzle is rather to find out in what sense BERT and its successors really understand language, or whether they "just pick up a few strange tricks that happen to do well on our current test procedures". In other words: BERT is doing a lot right. But perhaps for the wrong reasons?

In July 2019, two researchers from National Cheng Kung University in Taiwan, Timothy Niven and Hung-Yu Kao, reported a sensational result. They had applied BERT to a test of argument comprehension, in which the task is to identify the unspoken premise on which an argument's validity rests. If the claim reads "It is true that smoking causes cancer, because scientific studies have shown this", then it only holds if the premise "Scientific studies are trustworthy" applies. The alternative "Scientific studies are expensive" may well be true, but it makes no sense in this context. Without practice, people score an average of 80 out of 100 points on this test. BERT scored 77 right from the start!
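Such warrant selection can be cast as a scoring problem for a BERT-style classifier: feed the claim together with each candidate premise and pick the one with the higher score. The sketch below shows that framing only; it loads the generic "bert-base-uncased" checkpoint, whose classification head is randomly initialized here, so a real system like Niven and Kao's would first be fine-tuned on the task's training pairs.

```python
# Schematic framing of warrant selection with a BERT-style classifier.
# Note: the classification head of "bert-base-uncased" is untrained here,
# so the scores are meaningless until the model is fine-tuned on the task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
model.eval()

claim = ("It is true that smoking causes cancer, "
         "because scientific studies have shown this.")
warrants = ["Scientific studies are trustworthy.", "Scientific studies are expensive."]

with torch.no_grad():
    # Score each (claim, candidate premise) pair and keep the higher one.
    scores = [
        model(**tokenizer(claim, warrant, return_tensors="pt")).logits.item()
        for warrant in warrants
    ]

print("Chosen premise:", warrants[scores.index(max(scores))])
```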

Is clever BERT just a Clever Hans?

In fact, it would be a sensation if BERT could endow neural networks not only with language understanding but also with the ability to draw logical conclusions. But the authors of the study themselves suspected that the explanation was more banal: BERT could have latched onto superficial patterns in the wording of the candidate premises. When they combed through their training data for evidence of this, they discovered numerous such unintended clues. Just one example: if BERT had learned nothing more than to always select the premise containing the word "not", it would have achieved a hit rate of 61 percent. When the researchers removed these shortcuts, BERT's score dropped from 77 to 53, no better than someone deciding every answer by flipping a coin. An article in "The Gradient", the research magazine of the Stanford Artificial Intelligence Laboratory, praised Niven and Kao's mistrust of their supposedly sensational result in the logic test and drew a parallel to Clever Hans, the horse with the alleged gift for arithmetic. More skepticism, it argued, would do the whole field good.
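A rough idea of how such a cue analysis works is sketched below: measure how often the blunt rule "pick the candidate premise that contains the word 'not'" happens to agree with the correct answer. The function and the placeholder example are purely illustrative; the published analysis was run over the task's full data set.

```python
# Illustrative cue-analysis sketch: how well does the rule "always pick the
# candidate premise containing 'not'" agree with the correct answers?
def not_heuristic_accuracy(examples):
    """examples: iterable of (warrant0, warrant1, correct_index) tuples."""
    hits, total = 0, 0
    for warrant0, warrant1, correct_index in examples:
        contains_not = [" not " in f" {w.lower()} " for w in (warrant0, warrant1)]
        if contains_not[0] == contains_not[1]:
            continue  # cue is uninformative when both or neither contain "not"
        total += 1
        hits += int(contains_not.index(True) == correct_index)
    return hits / total if total else float("nan")

examples = [  # placeholder rows; substitute the real warrant pairs here
    ("Scientific studies are trustworthy.", "Scientific studies are not reliable.", 0),
]
print(f"'not'-cue accuracy: {not_heuristic_accuracy(examples):.2f}")
```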

It cannot be ruled out that a similar phenomenon lies behind BERT's superhuman performance on GLUE. In a recent study, Tal Linzen and colleagues collected evidence for precisely this suspicion. They even developed an alternative data set for testing the networks, designed to expose such shortcuts in a targeted way: the Heuristic Analysis for Natural-Language-Inference Systems, or HANS for short.
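For readers who want to look at HANS themselves, the sketch below loads and inspects it, assuming the copy published on the Hugging Face Hub under the dataset name "hans" and the `datasets` library; the field names follow that copy.

```python
# Peek at the HANS evaluation set via the Hugging Face `datasets` library.
from datasets import load_dataset

hans = load_dataset("hans", split="validation")
example = hans[0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])      # 0 = entailment, 1 = non-entailment
print(example["heuristic"])  # which shortcut (e.g. lexical overlap) the example targets
```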

"We have a model that has really learned something essential about language" (Sam Bowman)

Bowman, too, concedes that the training data behind the GLUE test have their weaknesses. Data sets built by humans are always skewed in one way or another. "There isn't one lazy trick that you can use to do everything in GLUE," says Bowman, but they still offer enough weak points for powerful learning systems to exploit, without their users noticing anything. When BERT is trained on more carefully designed source data, its GLUE score drops noticeably, as computer scientist Yejin Choi of the University of Washington and the Allen Institute has observed.

BERT simply does not have a comprehensive understanding of the English language, says Bowman. Still, he does not believe that the method's developers have built on sand: "We have a model that has really learned something essential about language."

Better evaluation procedures could help back up Bowman's assumption with measurable results. In mid-2019, for example, he and colleagues presented SuperGLUE, a benchmark deliberately made difficult for BERT-based systems. And indeed, humans are currently still in the lead on it, even if only by a small margin.
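As a pointer for trying SuperGLUE firsthand, the sketch below inspects one of its tasks (BoolQ, yes/no questions about a short passage), assuming the "super_glue" dataset published on the Hugging Face Hub.

```python
# Look at one SuperGLUE task (BoolQ) via the Hugging Face `datasets` library.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq", split="validation")
example = boolq[0]
print(example["passage"][:200], "...")
print("Question:", example["question"])
print("Answer:", "yes" if example["label"] == 1 else "no")
```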

But many researchers consider it questionable whether there will ever be a test that convinces us beyond doubt that a machine really possesses artificial intelligence. Just think of chess: "Chess always looked like a really good intelligence test. Until we figured out how to program a chess computer," says Bowman.