Better Language Model Benchmarks Needed


Recent language modelling research published by OpenAI has gained quite a bit of coverage. The stir has largely been around its decision to withhold the full models and code used in the research, because it claims they could be used for malicious purposes (e.g. fake news generation, simulating a person's behaviour online). The backlash online has even suggested that these may have been withheld precisely to generate more noise. The potential for malicious use has certainly been what many have focused on, as reflected in this article on the BBC.

What has been mostly overlooked is the research itself. The researchers built a language model (GPT-2) from 40GB of text data collected from the web and claim that the model has achieved state of the art results on a wide variety of language tasks, including text generation, question answering, translation, reading comprehension and summarisation. They state this is particularly significant because, unlike other models which have been trained using domain-specific supervised learning methods, the GPT-2 model has been trained on a general corpus with no knowledge of the language tasks.

This may indeed be impressive, but if you consider how the language model achieves these results, it reflects very badly on the current state of the art in Natural Language Processing and on the current set of tests that researchers use to track progress. The OpenAI model has been built simply to predict the most likely next word given the words that precede it. There is no sense in which it is reading language, no understanding, nothing. You could consider it nothing more than a parlour trick, albeit one that demonstrates impressive results, which are probably due to the sheer scale of text consumed to build the models.
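
To make "predict the next word given the words that precede it" concrete, here is a minimal sketch in Python. It is a toy stand-in, not OpenAI's method: GPT-2 is a large neural network that conditions on a long stretch of preceding text, whereas this example only counts which word most often follows a single previous word. The corpus and the function names (train_bigram_model, predict_next_word) are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follower_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts


def predict_next_word(follower_counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = follower_counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, standing in for 40GB of web text.
toy_corpus = "the cat sat on the mat . the cat ate the fish . the cat slept ."
model = train_bigram_model(toy_corpus)

print(predict_next_word(model, "the"))  # "cat" follows "the" most often
```

Nothing in this sketch knows what a cat is. It only reproduces the statistics of the text it was given, which is the point being made above, even though the real model is vastly more sophisticated in how it gathers and uses those statistics.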

At Glass we are also using the web as our corpus, but we are building language and meaning into our models from the start to deliver language understanding with the support of common sense knowledge. We believe this is essential if progress is going to be made towards robust language understanding. We have begun by modelling the language that businesses use to talk about themselves, to see what insights our AI can gain about market sectors and the economy at large. Over time we hope these models will grow to cover many other domains.

What the research from OpenAI does highlight is that the community needs better ways of testing language understanding if we are going to make any significant progress towards getting machines to better understand language. For that, we thank them.