At Glass, we have developed a new system for deriving large-scale social and economic insights from the web and other sources. Our AI can understand written language.
In today’s post, we’re going to walk you through a formal experiment in data science we recently completed, where we looked at a pressing and surprisingly underreported real-world problem: gender inequality in the UK workplace.
We’ll explain why we took on the challenge of trying to read the entire .UK domain to do this, and some of the issues we encountered on the way. To our knowledge, this is the first systematic analysis at this scale, and our results make for disquieting reading for business leaders (both great and small) across the UK.
How our work is different
The internet is big, and it keeps getting bigger. Just as an astronomer might be interested in the formation of star systems, at Glass we’re interested in understanding large-scale activity, as revealed by the traces it leaves across the world's constantly expanding universe of published content. This includes the ability to track information on the move as it changes shape: for example, the dynamics of a news topic as it unfolds over time. How did it get there? Where’s it going next? Who’s talking about what, where?
Previous related studies created for economists, policy-makers, or business analysts have tended to underuse or even ignore the web as a data source, typically looking in detail at only a limited number of sectors of the economy, examining a small slice of geography, or conducting manual (and expensive) surveys. Worse, given a small data set, data scientists have no choice but to extrapolate and rely on small-sample statistics.
This is fine... if you want a big, blurry picture. But what if you want more pixels?
Greater resolution offers you a much finer view of the data: that’s why we needed to read over 200 million web pages just for this work. So Glass is a new kind of lens, and one we hope will make a real difference.
Lastly, we need to make sense of what we read from the web. Understanding so-called ‘natural language’ (the kind we humans use, as opposed to the kind computers use) is a hard problem. Humans don’t just consume or create a stream of unambiguous symbols. Words are both slippery and locally sensitive: we understand what they really mean from context, from the words around them. Even the absence of certain words can signal a completely different context, and hence a different meaning. So our challenge is not just to find a bunch of keywords, but to make sense of how language actually works.
Time for the experiment
In 2018, if you are female you can (according to official statistics) expect lower pay, worse prospects of promotion, and a greater likelihood of working in one kind of industry over another. But what exactly does this look like, and how does it manifest itself?
What can a new artificial intelligence (AI) pointed at websites tell us about the problem that existing methods can’t? How accurate would it be? And could it offer any new insight or uncover more detail on this important question?
We trained our AI on the entire .UK domain, and read the genders of 2.3m people and the positions they held in 150,000 organisations, across 108 industry sectors. We filtered out holding pages, low content pages, social media sites, retailers, blogs and service-oriented sites, such as search engines, because we wanted to know how UK businesses and organisations depicted themselves. Remember that these organisations have no legal obligations to present their staff with balance, and no expectation of being held to account for their choices: in that sense, the web is an unselfconscious snapshot of the organisations in it, trying to look their best.
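To make the filtering step concrete, here is a minimal sketch of how pages might be screened before analysis. It is an illustration only, not our production pipeline: the `Page` structure, the category labels, and the `MIN_WORDS` cut-off for a "low content" page are all assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical category labels for page types we exclude (assumed names).
EXCLUDED_CATEGORIES = {
    "holding_page",
    "social_media",
    "retailer",
    "blog",
    "search_engine",
}

# Assumed cut-off: pages with fewer words than this count as "low content".
MIN_WORDS = 50


@dataclass
class Page:
    url: str
    category: str  # assumed to come from an upstream page classifier
    text: str


def keep_page(page: Page) -> bool:
    """Return True if the page should stay in the corpus."""
    if page.category in EXCLUDED_CATEGORIES:
        return False
    return len(page.text.split()) >= MIN_WORDS


def filter_pages(pages):
    """Keep only pages that pass the category and length checks."""
    return [p for p in pages if keep_page(p)]
```

In practice the category of a page would itself have to be inferred, which is a classification problem in its own right; the sketch simply assumes that label is already available.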
Some sectors of the economy are ‘dark’, barely present online: the tobacco industry, for example. Others create a disproportionate amount of noise: the media and marketing industries, for example. So inevitably there are skews in the data to be accounted for.
But one remarkable result is that our figures precisely match the ONS (Office for National Statistics) data at the top level, and underneath, in finer resolution, we see the full picture: massive divergence between the sexes in certain roles and industry sectors. In effect, we see gender segregation at work, with only 5% of the hundred-odd sectors we surveyed showing balanced workforces.
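As a rough sketch of how a figure like "the share of sectors with balanced workforces" could be computed from extracted records: the code below is illustrative only. The input format of `(sector, gender)` pairs and, in particular, the 40% to 60% band used to define "balanced" are assumptions for this example, not the definition used in our analysis.

```python
from collections import defaultdict

# Assumed definition of "balanced": female share between 40% and 60%.
BALANCED_LOW = 0.40
BALANCED_HIGH = 0.60


def female_share_by_sector(records):
    """Compute the female share of people per sector.

    records: iterable of (sector, gender) pairs, gender being "F" or "M".
    """
    counts = defaultdict(lambda: [0, 0])  # sector -> [female, total]
    for sector, gender in records:
        counts[sector][1] += 1
        if gender == "F":
            counts[sector][0] += 1
    return {sector: f / total for sector, (f, total) in counts.items()}


def balanced_fraction(shares, low=BALANCED_LOW, high=BALANCED_HIGH):
    """Return the fraction of sectors whose female share is within the band."""
    if not shares:
        return 0.0
    balanced = [s for s, share in shares.items() if low <= share <= high]
    return len(balanced) / len(shares)
```

The result depends heavily on where the band is drawn, so any headline percentage should be read alongside the definition of "balanced" that produced it.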
Here’s a snapshot: