Words are not numbers: the barrier that stops deep learning from understanding language.


On the back of developments in deep learning and general progress in Artificial Intelligence, there has been impressive progress in Natural Language Processing (NLP) over the past few years. From the popularity of word embedding techniques such as Word2Vec and GloVe, to recent developments in transfer learning using pre-trained models like Allen AI’s ELMo, Google AI’s BERT and OpenAI’s GPT-2, not a week goes by without a new set of state-of-the-art numbers being reported.

Fundamental to these NLP approaches based on deep learning is the step of turning words into a form that can be understood by the neural networks that are being trained to complete the task at hand. This means transforming words into a vector space, a set of numbers. Unfortunately, these statistical language representations are unable to simultaneously capture both the precision and flexibility with which language is used. Rather than enabling progress towards understanding language, it seems words are imprisoned by these vector spaces and are not able to break free from the constraints of their training set (however big it is!).

Counting words

Early attempts at NLP focused on hand-crafting or inferring syntactic rules that allowed language to be parsed. These approaches proved brittle and incomplete, and didn’t provide a clear path towards general language understanding. With the growing importance of information retrieval, and search in particular, statistical approaches came into play, starting with simple techniques such as bag of words and tf-idf. NLP systems have followed suit, counting words in ever more sophisticated ways.
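As an illustration of the kind of counting involved, here is a minimal tf-idf sketch in plain Python. This is a toy version of what libraries such as scikit-learn provide; real implementations add smoothing and normalisation:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy tf-idf: docs is a list of tokenised documents (lists of words)."""
    n = len(docs)
    # document frequency: how many documents contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency scaled by inverse document frequency
        weights.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
w = tf_idf(docs)
# "the" occurs in two of the three documents, so it is down-weighted
# relative to the rarer, more discriminative "cat"
```

Common words like “the” score low because they appear in most documents, which is exactly the property that made tf-idf useful for early search ranking.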

Although not an NLP system as such, it is illustrative to look at Google’s search as a baseline for statistical approaches to language. Using click-through rate as a proxy for match quality, usage stats¹ show that 31% of users click on the first result, and 75% of all clicks are in the top 3 results. Google seems to be doing a great job of selecting the right result for specific queries. But, you will know from your own searches that, as soon as you look beyond the first page of results, the matches become much weaker (less than 1% of results are ever clicked after the first page). A recent study² also showed that almost half of all searches now result in no click-through. This may be because the answer the searcher was looking for is available in the snippets or structured search results that Google displays, but it also covers cases where the searcher has to keep reformulating their search question to find an acceptable answer, or where a suitable answer just wasn’t matched. Matches beyond the structured results that Google provides (e.g. Wikipedia, places, the answer box, “people also ask”) look increasingly poor.

That’s all very well for search, but how does this translate into an NLP system? Google is only dealing with one simple question, matching to the best result, and mostly finds the right answer. Unfortunately, ‘mostly’ is not good enough when you are reading whole articles or trying to carry out a conversation. Any single error can be amplified by any text or interpretation that follows. Even with very high rates of accuracy (way beyond the current state of the art), ‘nearly right’ can quickly turn into ‘rarely right’ as a conversation proceeds or text is (mis)understood.



With its ever-increasing accuracy, one language task that many people think is solved is machine translation³. It was one of the first areas where NLP from the lab was let loose on the public at scale. However, taking a simple extract from Dr. Seuss quickly highlights the challenges that remain. Dr. Seuss is great (of course!) because he uses nice simple words and sentence structures. We experimented with Google Translate⁴, which started to use Neural Machine Translation (NMT) in 2016, translating from English through each of the top 10 languages by three measures (most spoken languages, native languages of internet users, and languages of online content⁵) and back into English, to see how close the results were to the original text.


While we might not expect to get the poetry of the original, all the translations make some basic missteps along the way. All versions seem to have problems telling the difference between feet, shoes and legs, and have different opinions on whether you, I or they are on their own! Comparing the different language types, the translation through the top 10 languages by online content does the best (49% match based on straight word count). It even retains a bit of the rhyming in the last sentences. This is to be expected because these languages will have the most online content from which the NMT can be trained. The user language (34%) and spoken language (32%) translations do much worse, making clearly nonsensical errors that the content-based results largely avoid (“one foot on one foot”, “legs in your legs”), although Dr. Seuss might have been proud of those!

This example used a very short text with simple words and sentence structure. Imagine reading and translating full texts that use the full range of language, where a translation choice might depend on a statement made much earlier in the text. At a glance, translations look ‘good enough’, but dig deeper and they leave a lot to be desired.

Understanding words

As we have said, fundamental to the popular deep learning approaches to NLP is the translation of words into a vector space. These vector spaces are representations of the meaning of a word by putting it into the context of other similar words. One of the most famous examples of this in action — which seems to demonstrate semantic knowledge — comes from embeddings trained using Word2Vec⁶:

vector(“King”) − vector(“Man”) + vector(“Woman”) ≈ vector(“Queen”)
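The arithmetic behind this analogy can be sketched with toy vectors. The numbers below are invented purely for illustration; real embeddings have hundreds of dimensions learned from large corpora:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hand-made 3-d "embeddings": the dimensions loosely encode royalty,
# maleness and fruitiness. Purely illustrative values.
emb = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.1, 0.2],
    "apple": [0.0, 0.4, 0.9],
}

# king - man + woman, computed element-wise
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# nearest remaining word to the target vector
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
```

With these toy values the nearest neighbour of king − man + woman is “queen”, mirroring the famous Word2Vec result.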

Word embeddings are generated differently depending on the approach, but fundamentally they are all trained by trying to predict a word from the context of the surrounding words in an existing body of text.

Jane was phishing for [ — — — ] account passwords // George was fishing from the river [ — — — ]

A likely word for both these sentences is “bank”. However, this example illustrates a problem with early approaches to word embeddings: a word could only have a single representation. In these cases, we want to distinguish between a bank that looks after money and a bank at the side of a river. This was addressed with the introduction of contextual word embeddings, such as ELMo⁷, where both the word and its context are held in the vector representation. This change, with better inputs, led to a step-change improvement in the state of the art across a broad range of NLP benchmarks.


If contextual word embeddings have improved the state of the art, then why are we still having trouble with simple translations? We can get a clue by examining the ELMo contextual word embeddings for some of these translations. These pre-trained models have been built from similar content (i.e. open web sources) to that used by Google to build their NMT models, so they may exhibit similar biases.

  • (A) The embeddings for the translations of “You have brains in your head” show that all the embeddings equate “brain” and “brains” to the lump of material in your head (top left area). There is no recognition of the other meaning of brains to mean intelligence, although if you try “brainy” then this moves close to smart or intelligence (bottom right area).

  • (B) For translations of “You have feet in your shoes” you can see that in the context of “in/on your” that feet, shoes and legs are very closely aligned (top right area). But “feet”, “shoes” and “legs” are seen as quite different attributes when viewed in other contexts.

Both of these embeddings reinforce the interpretations that the various translations have made of the original text. So while contextual word embeddings have enabled across-the-board improvements in NLP tasks, they are still extremely limited in representing the meaning of words and constrained by the data that was used to train them.

So, building very large models by parsing large chunks of content (e.g. the large ELMo model has 93.6M parameters and was built on 5.5B tokens sourced from the web) doesn’t seem to have captured sufficient detail to accurately understand the meaning of words. Where can the definitions of words be captured that will allow accurate interpretation and generalisation? The same place a human would go: the dictionary.


Clearly the dictionary contains the answers to the multiple meanings of brains and the difference between legs, feet and shoes. Building a model that represents this knowledge can offer a similar capability to word embeddings, understanding the relationships between words and the contexts in which they are used, but without losing accuracy and without being limited by the training set. This approach does not fit with current deep learning-based strategies, as the model is necessarily symbolic rather than statistical. Although symbolic, these models are distinct from early symbolic attempts at NLP, which focused on syntax. These language models focus on semantics: capturing the meaning of words and their roles within the text, and combining them to begin to understand language.


There may be more gains to be made and new benchmarks to be set as ever more complex neural language models are created and more training data is read; but just because there is a lot of data available doesn’t mean we should train on it, and just because the compute resource is there doesn’t mean we should consume it. There already seem to be diminishing returns: models and training data are scaled up massively but lead only to relatively small improvements in precision⁸. We expect this to continue as accuracy becomes an intractable problem and small errors lead to bigger ones as more text needs to be understood or as a conversation proceeds⁹. With statistical models, generalisation to content beyond the training set will always lose precision.

The answer to this loss of precision is bootstrapping language understanding by building precise representations of small languages and using these to understand larger texts. Much as you learnt words as a child by interacting with the world: discovering the meaning of a word and slowly learning how to combine words to make yourself understood. At glass.ai, we have been building small domain-specific language models to perform a wide variety of social, economic and business research from open web content. These required applying the models across core NLP tasks, such as text classification, word disambiguation, entity recognition, information extraction, semantic role labelling, entity linking, and sentiment analysis, and they have been used to explore, categorise, summarise, and map huge chunks of the open web. This large-scale consumption of text has demonstrated that the precision of these models means they can generalise to, and ultimately understand, very large bodies of unseen content.


[1]: Brian Dean. Here’s What We Learned About Organic Click Through Rate. https://backlinko.com/google-ctr-stats. 2019.

[2]: George Nguyen. 49% of all Google searches are no-click, study finds. https://searchengineland.com/49-of-all-google-searches-are-no-click-study-finds-318426. 2019.

[3]: Cade Metz. An Infusion of AI Makes Google Translate More Powerful Than Ever. https://www.wired.com/2016/09/google-claims-ai-breakthrough-machine-translation/. 2016.

[4]: Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://ai.google/research/pubs/pub45610. 2016.

[5]: Thomas Devlin. What Are The Most-Used Languages On The Internet? https://www.babbel.com/en/magazine/internet-language/. 2019.

[6]: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781. 2013.

[7]: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. https://arxiv.org/abs/1802.05365. 2018.

[8]: Alec Radford et al. Better Language Models and Their Implications. https://openai.com/blog/better-language-models/. 2019.

[9]: Christopher D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? https://dl.acm.org/citation.cfm?id=1964816. 2011.

Creating a granular picture of regional economies.


If you are a policymaker who needs to develop the industrial strategy for your region, or a company needing to know the best place to locate your business for growth, where are you going to find the right level of knowledge to make those decisions? Traditionally you might turn to official statistics or business surveys, but these tend to be at least a year out of date and at an aggregate level that doesn’t really help with accurate decision making.

To this end, in an article recently published by the Economic Statistics Centre of Excellence (ESCoE), Nesta has demonstrated the use of novel data sources to provide a level of regional economic knowledge previously not available. Using the descriptions and locations of UK businesses mined from the open web by Glass.ai and natural language understanding technology, Nesta has studied the link between the industrial ecosystem of the UK regions and the number of businesses within those regions engaged in emergent activities. Natural language processing techniques have been used to extract topics describing the activities being performed by the businesses, to highlight the specialisms present across the regions of the UK, and to provide the basis for a novel approach to examining economic complexity. The study compares the density of an identified subset of emerging activities with the economic complexity of the regions to suggest sources of guidance for policy or decision making.

The blog summarising the study with links to the full article is available here.

Using AI and web data to understand the drivers of productivity.


The West Yorkshire Combined Authority (WYCA) wished to explore how open web data and machine learning techniques could enhance official business data to help understand the drivers of productivity at companies. Funding for the project came via the BEIS Business Basics Programme, which is part of the Industrial Strategy, and the project was delivered in partnership with Innovate UK and the Innovation Growth Lab at Nesta.

The Glass.ai engine has read and mapped a very large part of the UK economy based on its web presence. It regularly reads the websites of 1.5 million UK businesses (200M web pages) across sectors and geographies. It reads and structures all the text on the websites. Working with WYCA, a sample of this data was used to investigate indicators of high productivity companies.


WYCA supplied a list of 3,491 companies in the WYCA region. Of these companies, 2,929 had web addresses assigned and so were candidates for inclusion in the analysis. WYCA also supplied productivity data for 2,856 companies.

The list was compared with the Glass.ai dataset and we were able to match 2,491 companies. The companies we could not match were due to duplicates and dead sites. Duplicates included groupings of related companies, for example group company entities, holding entities and operational company entities.

The data collected from the web includes descriptions, social media accounts, addresses and emails collected from the organisation website, as well as counts of related entities (e.g. news, people) found on the site. From social media and the open web, there are also web presence indicators. The descriptions and other text found on the organisation websites were also used to predict the sector(s) in which the company operates and the main topics related to the company’s activities.

The full topic list was used to investigate relationships to indicators of high productivity.


We used the data collected from the web to find indicators of high productivity. For this analysis, we used those companies in Glass.ai that had corresponding productivity data from WYCA. This totalled 1,360 companies. It is worth highlighting that although there may be a correlation between the indicators and productivity, this does not imply causality.

To perform this analysis we identified sets of high productivity and low productivity companies based on the ranked values from WYCA. We first looked at whether there was a bias in the productivity sets based on the broad industry sectors provided by WYCA.

Average productivity rank by industry sector

This chart shows that there isn’t an even distribution of organisations over the productivity ranking. For example, non-profit organisations often appear with low productivity. To remove this imbalance we selected the top and bottom 10 organisations by rank within each industry sector to form the sets of high and low productivity companies. This ensured that particular industry characteristics — rather than general organisation characteristics — wouldn’t dominate the chosen sets of organisations. Each set, high and low productivity companies, contained 140 organisations.
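The selection step above can be sketched as follows. This is a hypothetical illustration: the field names and the rank convention (rank 1 = most productive) are assumptions, not details from the project:

```python
from collections import defaultdict

def top_bottom_per_sector(companies, n=10):
    """companies: iterable of (name, sector, productivity_rank) tuples.
    Returns (high, low): the n best- and n worst-ranked companies within
    each sector, so that no single sector dominates either set."""
    by_sector = defaultdict(list)
    for name, sector, rank in companies:
        by_sector[sector].append((rank, name))
    high, low = set(), set()
    for members in by_sector.values():
        members.sort()                       # rank 1 = most productive
        high.update(name for _, name in members[:n])
        low.update(name for _, name in members[-n:])
    return high, low

# Toy example: two sectors, taking the single best and worst from each
sample = [("a1", "A", 1), ("a2", "A", 2), ("a3", "A", 3), ("b1", "B", 1)]
high, low = top_bottom_per_sector(sample, n=1)
```

With 14 sectors and n=10 this yields the 140 organisations per set described above; note that a sector with fewer than 2n companies would contribute to both sets, so some filtering would be needed in practice.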

Using these groupings we examined the various characteristics collected from the web. The following charts show the ratio of the share of results in the high (blue) and low (orange) productivity groups of companies. For example, they show a positive correlation between published news and high productivity, but a negative correlation with LinkedIn references.

Share of high and low productivity companies by characteristic

There were specific themes that WYCA also wanted to explore in the data: Export, Innovation, Awards, Patents and Certification. For each of these, we mapped the theme onto a set of topics.

  • Export: import, export, trade, import taxes, import duties, imported goods, direct imports, international business, Chinese market, Russian market, French market, Japanese market, Australian market, American market, German market, European market, foreign markets, export cargo, export-import, export markets, import-export trade, export licensing, export products, international market, overseas markets, global export, international trade, major world markets, select international markets, key international markets, major international markets, Latin American markets, foreign markets, overseas markets, international market, Russian market, Chinese market, Japanese market, Australian market, European market, American market, Italian market, global expansion plans, international expansion plan, global growth, international growth, international expansion, global expansion, European expansion

  • Innovation: technological change, digital transformation, digital revolution, technological revolution, technological evolution, new emerging technologies, emerging technologies, disruptive innovation, innovation, product innovation, process innovation, industrial innovation, innovation system, innovation management, open innovation, research development

  • Awards: numerous accolades, prestigious industry awards, numerous industry awards, multiple awards, numerous awards, award nominations, national award, annual awards, industry awards, special award, awards

  • Patents: patent, patents, patented, patent pending

  • Certification: ISO, BSI

These groupings were then counted in the high and low productivity sets of companies to highlight any correlation with productivity. They show a positive relationship between export and innovation and high productivity, but, interestingly, a relationship between patents and low productivity — maybe to do with the cost of patent development or the characteristics of the industries in which these were found.
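The counting behind these comparisons can be sketched like this. The company topic lists below are invented for illustration; the real pipeline works over the topics extracted from company websites:

```python
def theme_share(theme_topics, group_topics):
    """Fraction of companies in a group whose extracted topics mention
    at least one keyword from the theme's topic set."""
    theme = set(theme_topics)
    hits = sum(1 for topics in group_topics if theme & set(topics))
    return hits / len(group_topics)

# The Patents theme from the list above, with invented company topics
patents = ["patent", "patents", "patented", "patent pending"]
high_group = [["innovation", "export"], ["patents", "trade"]]
low_group = [["patent", "loans"], ["patented"], ["training"]]
# comparing theme_share across the two groups highlights themes that
# skew towards high- or low-productivity companies
```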

Share of high and low productivity companies by topic

The final part of the analysis examined all the topics that appeared in the high and low productivity sets of companies and clustered them into groupings.

The clustering used was a form of hierarchical agglomerative clustering based on a bottom-up approach. At first, each keyword is treated as a single-entity cluster. Then, iteratively, clusters are merged in pairs based on an ad-hoc metric measuring the semantic similarity between the keywords in the two clusters. The process continues until the similarity between any two clusters is lower than a given threshold. The remaining clusters are considered to be semantically coherent sets of keywords.
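A minimal sketch of this bottom-up merging is below. The study used an ad-hoc semantic metric; as a stand-in here we use Jaccard overlap of character trigrams, which only captures lexical similarity, and the threshold is arbitrary:

```python
def trigrams(word):
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def sim(a, b):
    """Jaccard similarity over character trigrams (a lexical stand-in
    for the semantic metric used in the study)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def cluster_sim(c1, c2):
    """Average pairwise similarity between two clusters of keywords."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(keywords, threshold=0.2):
    """Repeatedly merge the closest pair of clusters until no pair
    is more similar than the threshold."""
    clusters = [[k] for k in keywords]
    while len(clusters) > 1:
        best, bi, bj = max(
            (cluster_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        if best < threshold:
            break  # nothing similar enough left to merge
        clusters[bi] += clusters.pop(bj)  # bj > bi, so bi stays valid
    return clusters

groups = agglomerate(["energy efficiency", "energy industry", "patent law"])
```

Here the two energy keywords merge into one cluster while “patent law” remains on its own, mirroring how the process leaves coherent keyword groups once no pair clears the threshold.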

Using this method 73 clusters of topics were identified. As an example, the two most semantically coherent clusters were as follows.

  • cluster_0: auditing,business ethics,continual improvement process,corporate governance,corporate social responsibility,environmental health,environmental impact,environmental impact assessment,environmental management system,environmental resource management,environmental responsibility,good practice,governance,health safety,impact assessment,information systems,integrated management system,management facilities,management systems,procurement,product development,professional development,project delivery,project management,quality assurance,quality control,quality management,quality systems,resource management,risk assessment,risk management framework,safety,safety management,safety management systems,strategic management,strategic planning,strategy,successful delivery,supply chain,sustainability,sustainability performance,sustainable development,transformation programmes

  • cluster_1: carbon,carbon emissions,district heating,district heating schemes,energy,energy consumption,energy efficiency,energy industry,energy scheme,energy wastage,fire protection,fire safety,generating capacity,heat network,heating,heating system,meter reading,onshore wind,renewable energy,renewable energy companies,renewable generation,smart meter,solar farm,solar project,sustainable energy,thermal efficiency,turbines,wind,wind farms

For each cluster, we then compared the cluster’s share in the high and low productivity sets to determine which were better indicators of high productivity. We looked at the individual topics and at the clusters as a whole. This led to the following sets of topics being highlighted as high productivity indicators.

  • fleet operations,fuel efficiency,fuel usage,tracking,tracking solutions,tracking system,tracking tools,vehicle security,vehicle tracking solutions,vehicle tracking system

  • European patent,intellectual property,patent attorneys,patent law,patent offices,regulatory agencies,regulatory compliance,regulatory expertise,regulatory strategy,trademark

  • Europe, global, globes, world

  • carbon emissions,district heating,district heating schemes,energy,energy consumption,energy efficiency,energy industry,energy scheme,energy wastage,fire protection,fire safety,generating capacity,heat network,heating,heating system,meter reading,onshore wind,renewable energy,renewable energy companies,renewable generation,smart meter,solar farm,solar project,sustainable energy,thermal efficiency,turbines,wind,wind farms

  • capital,cash,credit,credit risk,equity funds,financial intermediary,funds,investee companies,investment,investment opportunities,loan fund,loans,venture capital,working capital,working capital loans

  • city, village

  • leaders, leadership, pioneers

  • oil,power,transportation,water

  • legislation,member state,policy,regulation,rules,use policy

  • business partner,business results,competitive advantage,customer experience,customer focus,customer relationship management,customer satisfaction,customer service,flexible approach,forward thinking,high quality facilities,higher level,operating cost,operational efficiency,partner companies,personal service,product performance,product quality,professional team,proven track record,quality customer,senior management,service level,staff members,teams,technical excellence,track record

  • community, organizations, partnerships, works

  • qualifications, training

  • creative business, entrepreneurs, new businesses

  • business ethics,continual improvement process,corporate governance,corporate social responsibility,environmental health,environmental impact,environmental impact assessment,environmental management system,environmental resource management,environmental responsibility,good practice,governance,health safety,impact assessment,information systems,integrated management system,management facilities,management systems,procurement,product development,professional development,project delivery,project management,quality assurance,quality control,quality management,quality systems,resource management,risk assessment,risk management framework,safety,safety management,safety management systems,strategic management,strategic planning,strategy,successful delivery,supply chain,sustainability,sustainability performance,sustainable development,transformation programmes


We have seen that the rich set of data that can be collected from the open web can provide indicators of high productivity companies. This includes data collected from the organisation’s website, social media or online news services. Each of these contains positive indicators of productivity. In particular, we can look at the key topics that the organisations talk about as indicators of productivity. Export and Innovation appear to be high productivity indicators, but clusters around fleet operations, brand protection, corporate governance, energy efficiency and customer service also seem to correlate with high productivity. Although correlation is not an indicator of causality, it does provide plenty of clues for further investigation.

We would like to thank the WYCA team for inviting us to work on this project, in particular, James Hopton and Alex Clarke for interesting discussions on the drivers of productivity.

Exploring the Artificial Intelligence patent landscape.


Artificial Intelligence (AI) is growing at a great pace and is spreading across many industry sectors. The UK government is dedicated to advancing the UK’s AI sector, which is estimated to add £630bn to the UK economy by 2035; AI is one of the four Grand Challenges forming the UK government’s Industrial Strategy which aims to boost the productivity and earning power of people across the UK.

The Intellectual Property Office in the UK has published a report with Glass.ai that studies technology trends in the AI patent landscape around the world, with some additional focus on the UK. It is one of the first to look more closely at patenting activity within the UK’s AI sector and how this compares with other countries. It provides insights into the leading UK-based applicants in the field and the location and extent of their future markets, as well as attempting to identify specific strengths within the UK’s AI sector.

Read the full report here.

70% of the internet isn’t there, and the useful internet is smaller than we think.


As part of our ongoing AI-driven economic and social research at Glass, our AI is continuously reading the internet to discover new businesses, understand the activities they are involved in and map those activities to their appropriate industries. We are digitally mapping the world’s economy. As well as enabling insights into markets and industry sectors, our robo-researcher has uncovered interesting insights about the makeup of the internet itself. Here is what we found:

According to the recently published Q4 2018 Verisign Domain Name Industry Report, there are 348.7 million domain names registered across the globe. The Glass AI intelligent web researcher has read just over half that number of domains. What it discovered is that the vast majority of registered domains are not linked to regular working websites. Whilst a large number of these are simply dead domains without an active website (44%), we also discovered that a large number of active sites are not providing web content.


Parked domains constitute 22% of the total and fall into several categories. Often a parked domain is simply used as a link farm to serve ads for other sites or to improve a site’s visibility to search engine algorithms. A parked domain may also be there as a means to sell the domain name itself, or the site’s hosting company may be using a lapsed domain registration as an opportunity to advertise their own services. Finally, the site on a domain might present a holding page whilst the site is being developed or under maintenance. Sometimes these sites can be in this state for years!

(A) Dead site, (B) Link farm, (C) For sale, (D) Host services, (E) In development, (F) Redirects.

There are also several reasons why a domain may redirect to a site on a different domain (6% of the total). For example, a company may have changed its name but still want to keep its old address active to point to its new location, or it could have been acquired, in which case the new parent company wants the old web address to link into their own content. Or the redirected domain may just provide a name that is easier to remember for accessing content within some other platform, in particular linking to a social media presence. Verisign’s own analysis shows that a growing number of domains are used for this purpose, increasing by up to 50% in the previous year for certain social media sites. There are also drivers from the complex waters of search engine optimisation, where an organisation wants to present different access points for their site. We have seen that this can sometimes be taken to extremes. For example, we discovered a consulting company promoting a methodology they called S.W.I.M that had hijacked the expired domains of hundreds of swimming clubs in order to try to expand its reach. You can’t imagine that anyone following a link to their local swim club, only to be redirected to a consulting firm, would have been particularly pleased!

(G) Duplicate services, (H) Duplicate locations.

The other category is duplicate sites. This occurs when multiple domains contain exactly the same or very similar content. Again the driver is search engine optimisation, trying to improve the presence of an organisation on search engines. Where exactly the same content appears, it’s a similar picture to multiple domains using redirects, but there are other more subtle cases. One example is where an organisation wants to use different domains to highlight the different services they offer. The content is very similar on each site, but perhaps the landing page differs. Alternatively, an organisation may offer a service in different areas and have created a domain for each specific town or area. We have observed cases where a local plumber has hundreds of web domains, one for each town or village in the area they cover. The sites are the same with the exception of a focus on a different location.
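Detecting this kind of duplication can be sketched with word shingling. This is a simplified stand-in for how near-duplicate content might be flagged; the shingle size, threshold and sample texts below are all arbitrary choices for illustration:

```python
def shingles(text, k=5):
    """Set of k-word shingles from a page's visible text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_duplicate(text_a, text_b, k=5, threshold=0.8):
    """Jaccard similarity over shingles; a high overlap suggests two
    domains are serving essentially the same content."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

# Invented page texts for illustration
site_a = ("we offer reliable plumbing services boiler repair and emergency "
          "call outs across the region contact our friendly team today")
site_b = "a completely different page about gardening and landscaping in yorkshire"
```

For the location-swapped plumber sites described above, only the shingles containing the town name change between domains, so their Jaccard similarity stays close to 1 and the duplication is easy to flag.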

So how big is the ‘useful’ internet?

With each of these “missing” segments covering large chunks of registered domains, that leaves under 30% of the internet delivering meaningful web content. We are talking about 100 million live websites. At Glass, our AI is tuned to identify business, non-profit, government and education websites with the aim of drawing insights on markets, social and economic activity. Based on the sites that have already been read and categorised by the AI, we estimate about one-third of the remaining live sites contain organisation content that fits into these categories. Other content currently ignored by our AI includes personal sites, blogs, and certain sorts of consumer-oriented sites. So, to digitally map the world’s economic graph, we estimate that we will need to read and understand approximately 32 million websites of which our AI system says 52% contain English language content. But as we’ve seen, even before we dig into the content that has been read, there are interesting things to be discovered about the structure of the internet itself. It appears the useful internet is smaller than we think!


Are Vanity Awards worth it? An experiment in identifying web influence.


You may have received an email yourself, congratulating your company on being named in the Top 30 or 50 or 100 Best, Fastest or Friendliest companies in A, B or C industry. To receive your award, all you need to do is pay a couple of thousand dollars, and a whole host of positive publicity is yours!

If you do a quick search for these awards you will find some negative comments on social media, in blog pieces and in other articles. But you will also find a lot of content published about the awards, so what’s the story? Are these awards useful for generating publicity? Maybe they are worth it?

At Glass, we have developed artificial intelligence that is able to understand natural language at scale and have pointed it at the internet to digitally map the world’s economy. Using our AI, we investigated whether these awards are useful marketing tools based on the footprint they leave on the web.


Share of web presence for vanity award news sources.

We know that there are negative comments about these awards on social media and in other articles, but it is interesting to look at the content that is generated when the awards are published. Looking at a sample of awards over the past year, the reach of the awards appears to fall into three categories: PR distribution services, copy-and-paste content, and the companies that have received the awards themselves.

  • PR distribution services typically offer low-cost mechanisms for sharing a press release to large numbers of online services and news outlets. They offer a spray and pray approach to public relations and seem to be the major outlet for the releases published by the awards.

  • The award releases do occasionally appear on other sites, but dig a little deeper and you will see that the content is identical to parts of the original release, and that the sites are minor company blogs, probably ones whose owners are looking for any story related to their own industry to help with their content marketing. Interestingly, given the nature of awards, these stories seem to get picked up less often than other releases published by the PR distributors (who wants to advertise competing companies?).

  • As you would expect, the companies receiving the awards also publish information about them on their own websites. This is no surprise, but it is interesting to look at the types of companies receiving the “top” awards. There are certainly exceptions, but judging by the other mentions they receive in the press, many of these organisations seem to have had very little impact on their own industry.

The evidence suggests there is very little chance of one of these award stories being picked up by mainstream or regular media outlets. So whilst they do generate a footprint on the web, as an organisation you might be better off developing your own unique story and sending it to the PR distribution services directly (another interesting piece of web analysis might be to examine the reach and impact of these services themselves). That said, nothing beats developing a good direct relationship with journalists covering your sector.

The OECD uses Glass to understand vast amounts of text and gather insights from the web.


The Organisation for Economic Co-operation and Development (OECD) today held a high-level event in Paris to map the way forward on digital era policymaking. We were kindly invited to attend as the event also saw the launch of a huge report measuring the state of digital transformation. Glass has contributed to this report and we continue to work with the OECD on other exciting projects. 

Measuring the Digital Transformation

Sound measurement is crucial for evidence-based policy making, but existing ‘official’ metrics and measurement tools struggle to keep up with the rapid pace of digital transformation. Governments and leading economic institutions like the OECD recognise this and are exploring new complementary approaches. Here is where Glass comes in. 

After discussions with senior economists and data scientists at the OECD, it was suggested that an initial case study would be to map the AI ecosystem in the UK. Using the Glass AI capability, the OECD team found the following:

Case Study: mapping the activities of UK AI companies

AI companies in different industries are developing and applying different types of AI-related technologies. It is widely held that Artificial Intelligence is permeating all sectors of the economy. However, little is known about which types of AI technologies and approaches are being developed and used in different sectors, and for which purposes.

Using the Glass AI capability to understand web text at scale, the OECD has identified about 6,000 AI-related companies in the United Kingdom alone in 2018, about 2,800 of which explicitly state on their websites that they are active in AI. These companies appear to combine different AI-related technologies and approaches depending on their field of application or area of activity. For instance, about 400 companies focusing on deep learning also rely on automation-related technologies and, to a lesser extent, on data analytics. About 300 companies advancing AI in robotics, the Internet of Things (IoT) and virtual reality (VR) focus on automation and, to a lesser extent, on natural language processing (NLP). About 250 AI companies couple analytics with recognition-related technologies when developing e-commerce-related AI. About as many again rely on different combinations of the same technologies in their data mining and business-solutions work.


AI-related companies in the United Kingdom, by the focus of activity, 2018

AI companies: shaping key sectors of the UK economy

The OECD has also gained more insight into the types of AI technologies that these companies are developing and applying by narrowing the focus to a few key sectors of the UK economy: Financial Services, Professional Services, and ICT manufacturing and service activities, which in 2017 together accounted for 22.7% of total employment (7.3 million persons, up from 6.0 million in 2010) and for 53% of investment (i.e. gross fixed capital formation, GFCF) in ICT equipment.

Of the 2,800 UK AI companies in the sample that explicitly state they are actively pursuing AI-related activities, 829 appear to operate in ICT manufacturing and services activities, 693 in professional activities and 162 in financial and insurance activities: sixty percent of the sample in total. The other forty percent is distributed across ten sectors ranging from agriculture to real estate and construction. Some of these companies are developing and using several types of AI-related technologies, whereas others appear to focus on only one area. Different technologies are also developed to different extents. UK AI-active companies in ICT manufacturing and services focus their efforts on technologies related to language processing, business solutions and deep learning. By contrast, companies in Professional Services are especially concerned with language processing, image recognition, and robotics, IoT and virtual-reality-related technologies. Finally, finance and insurance companies appear especially active in autonomous-vehicle-related technologies, in deep learning, and in robotics, IoT and virtual reality.


AI-related technologies developed by UK companies, by the main sectors, 2018


This study has demonstrated again the potential of AI-based research using the open web as a live mirror of complex social and economic issues. We are entering a new era of economic measurement where national statistical offices will have to open up to new technologies and sources of data. We are very excited that the OECD is now using Glass to expand their evidence base and to gather economic insights from the web. It’s the start of a collaboration that will lead to many exciting opportunities and it is also helping us in our mission to digitally map the world’s economy. More to come!

Better Language Model Benchmarks Needed


Recent language modelling research published by OpenAI has gained quite a bit of coverage. The stir has largely been around OpenAI’s decision to withhold the full models and code used in the research, because it claims they could be used for malicious purposes (e.g. fake news generation, or simulating a person’s behaviour online). The backlash online has even suggested that, maybe, these were withheld precisely to generate more noise. The potential for malicious use has certainly been what many have focused on, as reflected in this article on the BBC.

What has been mostly overlooked is the research itself. The researchers built a language model (GPT-2) from 40GB of text data collected from the web and claim that it achieves state-of-the-art results on a wide variety of language tasks, from text generation and question answering to translation, reading comprehension and summarisation. They state this is particularly significant because, unlike other models trained using domain-specific supervised learning methods, GPT-2 has been trained on a general corpus with no knowledge of the language tasks.

This may indeed be impressive, but if you consider how the language model achieves these results, it reflects badly on the current state of the art in Natural Language Processing and on the set of tests researchers use to track progress. The OpenAI model has been built simply to predict the most likely next word given the words that precede it. There is no sense in which it is reading language, no understanding, nothing. You could consider it nothing more than a parlour trick, albeit one that demonstrates impressive results, probably due to the sheer scale of text consumed to build the model.
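What "predicting the next most likely word" means can be made concrete with a toy sketch. The corpus and bigram counter below are illustrative only (GPT-2 itself is a large transformer), but they are trained on exactly the same kind of signal:

```python
from collections import Counter, defaultdict

# A toy illustration of the training objective: predict the next word from
# the words that precede it. This is a bigram counter, nothing like GPT-2's
# architecture, but the objective is the same.

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1  # count which words follow each word

def predict_next(word):
    """Return the most frequent next word seen after `word`."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("sat"))  # "on": fluent-looking output, with no understanding
```

The model emits plausible continuations without any notion of meaning, which is precisely the point being made above.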

At Glass we are also using the web as our corpus, but we are building language and meaning into our models from the start to deliver language understanding with the support of common sense knowledge. We believe this is essential if progress is going to be made towards robust language understanding. We have begun by modelling the language that a business uses to talk about themselves to see what insights our AI can gain about market sectors and the economy at large. Over time we hope these models will grow to cover many other domains.

What the research from OpenAI does highlight is that the community needs better ways of testing language understanding if we are going to make any significant progress towards getting machines to better understand language. For that, we thank them.

How we used our AI to uncover gender bias in the UK workplace


 At Glass, we have developed a new system for deriving large scale social and economic insights from the web and other sources. Our AI can understand written language. 

 In today’s post, we’re going to walk you through a formal experiment in data science we recently completed, where we looked at a pressing and surprisingly underreported real-world problem - gender inequality in the UK workplace. 

We’ll explain why we took on the challenge of trying to read the entire .UK domain to do this, and some of the issues we encountered on the way. To our knowledge, this is the first systematic analysis at this scale, and our results make for disquieting reading for business leaders (both great and small) across the UK. 

How our work is different

The internet is big. And it keeps getting bigger. So just as an astronomer might be interested in the formation of star systems, at Glass we’re interested in understanding large-scale activity, as revealed through the traces of activity seen across the world’s constantly expanding universe of published content. This includes the ability to track information on the move as it changes shape: for example, the dynamics of a news topic as it unfolds over time. How did it get there? Where’s it going next? Who’s talking about what, where?

Previous related studies created for economists, policy-makers, or business analysts have tended to underuse or even ignore the web as a data source, typically only looking in any detail at a limited number of sectors of the economy, examining a small slice of geography or conducting manual (and expensive) surveys. Worse, given a small data set, data scientists have no choice but to extrapolate and rely on small sample statistics. 

This is fine... if you want a big, blurry picture. But what if you want more pixels? 

Greater resolution offers you a much finer view of the data: that’s why we needed to read over 200 million web pages just for this work. So Glass is a new kind of lens, and one we hope will make a real difference.

Lastly, we need to make sense of what we read from the web. So-called ‘natural language understanding’ (the kind we humans use, as opposed to the kind computers do) is a hard problem. For example, humans don’t just consume or create a stream of unambiguous symbols: words are both slippery and locally sensitive: we understand what words really mean from context, from the words around the words. Even the absence of certain words can determine a completely different context, hence a different meaning. So our challenge is not just to find a bunch of keywords, but to make sense of how language actually works. 

Time for the experiment

In 2018, if you are female you can (according to official statistics) expect lower pay, worse prospects of promotion, and a greater likelihood of being in one kind of industry over another. But what exactly does this look like, and how does it manifest itself?

What can a new artificial intelligence (AI) pointed at websites tell us about the problem that existing methods can’t? How accurate would it be? And could it offer any new insight or uncover more detail on this important question? 

The work

We trained our AI on the entire .UK domain, and read the genders of 2.3m people and the positions they held in 150,000 organisations, across 108 industry sectors. We filtered out holding pages, low content pages, social media sites, retailers, blogs and service-oriented sites, such as search engines, because we wanted to know how UK businesses and organisations depicted themselves. Remember that these organisations have no legal obligations to present their staff with balance, and no expectation of being held to account for their choices: in that sense, the web is an unselfconscious snapshot of the organisations in it, trying to look their best.

Some sectors of the economy are ‘dark’, barely present online - the tobacco industry, for example. Some sectors create a disproportionate amount of noise - media and marketing industries, for example. So inevitably there are skews in the data to be accounted for. 

But one remarkable result is that our figures precisely match the ONS (Office for National Statistics) data at the top level, and underneath, in finer resolution, we see the full picture: massive divergence between the sexes in certain roles and industry sectors. In effect, we see gender segregation at work, with only 5% of the hundred-odd sectors we surveyed showing balanced workforces.

Here’s a snapshot:


Why is this under-reported?

You’d think with such clear and present inequality across sectors, the media would be jumping up and down to document it. Contrary to intuition, it turns out that the media business is no champion of equality either.


Find out more

This was just a brief taster of our study. We believe this unique study opens the door for more AI-based research using the internet as a live mirror of society and points to new ways for monitoring these complex issues, as well as tracking the policy initiatives intended to tackle them.

You can get the research highlights in friendly form here, or you can read the full paper published in the Heliyon journal here. We hope you’ll read and share.

Understanding the characteristics of high growth companies using non-traditional data sources


A new study into high growth companies by the Office for National Statistics (ONS) Data Science Campus has used web content read by Glass AI to understand the characteristics that may lead to high performance.

Research into the characteristics of high growth companies to date has tended to use traditional datasets and methods. “Non-traditional data” in this context broadly refers to data initially collected for a purpose other than statistics, research or administration. For example, data collected about a company from the web.

For this study, Glass shared web content from a random sample of 30,000 UK active companies. Active companies were determined by tracking changes to the web site within a couple of months prior to the delivery. The data included company descriptions, sector classifications, other company mentions, news articles, job adverts and people biographies.

The analysis from the ONS confirmed existing research showing that it is difficult to predict high-growth firms. However, the analysis of the web content showed that the use of certain key terms, and being well networked with other companies, are features associated with high-growth firms. Given further data, these insights could be developed to help tailor targeting and policy towards businesses with high-growth potential.

Read the full report here.

Using web content to better understand business activities


In the UK, Standard Industrial Classification (SIC) codes are used to categorise businesses based on their activity. Policymakers and analysts use this official taxonomy to measure sectors, identify stakeholders to engage with, to develop policies, and to measure the impact of policies. However, SIC codes have three important limitations:

  • First, a high proportion of UK businesses currently classify themselves as ‘Other’. At the moment there is limited evidence about what kind of activities businesses in the ‘Other’ SIC codes are engaged in, which means policymakers have little understanding about the activities of a significant part of the UK economy;

  • Second, some UK businesses are engaged in types of economic activity which do not sit well within SIC. Examples include businesses engaged in ‘low-carbon’ activities or in the ‘immersive economy’. Official industry codes (last updated in 2007) fail to capture these new sectors, resulting in a lack of evidence that hinders policy making;

  • And third, as businesses become increasingly more dynamic, innovative and technology-driven, they also perform cross-sector activities. In this context, SIC codes also fail to accurately capture the variety of business activities.

The UK government is aware of these limitations, and as a result of the Review of Economic Statistics¹, it has started investigating the economic activities undertaken by UK businesses and how these are reflected in the SIC codes.

At Glass, we believe that textual data from websites can provide deep insights into the economic activities of businesses. We’ve developed AI technology that reads and interprets the web, and in the UK our engine has mapped — for the first time — the entire economy based on its web presence, that is 1.4 million UK businesses across sectors and geographies.

With this new capability we decided to run a new experiment.


We investigated the activities of UK businesses classified as ‘Other’ and ‘None supplied’ within the official SIC codes taxonomy. To do this, we took a random sample of UK businesses and mapped their web data in Glass against their official information in Companies House (CH). Our process followed several steps within two main parts: core technology and mapping.

Core technology

Reading the websites

To identify the UK businesses, our crawler was set to read websites that target a UK audience or have adopted the .uk domain address. Websites were considered if they were written in English, had mentioned a UK address on their pages, and had some depth of representation for the business in question.

Starting with over 200 million web pages, our engine identified approximately 1.4m UK businesses with a website. Each website was read and relevant text entities (e.g. business descriptions, addresses, people) were detected with state-of-the-art precision (> 95%). The different entities were identified using an AI model that considers multiple features such as location on the web page, use of specific keywords and phrases, sentence structure etc.
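The eligibility criteria above can be sketched as a simple filter. This is a hedged illustration only: the real Glass pipeline uses an AI model over many page features, and the helper name, postcode regex and three-page "depth" threshold below are assumptions, not the production logic.

```python
import re

# Illustrative-only UK postcode pattern, used as a proxy for "mentions a UK address".
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

def is_candidate_uk_site(domain, language, page_texts, min_pages=3):
    """Combine the published criteria: English content, some depth of
    representation, and a UK signal (.uk domain or a UK address on a page)."""
    if language != "en":
        return False
    if len(page_texts) < min_pages:  # crude proxy for "depth of representation"
        return False
    return domain.endswith(".uk") or any(UK_POSTCODE.search(t) for t in page_texts)

print(is_candidate_uk_site("example.co.uk", "en", ["About us", "Contact", "Services"]))  # True
```

A non-.uk domain would still pass if a UK postcode appears in its page text, mirroring the "target a UK audience" criterion.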

Assigning sector(s)

Based on the descriptions, key topics, links on the homepage and other attributes, the businesses were automatically classified into one or more economic sectors. The Glass sector taxonomy comprises 108 sectors and has been trained using a sample of sector classifications from LinkedIn. Businesses with well-defined attributes were assigned a single sector, while those with diversified activities received multiple sector predictions. For the purpose of this research, we only considered the first (and most representative) sector.

Companies House Mapping


After assigning the sectors, we randomly selected, from the 1.4m UK businesses we had information on, a sample of 400k businesses with address¹ information. We then used the CH data to get the names, addresses and SIC codes for these companies.

Pre-processing & matching

From the CH dataset, we selected only active and non-dormant companies. At this point, both the CH and the Glass business names were cleaned and normalised (removing punctuation, stop words, whitespace, company-type abbreviations, etc.). We then matched the official data with the web data using a fuzzy match on name and an exact match on postcode. Since addresses were a significant signal in the mapping, we excluded businesses whose Registered Address differed from their Trading Address. For the name matching, we tried multiple similarity/dissimilarity metrics; the best performers were the Jaccard index, cosine similarity and the number of overlapping words.
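The cleaning and similarity steps can be sketched in a few lines. The abbreviation list and the token-level versions of the metrics below are illustrative assumptions; the production system's feature set and thresholds are not shown here.

```python
import re
from collections import Counter
from math import sqrt

# Illustrative list of company-type abbreviations to strip during normalisation.
ABBREVIATIONS = {"ltd", "limited", "plc", "llp", "inc"}

def normalise(name):
    """Lowercase, strip punctuation and company-type abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return [t for t in tokens if t not in ABBREVIATIONS]

def jaccard(a, b):
    """Jaccard index over token sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    """Cosine similarity over token counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

a = normalise("Acme Widgets Ltd.")     # -> ["acme", "widgets"]
b = normalise("ACME WIDGETS LIMITED")  # -> ["acme", "widgets"]
print(jaccard(a, b))                   # 1.0 after normalisation
```

In practice a candidate pair would only be accepted when the name similarity clears some threshold and the postcodes match exactly.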

Glass to SIC results

This exercise resulted in 100k organisations² that were successfully matched. The top matched SIC codes had accurate equivalents in the Glass sector classification (Table 1).

Table 1. Top SIC (by matches) to five Glass sectors


Analysis of ‘Other’ SIC codes

Approximately 6% of all the SIC codes are labelled as ‘Other’. More strikingly, on the full CH data set³, the current SIC taxonomy fails to fully capture activity information for almost one-third of UK businesses (approx. 30% of businesses in CH are classified as ‘Other’⁴). This is strong evidence that many registered UK companies have not chosen, or could not choose, an accurate SIC code and are therefore misclassified and misunderstood from a policy-making perspective.

In our analysis, we saw that the SIC code Other business support service activities had the most matches with the web data (18.14%; 5,175 businesses) (Table 2). This SIC code, along with Other service activities, was also the most diverse in terms of sector coverage, spanning 103 of the 108 Glass sectors.

Table 2. Top ‘Other’ SIC codes


We further examined the top two ‘Other’ SIC codes. First, we looked at their sector distribution with regard to the Glass sectors, and second, we inspected the proportion of ‘Other’ within each sector. The top two SIC codes were the most ambiguous about company activities, even though they were part of clearly defined SIC sections⁵.

We also discovered that businesses performing ‘Staffing-related’ activities had the highest proportion (5.2%) across all the ‘Other’ SIC codes (Table 3). This could mean that this sector has one of the poorest SIC descriptions, or that staffing-related companies tend to perform cross-sector activities. We noticed a similar situation, at a lower proportion, with ‘Hospitals’. By contrast, companies specialising in ‘Jewellery and Wholesale’ accounted for the lowest share of ‘Other’. The top two ‘Other’ SIC codes had slightly different sector distributions, with most companies in the Financial Services and Professional Training sectors.

Table 3. Glass sectors with highest/lowest matches among ‘Other’ SIC


Another insight was that more than half of ‘Staffing’ and ‘Health & Wellness’ businesses classify themselves as ‘Other’ (Table 4). Why is this figure so high? This is an area of further research that could be addressed using additional data.

Table 4. Glass sectors with highest/lowest proportions of ‘Other’ SIC codes


Analysis of ‘None Supplied’ SIC codes

In CH, missing information on company activity is evidenced through the ‘None Supplied’ SIC codes. Choosing a SIC code at the moment a company is set up⁶ has been mandatory since 2016. Previously, the data was provided on the first annual return (now called the confirmation statement).

Based on our analysis, 5.7% of registered CH businesses did not provide a SIC code. In our matching with the web data, only 3.2% of matched businesses had a ‘None Supplied’ SIC code. This could suggest that these businesses are less likely to have a web presence.

The Glass sector ‘Law Practice and Services’ was the dominant sector among companies with a ‘None Supplied’ SIC code (Table 5). One possible explanation is that law firms tend to be partnerships (i.e. not registered in CH) and as a result there is no appropriate SIC for law firms in the official taxonomy. We learned that this is not the case, as the ‘Legal and Accounting’ SIC code can capture the activities of law firms. We noticed that the top Glass sectors in the ‘None Supplied’ SIC category are professional services sectors. Certainly, with the use of text-rich company descriptions and topic data from business websites, we can get a better understanding of what these businesses actually do.

Table 5. ‘None Supplied’ SIC — Glass sector breakdown



This quick experiment, matching web data with official data, sheds some light on the kinds of activities that UK businesses in the ‘Other’ SIC codes are engaged in. Professional services businesses related to staffing and training seem to be the most poorly classified in Companies House. We have also learned that law, accounting and investment-related businesses do not always choose a descriptive SIC code. This in itself could be an interesting line of enquiry for another piece of research.

More detailed research could also be done with the UK Glass data. For example, we could look at the specific topics that companies use to describe their activities, help analysts categorise businesses that are active in multiple sectors and, as shown in several reports, use the open web to better understand the sizes of emerging sectors that do not sit well within official statistics.

End notes

[1] We limited the number of addresses to ten per company.

[2] 96% accuracy.

[3] Active and Non-Dormant companies.

[4] SIC codes labelled ‘Other’ capture more or less Industry information. For example, Other service activities do not offer enough information on company activity, whereas Other manufacturing gives a clear indication of the Industry.

[5] Each SIC code is part of a broader industry section. 82990 belongs to section N (Administrative And Support Service Activities) and 96090 is part of section S (Other Service Activities).


Bean, C. (2016). Independent Review of UK Economic Statistics. HM Treasury and Cabinet Office.


A comparison of UK sectors based on web presence and official statistics


Company websites have become an essential marketing tool to promote brands, products and services, to attract future employees, collaborate with partners, and to interact with current and potential customers. At Glass we believe that any organisation that “matters” in the economy is likely to have a website. Furthermore, we believe that company websites can provide useful clues around the sizes, the strategies, the networks, the sectors and the growth rates of companies, as this recent Office for National Statistics (ONS) post suggests.

In this blog post, we analyse the web presence of the different sectors in the UK economy and compare the results with the breakdown of sectors presented by official statistics.

The web is a digital copy of a large part of the economy but there are differences in the web representation of sectors when compared to official data.

Some sectors that have a large volume of companies in the official statistics seem to have a smaller presence on the web. This post aims to uncover some of the differences and suggest potential explanations.

UK businesses with a website

According to the UK Office for National Statistics, 45% of all UK businesses have a website. This percentage seems quite low, so we decided to investigate further. Our data scientists carried out some tests using Companies House data, and the results showed that approximately 30% of all UK registered companies have a website: surprisingly, even lower than the ONS figure. This means that of the 4 million companies registered in Companies House, approximately 1.2 million (30% of the total) have a website. It is worth noting that not all UK “businesses” are listed in Companies House. The ONS Business Population Estimates (BPE) count a total of 5.7 million businesses, including the 4 million companies from Companies House as well as sole traders, partnerships, and government organisations.

Based on these numbers, we have estimated there are approximately 1.7 million organisations in the UK with a website, 1.2 million of which are companies. The Glass AI engine currently reads and interprets the websites of 1.5 million UK businesses.
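The arithmetic behind these estimates can be cross-checked directly; every figure below comes from the paragraphs above.

```python
# Quick cross-check of the website estimates quoted above.
ch_companies = 4_000_000            # companies registered in Companies House
with_site_rate = 0.30               # share with a website, per our tests
companies_with_site = ch_companies * with_site_rate      # ~1.2 million

bpe_total = 5_700_000               # ONS Business Population Estimates
non_ch_businesses = bpe_total - ch_companies             # sole traders, partnerships, ...

organisations_with_site = 1_700_000                      # our overall estimate
other_orgs_with_site = organisations_with_site - companies_with_site  # ~0.5 million
print(round(companies_with_site), non_ch_businesses, round(other_orgs_with_site))
```

So the 1.7 million estimate implies roughly half a million non-Companies-House organisations with a website, on top of the 1.2 million companies.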

The relationship between sector and web presence

Using the UK web data and the official data from Companies House, we followed several steps to understand the presence of different sectors online versus their share in official statistics:

First, we mapped the official SIC codes to the Glass sector taxonomy of 108 sectors. The SIC classification contains 732 codes which are part of broader industry sections. In our experiment, we managed to assign Glass sectors to most SIC codes. For each SIC, the most representative Glass sector was the one with the maximum number of matches.
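The "maximum number of matches" rule can be sketched as follows. The SIC codes and Glass sector labels here are made-up examples, not real match data.

```python
from collections import Counter, defaultdict

matches = [  # (sic_code, glass_sector) pairs from matched businesses
    ("43210", "Construction"),
    ("43210", "Construction"),
    ("43210", "Electrical Services"),
    ("69102", "Law Practice and Services"),
]

# Count, per SIC code, how many matched businesses fall into each Glass sector.
by_sic = defaultdict(Counter)
for sic, sector in matches:
    by_sic[sic][sector] += 1

# For each SIC code, keep the Glass sector with the most matched businesses.
sic_to_sector = {sic: counts.most_common(1)[0][0] for sic, counts in by_sic.items()}
print(sic_to_sector["43210"])  # Construction
```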

In a second step, we decided to exclude from the Glass and Companies House datasets those organisations that are outside the private sector. Also, we excluded sectors that were poorly represented, including:

  • Governmental organisations;

  • Charities and Foundations;

  • Schools and Universities;

  • Sectors with insufficient information.

Third, after filtering and assigning the different sectors, the Glass-to-SIC mapping covered 85 Glass sectors (out of 108). Depending on the activity, each sector had a different number of corresponding SIC codes (see examples in Table 1). Food and Beverages had the highest number of mapped SIC codes, covering activities related to the production, distribution and service of food and beverages. At the other extreme, the sector Libraries had only one SIC code assigned, as it references a very specific activity.

Table 1. Example of Glass sectors mapped to SIC codes


As mentioned earlier, the Companies House dataset does not cover the entire UK economy: it contains information on about 4m UK businesses, of which a sizeable proportion are dormant. At this point for each group of SIC codes, we determined the number of active and non-dormant registered businesses and calculated the relative proportion of each group in Companies House. A similar approach was used to determine the share of sectors based on the UK web data.

The final step was to compare the UK web with the Companies House sector breakdown. We aimed to quantify the differences and identify which sectors seem to be overrepresented or underrepresented on the web. We also performed a geographical comparison for the UK. As seen in Figure 1, each UK region has a slightly different representation on the web by volume (compared to official statistics).

Figure 1. Web versus Companies House

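The comparison boils down to a per-sector difference between the two share distributions. A sketch, with input shares invented so that the differences match the figures discussed below (2.47 and -6.62 percentage points):

```python
def share_differences(web_shares, ch_shares):
    """Percentage-point difference (web minus Companies House) per sector.

    Positive values suggest a sector is over-represented on the web;
    negative values suggest under-representation.
    """
    sectors = set(web_shares) | set(ch_shares)
    return {s: 100 * (web_shares.get(s, 0) - ch_shares.get(s, 0))
            for s in sectors}

# Illustrative shares, not the real measurements.
web = {"Health & Fitness": 0.0450, "Real Estate": 0.0200}
ch = {"Health & Fitness": 0.0203, "Real Estate": 0.0862}
diffs = share_differences(web, ch)
print(round(diffs["Health & Fitness"], 2))  # 2.47
print(round(diffs["Real Estate"], 2))      # -6.62
```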

Results of the sectors analysis

Several factors can influence the representation of companies and sectors on the web and official statistics:

  • As we know, not all officially registered companies have a website;

  • Companies from specific sectors are more likely to have a website;

  • Some UK regions and counties may have a different business presence on the web. This can be influenced by the regional industry composition or by the policies of regional and local governments. A geographical (regional) comparison is an interesting research topic that we aim to address in a future analysis.

You can view the data used for our analysis here.

Sectors with the highest share

The Glass sectors Real Estate & Property Management and Construction were the top sectors by volume in the official statistics. This is not surprising, as the 2017 BPE figures showed Construction as the top industry by number of UK private businesses. We noticed that the proportions for these two sectors were quite high (above 8.5% each), given how varied the overall Companies House sector distribution is.

The analysis of the web data, however, showed a different behaviour: we saw lower proportions for the dominant sectors and less extreme values. Hospitality and Restaurants (4.3%) was the top sector, closely followed by Construction with 4.2% of the total number of businesses.

Table 2. Sectors with the highest share by volume (representation)


Our initial conclusion was that the web and the official data show different sector breakdowns. One potential reason for the difference is that not all registered businesses have a website, and this is particularly common in some specific sectors. The next part of our analysis tried to identify the reasons for these differences by comparing sector shares directly.

Sector representation on the web

More than three-quarters of the Glass sectors (78%) had a higher web presence than expected based on the official statistics, while 19% had a lower web presence. Sectors that were over-represented online were, on average, 0.5 percentage points more present on the web than in the official data; for sectors with a bigger Companies House share, the average gap was 1.4 percentage points.
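These summary figures follow directly from the per-sector differences. A minimal sketch with made-up difference values (in percentage points, web share minus Companies House share):

```python
def representation_summary(diffs):
    """Summarise over- and under-representation from per-sector
    share differences (a {sector: difference} mapping)."""
    over = [d for d in diffs.values() if d > 0]
    under = [d for d in diffs.values() if d < 0]
    return {
        "share_over": len(over) / len(diffs),  # fraction of sectors over-represented
        "mean_over": sum(over) / len(over),    # average over-representation
        "mean_under": sum(under) / len(under), # average under-representation
    }

diffs = {"A": 0.6, "B": 0.4, "C": 0.5, "D": -1.4}  # illustrative values
s = representation_summary(diffs)
print(s["share_over"], round(s["mean_over"], 2))  # 0.75 0.5
```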

Health, Wellness and Fitness businesses were 2.47 percentage points more represented on the web than in Companies House, while Real Estate and Property Management had a share 6.62 percentage points larger in the official statistics.

Could it be that many Construction, Real Estate and Consulting businesses choose not to have a website? It may be that most companies in these sectors are too small to have a website or that the nature of the sector means that businesses do not necessarily need a website. Another hypothesis is that many individuals in these sectors may choose to set up limited companies to offer their services in a more tax-efficient way compared to self-employment.

Table 3. Differences between sector shares


In addition, we saw some interesting patterns when we further inspected which sectors are overrepresented in Companies House and on the web. These sectors could be clustered around two types of activities:

1. Sectors under-represented on the web, which tend to have lots of micro-businesses (e.g. Consulting, Construction) or to focus on professional services activities (where people are the “product/service”). According to the BPE (2017), 83% of Construction businesses are “sole proprietorships and partnerships with only a self-employed owner-manager and companies with one employee, assumed to be an employee director”. We could argue that for small and micro businesses in the construction sector, having a website may not add that much value. We also noticed that the Retail and Food and Beverages sectors had a higher share in Companies House compared to their share of the UK web. This could be related to Companies House including many local businesses and corner shops registered as companies, which again, might not necessarily need a website.

2. Sectors over-represented on the web, for example, sectors in Leisure-related activities (e.g. Wellness & Fitness, Travel, Photography). These sectors are more outwardly facing, possibly aiming for broader audiences. We suspect these sectors may have many sole traders that are not necessarily registered as a business in Companies House. For these types of businesses, having an online presence brings a significant advantage.


This comparison of official data with web data was another attempt to understand and gain insights into the UK economy. We discovered that some sectors with a large share in the official statistics had a smaller presence on the web, while other sectors seemed to be overrepresented on the web, probably because a website matters more for their economic activity. This was our first attempt to look into this area.

Our next blog post will try to identify the top companies in the UK based on their web presence; in other words, we will use open web data as a proxy for estimating the size and importance of UK companies.

Flying High, a huge study of the UK drones industry with Glass data


This week has seen the launch of a huge study from Nesta, in partnership with Innovate UK, mapping the UK Drones industry with Glass data. The 225-page report explores how the UK can become a world leader in drone technology. It also outlines some of the challenges facing urban implementations and makes some policy recommendations.

How did Glass help?

We were introduced to the team at the Nesta Challenge Prize Centre. They needed help trying to identify companies already operating in the drones sector and UK universities with research strengths in the area. Glass has mapped the entire UK economy based on its web presence, so using the product (currently in private Beta) we were able to quickly find 700+ relevant organisations in the UK. You can see the results in this excellent interactive map.

Usually, to produce this type of research, data scientists and market analysts have to spend time searching on Google or rely on official data. They may also have to pay substantial amounts of money for database subscriptions or to buy reports. With Glass, it was possible to map and gather knowledge about the UK drones ecosystem in a few minutes, rather than the days or weeks it would normally take.

We believe that the web, the largest source of knowledge ever created, can provide a lot of insights into markets, emerging themes, economic activity and society. This report is another example of what’s possible with open web data and the product we are building. Stay tuned, we are just getting started!

Digital Catapult selects Glass for their Machine Intelligence Garage


We are pleased to announce that Glass has been selected to take part in the Machine Intelligence Garage, a programme launched by Digital Catapult to support the UK’s role as a global centre for artificial intelligence development.

The Garage is designed to help promising AI startups with a well-defined business idea and technical capability, for whom access to computation power is a barrier to growth. As a young technology startup, we at Glass are limited by the amount of computation power we can use. Participation in the Machine Intelligence Garage will significantly increase the amount of open web data that we can read to train and improve our intelligent crawler.

Applications were assessed based on the strength of the idea submitted and technical implementation plan, availability of data, and the immediacy of the need for computation power. The companies’ ethical use of data was also paramount.

We continue in our mission to read the web and create an unparalleled resource for the world’s researchers.

Creative Nation, a new study using structured web data from Glass

Nesta, a global innovation foundation, has launched a new report that combines official and open web data to map the creative industries in the UK. The study was produced in collaboration with the Creative Industries Council.

The report highlights that creative industries are driving economic growth across the UK, on track to create one million new creative industries jobs between 2013 and 2030.

We helped Nesta map the scale of the creative industries across the UK. It’s very hard to measure the state of new sectors or emerging fields. Official industry codes that are used to measure the economy fail to capture new sectors, so official data is not very useful for measuring them. This results in a lack of evidence that hinders policy-making and the market research efforts of companies.

Nesta used Glass because our intelligent crawler has digitally mapped (for the first time) the UK economy, tracking any topic of interest across hundreds of millions of web pages, watching over a million organisations. This new resource that we are building allowed Nesta to identify businesses engaged in ‘creative’ activities and businesses in the UK ‘creative economy’.

Juan Mateos-Garcia, Director of Innovation Mapping at Nesta, said: “With Glass, finding relevant companies for our research took no time at all. We could not have achieved this without their ability to read the web at scale”.

Launch of a new report using Glass: the Immersive Economy in the UK


A new report has been published using structured web data from Glass. The study is the first of its kind to map the UK immersive sector. Commissioned by Immerse UK with funding from Innovate UK, the report provides hard data about the size of the sector, its performance, its geography, the drivers of success and the barriers to growth. It has identified and mapped the organisations developing and applying these exciting new technologies in the UK. The sector is growing rapidly, with 1,000 specialist companies and an expected turnover that could reach £1bn this year.

The report is another great example of what’s possible with open web data. With Glass, finding relevant companies for the research took no time at all. The analysis was then prepared by Nesta, a global innovation foundation. MTM London, a research consultancy, also conducted a business survey and in-depth interviews as part of the report.

Here is an excellent read from Nesta where they explain how the report came together (and the role Glass played in it). It highlights an exciting new approach to market research and innovation mapping using web data and machine learning techniques.