Are Vanity Awards worth it? An experiment in identifying web influence.

Winner.png

You may have received an email yourself, congratulating your company on being named in the Top 30 or 50 or 100 Best, Fastest or Friendliest companies in A, B or C industry. To receive your award, all you need to do is pay a couple of thousand dollars, and a whole host of positive publicity is yours!

If you do a quick search for these awards, you will find some negative comments on social media, in blog pieces and in other articles. But you will also find a lot of content published about the awards, so what’s the story? Are these awards useful for generating publicity, and might they be worth it after all?

At Glass, we have developed artificial intelligence that is able to understand natural language at scale and have pointed it at the internet to digitally map the world’s economy. Using our AI, we investigated whether these awards are useful marketing tools based on the footprint they leave on the web.

Vanity_Chart.png

Share of web presence for vanity award news sources.

We know that there are negative comments about these awards on social media and in other articles, but it is interesting to look at the content that the awards themselves generate when they are published. Looking at a sample of awards over the past year, the reach of the awards appears to fall into three categories: PR distribution services, copy-and-paste content, and the websites of the companies that have received the awards.

  • PR distribution services typically offer low-cost mechanisms for sharing a press release to large numbers of online services and news outlets. They offer a spray and pray approach to public relations and seem to be the major outlet for the releases published by the awards.

  • The award releases do seem to occasionally appear on other sites, but if you dig a little deeper then you see that the content is identical to parts of the original release and the sites they appear on are minor company blogs; probably where the owner is looking for any sort of story related to their own industry to help with their content marketing. Interestingly, given the nature of awards, these sorts of stories seem to get picked up less often — because who wants to advertise competing companies — than other releases published by the PR distributors.

  • As you would expect, the companies receiving the awards also publish information about them on their own websites. This is no surprise, but it is interesting to view the types of companies that are receiving the “top” awards. There are certainly exceptions, but judging by their other mentions in the press, many of these organisations seem to have had very little impact on their own industry.

The evidence suggests there is very little chance of one of these award categories being picked up by any mainstream or regular media outlets. So whilst they do generate a footprint on the web, as an organisation you might be better off developing your own unique story and sending it to the PR distribution services directly — although another interesting piece of web analysis might be to examine the reach and impact of these services themselves. That said, nothing beats developing a good direct relationship with journalists covering your sector.

The OECD uses Glass to understand vast amounts of text and gather insights from the web.

OECD 17.58.25.png

The Organisation for Economic Co-operation and Development (OECD) today held a high-level event in Paris to map the way forward on digital era policymaking. We were kindly invited to attend as the event also saw the launch of a huge report measuring the state of digital transformation. Glass has contributed to this report and we continue to work with the OECD on other exciting projects. 

Measuring the Digital Transformation

Sound measurement is crucial for evidence-based policy making, but existing ‘official’ metrics and measurement tools struggle to keep up with the rapid pace of digital transformation. Governments and leading economic institutions like the OECD recognise this and are exploring new complementary approaches. Here is where Glass comes in. 

After discussions with senior economists and data scientists at the OECD, it was suggested that an initial case study would be to map the AI ecosystem in the UK. Using the Glass AI capability, the OECD team found the following:

Case Study: mapping the activities of UK AI companies

AI companies in different industries are developing and applying different types of AI-related technologies. It is widely held that Artificial Intelligence is permeating all sectors of the economy. However, little is known about which types of AI technologies and approaches are being developed and used in different sectors, and for what purpose.

Using the Glass AI capability that understands web text at scale, the OECD has revealed the existence of about 6 thousand AI-related companies in the United Kingdom alone in 2018, about 2.8 thousand of which explicitly state on their website that they are active in AI. These companies appear to combine different AI-related technologies and approaches, depending on their application field or area of activity. For instance, about 400 companies are focusing on deep learning and are relying on automation-related technologies and, to a lesser extent, on data analytics. About 300 companies advancing AI in robotics, the Internet of Things (IoT) and virtual reality (VR) are focusing on automation and, to a lesser extent, on natural language processing (NLP). About 250 AI companies are focusing on analytics coupled with recognition-related technologies when developing AI technologies for e-commerce-related purposes. About as many companies rely on different combinations of the same technologies in their data mining and business solutions-related developments, respectively.

1_mpOTngcunV74RqMz73V6wA.png

AI-related companies in the United Kingdom, by the focus of activity, 2018

AI companies: shaping key sectors of the UK economy

The OECD has also gained more insight into the types of AI technologies that these companies are developing and applying by narrowing the focus to a few key sectors of the UK economy: Financial Services, Professional Services, and ICT manufacturing and service activities. In 2017 these sectors accounted for 22.7% of total employment (7.3 million persons, up from 6.0 million in 2010) and for 53% of investment (i.e. gross fixed capital formation, GFCF) in ICT equipment.

Of the 2.8 thousand UK AI companies in the sample that explicitly state that they are actively pursuing AI-related activities, 829 appear to operate in ICT manufacturing and services activities, 693 in professional activities and 162 in financial and insurance activities, a total of sixty percent of the sample. The other forty percent is distributed across ten sectors ranging from agriculture to real estate and construction. Some of them are developing and using several types of AI-related technologies, whereas others appear to be very much focused on only one area. Also, different technologies appear to be developed to relatively different extents. UK AI-active companies in ICT manufacturing and services are focusing their efforts on technologies related to language processing, business solutions and deep learning. By contrast, companies in Professional services are especially concerned with language processing, image recognition and robotics, IoT and virtual reality-related technologies. Finally, finance and insurance companies appear to be especially active in autonomous vehicles-related technologies, in deep learning, and in robotics, IoT and virtual reality.
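As a quick check on the shares quoted above, the arithmetic can be reproduced in a few lines. This is only a sketch: it takes the rounded figure of 2,800 from the text as the denominator, which is an assumption because the exact sample size is not given.

```python
# Counts quoted in the text for the three key sectors.
key_sector_counts = {
    "ICT manufacturing and services": 829,
    "Professional activities": 693,
    "Financial and insurance activities": 162,
}

# Assumption: "2.8 thousand" is treated as exactly 2,800 companies.
sample_size = 2_800

key_sector_total = sum(key_sector_counts.values())
key_sector_share = key_sector_total / sample_size

print(f"{key_sector_total} companies, {key_sector_share:.0%} of the sample")
```

Under that assumption the three sectors together come to roughly sixty percent, in line with the figure reported.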

OECD2.png

AI-related technologies developed by UK companies, by the main sectors, 2018

Summary

This study has demonstrated again the potential of AI-based research using the open web as a live mirror of complex social and economic issues. We are entering a new era of economic measurement where national statistical offices will have to open up to new technologies and sources of data. We are very excited that the OECD is now using Glass to expand their evidence base and to gather economic insights from the web. It’s the start of a collaboration that will lead to many exciting opportunities and it is also helping us in our mission to digitally map the world’s economy. More to come!

Better Language Model Benchmarks Needed

1_QJX-pWc4MMr_ckpXbE2chA.png

Recent language modelling research published by OpenAI has gained quite a bit of coverage. The stir has largely been around its decision to withhold the full models and code used in the research, because it claims that they could be used for malicious purposes (e.g. fake news generation, simulating a person’s behaviour online). The backlash online has even suggested that these may have been withheld precisely to generate more noise. The potential for malicious use has certainly been what many have focused on, as reflected in this article on the BBC.

What has been mostly overlooked is the research itself. The research has built a language model (GPT-2) from 40GB of text data collected from the web and has claimed that the model achieves state-of-the-art results on a wide variety of language tasks, including text generation, question answering, translation, reading comprehension and summarisation. The researchers state this is particularly significant because, unlike other models which have been trained using domain-specific supervised learning methods, the GPT-2 model has been trained on a general corpus with no knowledge of the language tasks.

This may indeed be impressive, but if you consider how the language model achieves these results then it reflects very badly on the current state of the art in Natural Language Processing and the current set of tests that are used by researchers to track progress. The OpenAI model has been built simply to predict what the next most likely word is given the words that precede it. No sense it is reading language, no understanding, nothing. You could consider it nothing more than a parlour trick, albeit one that demonstrates impressive results. Results which are probably due to the large scale of text consumed to build the models.

At Glass we are also using the web as our corpus, but we are building language and meaning into our models from the start to deliver language understanding with the support of common sense knowledge. We believe this is essential if progress is going to be made towards robust language understanding. We have begun by modelling the language that a business uses to talk about themselves to see what insights our AI can gain about market sectors and the economy at large. Over time we hope these models will grow to cover many other domains.

What the research from OpenAI does highlight is that the community needs better ways of testing language understanding if we are going to make any significant progress towards getting machines to better understand language. For that, we thank them.

How we used our AI to uncover gender bias in the UK workplace

Background

 At Glass, we have developed a new system for deriving large scale social and economic insights from the web and other sources. Our AI can understand written language. 

In today’s post, we’re going to walk you through a formal data science experiment we recently completed, in which we looked at a pressing and surprisingly underreported real-world problem: gender inequality in the UK workplace.

We’ll explain why we took on the challenge of trying to read the entire .UK domain to do this, and some of the issues we encountered on the way. To our knowledge, this is the first systematic analysis at this scale, and our results make for disquieting reading for business leaders (both great and small) across the UK. 

How our work is different

The internet is big. And it keeps getting bigger. So just as an astronomer might be interested in the formation of star systems, at Glass we’re interested in understanding large-scale activity, as revealed through the trace of activities seen across the world's universe of constantly expanding published content. This includes the ability to track information on the move as it changes shape: for example, the dynamics of a news topic as it unfolds over time. How did it get there? Where’s it going next? Who’s talking about what, where? 

Previous related studies created for economists, policy-makers, or business analysts have tended to underuse or even ignore the web as a data source, typically only looking in any detail at a limited number of sectors of the economy, examining a small slice of geography or conducting manual (and expensive) surveys. Worse, given a small data set, data scientists have no choice but to extrapolate and rely on small sample statistics. 

This is fine... if you want a big, blurry picture. But what if you want more pixels? 

Greater resolution offers you a much finer view of the data: that’s why we needed to read over 200 million web pages just for this work. So Glass is a new kind of lens, and one we hope will make a real difference.

Lastly, we need to make sense of what we read from the web. So-called ‘natural language understanding’ (the kind we humans use, as opposed to the kind computers do) is a hard problem. For example, humans don’t just consume or create a stream of unambiguous symbols. Words are both slippery and locally sensitive: we understand what words really mean from context, from the words around the words. Even the absence of certain words can determine a completely different context, hence a different meaning. So our challenge is not just to find a bunch of keywords, but to make sense of how language actually works.

Time for the experiment

In 2018, if you are female you can (according to official statistics) expect lower pay, worse prospects of promotion, and a greater likelihood of working in one kind of industry over another. But what exactly does this look like, and how does it manifest itself?

What can a new artificial intelligence (AI) pointed at websites tell us about the problem that existing methods can’t? How accurate would it be? And could it offer any new insight or uncover more detail on this important question? 

The work

We trained our AI on the entire .UK domain, and read the genders of 2.3m people and the positions they held in 150,000 organisations, across 108 industry sectors. We filtered out holding pages, low content pages, social media sites, retailers, blogs and service-oriented sites, such as search engines, because we wanted to know how UK businesses and organisations depicted themselves. Remember that these organisations have no legal obligations to present their staff with balance, and no expectation of being held to account for their choices: in that sense, the web is an unselfconscious snapshot of the organisations in it, trying to look their best.

Some sectors of the economy are ‘dark’, barely present online - the tobacco industry, for example. Some sectors create a disproportionate amount of noise - media and marketing industries, for example. So inevitably there are skews in the data to be accounted for. 

But one remarkable result is that our figures precisely match the ONS (Office for National Statistics) data at the top level, and underneath, in finer resolution, we see the full picture: massive divergence between the sexes in certain roles and industry sectors. In effect, we see gender segregation at work, with only 5% of the hundred-odd sectors we surveyed showing balanced workforces.

Here’s a snapshot:

blog_snapshot1.png
blog_snapshot3.png

Why is this under-reported?

You’d think with such clear and present inequality across sectors, the media would be jumping up and down to document it. Contrary to intuition, it turns out that the media business is no champion of equality either.

blog_gender_percent.png

Find out more

This was just a brief taster of our study. We believe this unique study opens the door for more AI-based research using the internet as a live mirror of society and points to new ways for monitoring these complex issues, as well as tracking the policy initiatives intended to tackle them.

You can get the research highlights in friendly form here, or you can read the full paper published in the Heliyon journal here. We hope you’ll read and share.

Understanding the characteristics of high growth companies using non-traditional data sources

1_cO2soEqr7FHPXYDo6FJuqQ.jpeg

A new study into high growth companies by the Office for National Statistics (ONS) Data Science Campus has used web content read by Glass AI to understand the characteristics that may lead to high performance.

Research into the characteristics of high growth companies to date has tended to use traditional datasets and methods. “Non-traditional data” in this context broadly refers to data initially collected for a purpose other than statistics, research or administration. For example, data collected about a company from the web.

For this study, Glass shared web content from a random sample of 30,000 active UK companies. Active companies were identified by tracking changes to their websites within a couple of months prior to delivery. The data included company descriptions, sector classifications, other company mentions, news articles, job adverts and people biographies.

The analysis from the ONS confirmed existing research showing that it is difficult to predict high growth firms. However, the analysis of the web content showed that the use of certain key terms and being well networked with other companies are features associated with high growth firms. Given more data, these insights could be developed further to help tailor targeting and policy towards businesses that could potentially be high growth.

Read the full report here.

Using web content to better understand business activities

Usingweb_blog.jpg

In the UK, Standard Industrial Classification (SIC) codes are used to categorise businesses based on their activity. Policymakers and analysts use this official taxonomy to measure sectors, identify stakeholders to engage with, develop policies, and measure the impact of those policies. However, SIC codes have three important limitations:

  • First, a high proportion of UK businesses currently classify themselves as ‘Other’. At the moment there is limited evidence about what kind of activities businesses in the ‘Other’ SIC codes are engaged in, which means policymakers have little understanding about the activities of a significant part of the UK economy;

  • Second, some UK businesses are engaged in types of economic activity which do not sit well within SIC. Examples include businesses engaged in ‘low-carbon’ activities or in the ‘immersive economy’. Official industry codes (last updated in 2007) fail to capture these new sectors, resulting in a lack of evidence that hinders policy making;

  • And third, as businesses become increasingly more dynamic, innovative and technology-driven, they also perform cross-sector activities. In this context, SIC codes also fail to accurately capture the variety of business activities.

The UK government is aware of these limitations, and as a result of the Review of Economic Statistics¹, it has started investigating the economic activities undertaken by UK businesses and how these are reflected in the SIC codes.

At Glass, we believe that textual data from websites can provide deep insights into the economic activities of businesses. We’ve developed AI technology that reads and interprets the web, and in the UK our engine has mapped — for the first time — the entire economy based on its web presence, that is 1.4 million UK businesses across sectors and geographies.

With this new capability we decided to run a new experiment.

Experiment

We investigated the activities of UK businesses classified as ‘Other’ and ‘None supplied’ within the official SIC codes taxonomy. To do this, we took a random sample of UK businesses and mapped their web data in Glass against their official information in Companies House (CH). Our process followed several steps within two main parts: core technology and mapping.

Core technology

Reading the websites

To identify the UK businesses, our crawler was set to read websites that target a UK audience or have adopted the .uk domain address. Websites were considered if they were written in English, had mentioned a UK address on their pages, and had some depth of representation for the business in question.

Starting with over 200 million web pages, our engine identified approximately 1.4m UK businesses with a website. Each website was read and relevant text entities (e.g. business descriptions, addresses, people) were detected with state-of-the-art precision (> 95%). The different entities were identified using an AI model that considers multiple features such as location on the web page, use of specific keywords and phrases, sentence structure etc.

Assigning sector(s)

Based on the descriptions, key topics, links on the homepage and other attributes, the businesses were automatically classified into one or more economic sectors. The Glass sector taxonomy comprises 108 sectors, and the sector classifier was trained using a sample of sector classifications from LinkedIn. Businesses with well-defined attributes were assigned a single sector, while those with diversified activities had multiple sector predictions. For the purpose of this research, we only considered the first (and the most representative) sector.

Companies House Mapping

Data

After assigning the sectors, from the 1.4m UK businesses we had information on, we randomly selected a sample of 400k businesses with address¹ information. Then we used the CH data to get the name, addresses and SIC codes for the companies.

Pre-processing & matching

From the CH dataset, we selected only active and non-dormant companies. At this point, both CH and the Glass business names were cleaned/normalised (e.g., punctuation, stop words, whitespaces, company type abbreviations, etc.). We performed the matching exercise of official data with web data using a fuzzy match on name and exact match on postcode. Since the addresses represented a significant metric in mapping, we excluded businesses with Registered Addresses different from the Trading Addresses. To do the name matching, we used multiple similarity/dissimilarity metrics. The best ones were Jaccard Index, Cosine Similarity and the overlapping number of words.
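The matching step can be sketched roughly as follows. This is an illustrative sketch rather than the pipeline used in the study: the stop-word and suffix lists, the `is_match` helper and the 0.8 threshold are all assumptions, since the text only names the metrics (Jaccard Index, Cosine Similarity and word overlap) and the exact postcode match.

```python
import math
import re
from collections import Counter

# Hypothetical lists: the actual stop words and company-type
# abbreviations removed in the study are not given in the text.
COMPANY_SUFFIXES = {"ltd", "limited", "plc", "llp", "inc"}
STOP_WORDS = {"the", "and", "of"}
DROPPED_TOKENS = COMPANY_SUFFIXES | STOP_WORDS


def normalise(name: str) -> list[str]:
    """Lower-case, replace punctuation with spaces, drop stop words and suffixes."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    return [t for t in tokens if t not in DROPPED_TOKENS]


def jaccard(a: list[str], b: list[str]) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a: list[str], b: list[str]) -> float:
    """Cosine similarity between the word-count vectors of two names."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_overlap(a: list[str], b: list[str]) -> int:
    """Number of distinct words the two names share."""
    return len(set(a) & set(b))


def is_match(ch_name: str, ch_postcode: str,
             web_name: str, web_postcode: str,
             threshold: float = 0.8) -> bool:
    """Exact match on postcode, fuzzy match on name (threshold is an assumption)."""
    if ch_postcode.replace(" ", "").upper() != web_postcode.replace(" ", "").upper():
        return False
    a, b = normalise(ch_name), normalise(web_name)
    return jaccard(a, b) >= threshold or cosine(a, b) >= threshold


print(is_match("Acme Widgets Ltd.", "sw1a 1aa", "The ACME Widgets Limited", "SW1A1AA"))
```

Checking the postcode first keeps the fuzzy comparison cheap, because only records that already share a postcode need their names scored.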

Glass to SIC results

This exercise resulted in 100k organisations² that were successfully matched. The top matched SIC codes had accurate equivalents in the Glass sector classification (Table 1).

Table 1. Top SIC (by matches) to five Glass sectors

Analysis of ‘Other’ SIC codes

Approximately 6% of all the SIC codes are labelled as ‘Other’. More strikingly, on the full CH data set³, the current SIC taxonomy fails to fully capture activity information for almost one-third of UK businesses (that is, approx. 30% of businesses in CH are classified as ‘Other’⁴). This is strong evidence that many registered UK companies have not chosen (or could not choose) an accurate SIC code and are therefore misclassified and misunderstood from a policy-making perspective.

In our analysis, we saw that the SIC code Other business support service activities had the most matches with the web data (18.14%; 5175 businesses) (Table 2). This SIC code, along with Other service activities was also the most diverse when it comes to sector coverage (comprising 103 out of 108 Glass sectors).

Table 2. Top ‘Other’ SIC codes

We further examined the top two ‘Other’ SIC codes. First, we looked at their sector distribution with regard to the Glass sectors, and second, we inspected the proportion of ‘Other’ within each sector. The top two SIC codes were the most ambiguous about company activities, even though they were part of clearly defined SIC sections⁵.

We also discovered that businesses performing ‘Staffing-related’ activities had the highest proportion (5.2%) in all the ‘Other’ SIC codes (Table 3). This could mean that this sector has one of the poorest SIC descriptions or it could mean that Staffing-related companies tend to perform cross-sector activities. We noticed a similar situation, but at a lower proportion with ‘Hospitals’. By contrast, companies specialising in ‘Jewellery and Wholesale’ accounted for the lowest share of ‘Other’. The top two ‘Other’ SIC codes had a slightly different sector distribution, with most companies in Financial Services and Professional Training sectors.

Table 3. Glass sectors with highest/lowest matches among ‘Other’ SIC

Another insight was that more than half of ‘Staffing’ and ‘Health & Wellness’ businesses classify themselves as ‘Other’ (Table 4). Why is this figure so high? This is an area of further research that could be addressed using additional data.

Table 4. Glass sectors with highest/lowest proportions of ‘Other’ SIC codes

Analysis of ‘None Supplied’ SIC codes

In CH, missing information on company activity is evidenced through the ‘None Supplied’ SIC codes. Choosing a SIC code at the moment a company is set up⁶ has been mandatory since 2016. Previously, the data was provided on the first annual return (now called the confirmation statement).

Based on our analysis, 5.7% of businesses registered in CH did not provide a SIC code. Among the businesses we matched with the web data, 3.2% had a ‘None Supplied’ SIC code. This could suggest that these businesses are less likely to have a web presence.

The sector ‘Law Practice and Services’ in Glass was the dominant sector among companies with a ‘None Supplied’ SIC code (Table 5). One possible explanation would be that law firms tend to be partnerships (i.e. not registered in CH) and that, as a result, there is no appropriate SIC code for law firms in the official taxonomy. We learned that this is not the case, as the ‘Legal and Accounting’ SIC code can capture the activities of law firms. We noticed that the top Glass sectors in the ‘None Supplied’ SIC category are professional services sectors. Certainly, with the use of text-rich company descriptions and topic data from business websites, we can get a better understanding of what these businesses actually do.

Table 5. ‘None Supplied’ SIC — Glass sector breakdown

Conclusions

This quick experiment in matching web data with official data shed some light on the kinds of activities that UK businesses in the ‘Other’ SIC codes are engaged in. Professional services businesses related to staffing and training seem to be the most poorly classified in Companies House. Also, we have learned that law, accounting and investment-related businesses do not always choose a descriptive SIC code. This in itself could be an interesting line of enquiry for another piece of research.

More detailed research could also be done with the UK Glass data. For example, we could look at the specific topics that companies use to describe their activities, we can help analysts categorise businesses that are active in various sectors and, as shown with several reports, the open web also allows us to better understand the sizes of emerging sectors that do not sit well within official statistics.

End notes

[1] We limited the number of addresses to ten per company.

[2] 96% accuracy.

[3] Active and Non-Dormant companies.

[4] SIC codes labelled ‘Other’ vary in how much industry information they capture. For example, Other service activities does not offer enough information on company activity, whereas Other manufacturing gives a clear indication of the industry.

[5] Each SIC code is part of a broader industry section. 82990 belongs to section N (Administrative And Support Service Activities) and 96090 is part of section S (Other Service Activities).

References

Bean, C. (2016). Independent Review of UK Economic Statistics. HM Treasury and Cabinet Office.

https://www.qualitycompanyformations.co.uk/blog/choosing-sic-code-limited-company/

A comparison of UK sectors based on web presence and official statistics

comparison_blog.jpg

Company websites have become an essential marketing tool to promote brands, products and services, to attract future employees, collaborate with partners, and to interact with current and potential customers. At Glass we believe that any organisation that “matters” in the economy is likely to have a website. Furthermore, we believe that company websites can provide useful clues around the sizes, the strategies, the networks, the sectors and the growth rates of companies, as this recent Office for National Statistics (ONS) post suggests.

In this blog post, we analyse the web presence of the different sectors in the UK economy and compare the results with the breakdown of sectors presented by official statistics.

The web is a digital copy of a large part of the economy but there are differences in the web representation of sectors when compared to official data.

Some sectors that have a large volume of companies in the official statistics seem to have a smaller presence on the web. This post aims to uncover some of the differences and suggest potential explanations.

UK businesses with a website

According to the UK Office for National Statistics, 45% of all the UK’s businesses have a website. This percentage seemed quite low, so we decided to investigate further. Our data scientists carried out some tests using Companies House data, and the results showed that approximately 30% of all UK registered companies have a website. Surprisingly, this is even lower than the figure provided by the ONS. This means that out of 4 million companies registered in Companies House, approximately 1.2 million (30% of the total) have a website. It is worth noting that not all UK “businesses” are listed in Companies House. The ONS Business Population Estimates (BPE) has a total of 5.7 million businesses, including the 4 million companies from Companies House and also sole traders, partnerships, and government organisations.

Based on these numbers, we have estimated there are approximately 1.7 million organisations in the UK with a website, 1.2 million of which are companies. The Glass AI engine currently reads and interprets the websites of 1.5 million UK businesses.
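The text does not spell out how the 1.7 million figure was reached. One reading that is consistent with the numbers above, and which is an assumption here rather than the stated method, is that the roughly 30% website rate measured on Companies House was applied to the wider BPE population:

```python
# Figures quoted in the text.
companies_house_total = 4_000_000   # companies registered in Companies House
bpe_total = 5_700_000               # ONS Business Population Estimates total
website_rate = 0.30                 # share of registered companies with a website

# 30% of registered companies.
companies_with_site = round(companies_house_total * website_rate)

# Assumption: the same 30% rate also holds for sole traders, partnerships
# and other organisations counted in the BPE but not in Companies House.
all_orgs_with_site = round(bpe_total * website_rate)

print(f"Companies with a website:     {companies_with_site:,}")
print(f"Organisations with a website: {all_orgs_with_site:,}")
```

On this reading the totals come to about 1.2 million companies and about 1.7 million organisations overall, matching the estimates quoted.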

The relationship between sector and web presence

Using the UK web data and the official data from Companies House, we followed several steps to understand the presence of different sectors online versus their share in official statistics:

First, we mapped the official SIC codes to the Glass sector taxonomy of 108 sectors. The SIC classification contains 732 codes which are part of broader industry sections. In our experiment, we managed to assign Glass sectors to most SIC codes. For each SIC, the most representative Glass sector was the one with the maximum number of matches.
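A minimal sketch of this "maximum number of matches" rule is shown below. The function name, the input format (one `(SIC code, Glass sector)` pair per matched business) and the example pairs are assumptions made for illustration, and "IT Services" and "Management Consulting" are hypothetical sector labels.

```python
from collections import Counter, defaultdict


def most_representative_sector(matched_pairs):
    """Map each SIC code to the Glass sector it was matched with most often.

    Ties go to the sector that was seen first for that SIC code.
    """
    counts = defaultdict(Counter)
    for sic_code, glass_sector in matched_pairs:
        counts[sic_code][glass_sector] += 1
    return {sic: c.most_common(1)[0][0] for sic, c in counts.items()}


# Illustrative pairs only, not data from the study.
pairs = [
    ("62020", "IT Services"),
    ("62020", "IT Services"),
    ("62020", "Management Consulting"),
    ("41100", "Construction"),
]

mapping = most_representative_sector(pairs)
print(mapping)
```

A rule like this keeps exactly one Glass sector per SIC code, so it discards minority matches such as the single "Management Consulting" pair above.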

In a second step, we decided to exclude from the Glass and Companies House datasets those organisations that are outside the private sector. Also, we excluded sectors that were poorly represented, including:

  • Governmental organisations;

  • Charities and Foundations;

  • Schools and Universities;

  • Sectors with insufficient information.

Third, after filtering and assigning the different sectors, the Glass to SIC mapping covered 85 Glass sectors (out of 108). Depending on the activity, each sector had a different number of corresponding SIC codes (see examples in Table 1). Food and Beverages had the highest number of mapped SIC codes and covered activities related to production, distribution and services in food and beverages. At the other end, the sector Libraries had only one SIC code assigned, as it refers to a very specific activity.

Table 1. Example of Glass sectors mapped to SIC codes

As mentioned earlier, the Companies House dataset does not cover the entire UK economy: it contains information on about 4 million UK businesses, of which a sizeable proportion are dormant. At this point, for each group of SIC codes, we determined the number of active, non-dormant registered businesses and calculated the relative proportion of each group in Companies House. A similar approach was used to determine the share of sectors based on the UK web data.
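The share calculation can be sketched as follows. This is an illustration under stated assumptions rather than our production code: the record fields (`sic_code`, `dormant`), the function name and the sample records are hypothetical, and the SIC-to-sector mapping is passed in as a plain dictionary.

```python
from collections import Counter


def sector_shares(companies, sic_to_sector):
    """Return each sector's share (in %) of active, mapped companies."""
    counts = Counter()
    for company in companies:
        if company["dormant"]:
            continue  # dormant companies are left out of the totals
        sector = sic_to_sector.get(company["sic_code"])
        if sector is None:
            continue  # SIC codes without a mapped sector are skipped
        counts[sector] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sector: 100.0 * n / total for sector, n in counts.items()}


# Made-up records with placeholder SIC codes.
sic_to_sector = {"SIC_A": "Construction", "SIC_B": "Libraries"}
companies = [
    {"sic_code": "SIC_A", "dormant": False},
    {"sic_code": "SIC_A", "dormant": False},
    {"sic_code": "SIC_A", "dormant": False},
    {"sic_code": "SIC_A", "dormant": True},
    {"sic_code": "SIC_B", "dormant": False},
    {"sic_code": "SIC_C", "dormant": False},
]
shares = sector_shares(companies, sic_to_sector)
```

Because dormant and unmapped companies are dropped before the total is taken, the shares always sum to 100% of the companies that remain.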

The final step was to compare the UK web with the Companies House sector breakdown. We aimed to quantify the differences and identify which sectors seem to be overrepresented or underrepresented on the web. We also performed a geographical comparison for the UK. As seen in Figure 1, each UK region has a slightly different representation on the web by volume (compared to official statistics).
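One simple way to express this comparison is to subtract each sector's Companies House share from its web share. The sketch below assumes both inputs are dictionaries of percentage shares keyed by sector; the function name and the figures are made up for illustration and are not results from our analysis.

```python
def share_differences(web_shares, official_shares):
    """Return (sector, web share minus official share) pairs, lowest first.

    A positive value means the sector has a larger share on the web than in
    the official data; a negative value means the opposite.
    """
    sectors = set(web_shares) | set(official_shares)
    diffs = {
        sector: web_shares.get(sector, 0.0) - official_shares.get(sector, 0.0)
        for sector in sectors
    }
    return sorted(diffs.items(), key=lambda item: item[1])


# Hypothetical shares in percent.
web = {"Construction": 4.0, "Travel": 3.0}
official = {"Construction": 9.0, "Travel": 1.0, "Consulting": 2.0}
ranked = share_differences(web, official)
```

Sorting by the difference puts the sectors most under-represented on the web at the top of the list and the most over-represented at the bottom.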

Figure 1. Web versus Companies House

Results of the sectors analysis

Several factors can influence the representation of companies and sectors on the web and official statistics:

  • As we know, not all officially registered companies have a website;

  • Companies from specific sectors are more likely to have a website;

  • Some UK regions and counties may have a different business presence on the web. This can be influenced by the regional industry composition or policies of regional and local governments. Geographical (Regional) comparison is a very interesting research topic which we aim to address in a future analysis.

You can view the data used for our analysis here.

Sectors with the highest share

The Glass sectors Real Estate & Property Management and Construction were the top sectors by volume in the official statistics. This is not surprising as the 2017 BPE figures showed Construction as the top industry by the number of UK private businesses. We noticed that the proportions for these two sectors were quite high (> 8.5%) given that the overall Companies House sector distribution is highly varied.

The analysis of the web data, however, showed a different behaviour: we saw lower proportions for the dominant sectors and less extreme values. Hospitality and Restaurants (4.3%) was the top sector, closely followed by Construction with 4.2% of the total number of businesses.

Table 2. Sectors with the highest share by volume (representation)

Our initial conclusion was that the web and official data seemed to show different sector breakdowns. One potential reason for the difference is that not all registered businesses have a website, and this is particularly common in some specific sectors. The next part of our analysis tried to identify the reasons for such differences by looking at the gaps between sector shares.

Sector representation on the web

More than three-quarters of the Glass sectors (78%) had a higher web presence than expected based on the official statistics, and 19% of the sectors had a lower web presence. Sectors with a higher than expected web presence had, on average, a web share 0.5 percentage points above their share in the official data, while for sectors with a bigger Companies House share, the average gap was 1.4 percentage points.
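Summary figures of this kind can be derived from the per-sector differences (web share minus Companies House share). The sketch below is an assumption about how such a summary could be computed, not a record of how we produced the numbers above; the function name and sample values are hypothetical.

```python
from statistics import mean


def summarise_differences(diffs):
    """Summarise per-sector differences (web share minus official share)."""
    over = [d for d in diffs.values() if d > 0]    # bigger share on the web
    under = [-d for d in diffs.values() if d < 0]  # bigger official share
    n = len(diffs)
    return {
        "pct_sectors_over": 100.0 * len(over) / n,
        "pct_sectors_under": 100.0 * len(under) / n,
        "mean_gap_over": mean(over) if over else 0.0,
        "mean_gap_under": mean(under) if under else 0.0,
    }


# Hypothetical differences in percentage points.
sample_diffs = {"A": 0.5, "B": 1.5, "C": -2.0, "D": 0.0}
summary = summarise_differences(sample_diffs)
```

In this sketch, a sector with no difference falls into neither group, so the two percentages do not have to add up to 100.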

The share of Health, Wellness and Fitness businesses was 2.47 percentage points higher on the web than in Companies House, while the share of Real Estate and Property Management was 6.62 percentage points higher in the official statistics.

Could it be that many Construction, Real Estate and Consulting businesses choose not to have a website? It may be that most companies in these sectors are too small to have a website or that the nature of the sector means that businesses do not necessarily need a website. Another hypothesis is that many individuals in these sectors may choose to set up limited companies to offer their services in a more tax-efficient way compared to self-employment.

Table 3. Differences between sector shares

In addition, we saw some interesting patterns when we further inspected which sectors are overrepresented in Companies House and on the web. These sectors could be clustered around two types of activities:

1. Sectors under-represented on the web that can have lots of micro-businesses (e.g. Consulting, Construction) or that focus on professional services activities (where people are the “product/service”). According to the BPE (2017), 83% of Construction businesses are “sole proprietorships and partnerships with only a self-employed owner-manager and companies with one employee, assumed to be an employee director”. We could argue that for small and micro businesses in the construction sector, having a website may not add that much value. We also noticed that the Retail and Food and Beverages sectors had a higher share in Companies House compared to their share of the UK web. This could be related to Companies House including many local businesses and corner shops registered as companies, which, again, might not necessarily need a website.

2. Sectors over-represented on the web, for example, sectors in leisure-related activities (e.g. Wellness & Fitness, Travel, Photography). These sectors are more outward-facing, possibly aiming for broader audiences. We suspect these sectors may have many sole traders that are not necessarily registered as a business in Companies House. For these types of businesses, having an online presence brings a significant advantage.

Conclusion

This sector comparison of official data with web data was another attempt to understand and gain insights into the UK economy with web data. We discovered that some sectors with a large share in the official statistics had a smaller presence on the web. At the same time, other sectors seemed to be overrepresented on the web, probably due to the importance of having a website for their economic activity. This was our first attempt to look into this area.

Our next blog post will try to identify the top companies in the UK based on their web presence. In other words, we will use the open web data as a proxy for estimating the size and importance of UK companies.

Flying High, a huge study of the UK drones industry with Glass data

flying_high_blog.jpg

This week has seen the launch of a huge study from Nesta, in partnership with Innovate UK, mapping the UK Drones industry with Glass data. The 225-page report explores how the UK can become a world leader in drone technology. It also outlines some of the challenges facing urban implementations and makes some policy recommendations.

How did Glass help?

We were introduced to the team at the Nesta Challenge Prize Centre. They needed help trying to identify companies already operating in the drones sector and UK universities with research strengths in the area. Glass has mapped the entire UK economy based on its web presence, so using the product (currently in private Beta) we were able to quickly find 700+ relevant organisations in the UK. You can see the results in this excellent interactive map.

Usually to produce this type of research, data scientists and market analysts have to spend time on Google or rely on official data. They may also have to pay substantial amounts of money in database subscriptions or to buy reports. With Glass, it was possible to map and gather knowledge about the UK drones ecosystem in a few minutes. It didn’t take days or weeks to complete.

We believe that the web, the largest source of knowledge ever created, can provide a lot of insights into markets, emerging themes, economic activity and society. This report is another example of what’s possible with open web data and the product we are building. Stay tuned, we are just getting started!

Digital Catapult selects Glass for their Machine Intelligence Garage

digitalcatapult_blog.jpg

We are pleased to announce that Glass has been selected to take part in the Machine Intelligence Garage, a programme launched by Digital Catapult to support the UK’s role as a global centre for artificial intelligence development.

The Garage is designed to help promising AI startups that have a well-defined business idea and technical capability but for whom access to computation power is a barrier to growth. As a young technology startup, we at Glass are limited by the amount of computation power that we can use. Participation in the Machine Intelligence Garage will significantly increase the amount of open web data that we can read to train and improve our intelligent crawler.

Applications were assessed based on the strength of the idea submitted and technical implementation plan, availability of data, and the immediacy of the need for computation power. The companies’ ethical use of data was also paramount.

We continue in our mission to read the web to create an unparalleled resource for the world’s researchers.

Creative Nation, a new study using structured web data from Glass

Nesta, a global innovation foundation, has launched a new report that combines official and open web data to map the creative industries in the UK. The study was produced in collaboration with the Creative Industries Council.

The report highlights that creative industries are driving economic growth across the UK and are on track to create one million new creative industries jobs between 2013 and 2030.

We helped Nesta map the scale of the creative industries across the UK. It’s very hard to measure the state of new sectors or emerging fields. Official industry codes that are used to measure the economy fail to capture new sectors, so official data is not very useful for measuring them. This results in a lack of evidence that hinders policy-making and the market research efforts of companies.

Nesta used Glass because our intelligent crawler has digitally mapped (for the first time) the UK economy, tracking any topic of interest across hundreds of millions of web pages, watching over a million organisations. This new resource that we are building allowed Nesta to identify businesses engaged in ‘creative’ activities and businesses in the UK ‘creative economy’.

Juan Mateos-Garcia, Director of Innovation Mapping at Nesta, said: “With Glass, finding relevant companies for our research took no time at all. We could not have achieved this without their ability to read the web at scale”.

Launch of a new report using Glass: the Immersive Economy in the UK

immersive_blog.jpg

A new report has been published using structured web data from Glass. The study is the first of its kind to map the UK immersive sector. Commissioned by Immerse UK with funding from Innovate UK, the report provides hard data about the size of the sector, its performance, its geography, the drivers of success and the barriers to growth. It has identified and mapped the organisations developing and applying these exciting new technologies in the UK. The sector is growing rapidly, with 1,000 specialist companies and an expected turnover that could reach £1bn this year.

The report is another great example of what’s possible with open web data. With Glass, finding relevant companies for the research took no time at all. The analysis was then prepared by Nesta, a global innovation foundation. MTM London, a research consultancy, also conducted a business survey and in-depth interviews as part of the report.

Here is an excellent read from Nesta where they explain how the report came together (and the role Glass played in it). It highlights a new exciting approach to market research and innovation mapping using web data and machine learning techniques.