Traditional company databases and contact lists have been the mainstay of business and economic research since before the age of the Internet. Whether for market research, policy development or sales and marketing purposes, the company databases and the businesses that provide them have been king. Over time these databases have grown to include a variety of modern developments in business analytics, such as firmographics, demographics, funding data, technographics, intent data and alerts, producing a veritable word soup of complementary datasets and competing technologies. However, these incremental changes have not delivered an escape from the constraints of slowly moving, out-of-date, fixed views of businesses and their activities.

Given the volume and depth of information online, it’s about time the fundamentals of business and economic research took advantage of this scale of content efficiently and accurately. While there are smaller niche datasets, including internal company lists, that are manually kept up to date using the Internet and other open sources, the general problem of supporting up-to-date, unconstrained business research, and the range of unique economic queries that entails, can only be solved with technology.

To address this challenge, at glass.ai we have developed AI that can understand language at scale and we have applied this technology to turn the unstructured content found across the open web into structured business datasets and analytics. This means that our clients and partners do not need to keep their imagination in check as they are not restricted to a static view of the world with a fixed set of attributes. They can ask any questions they want to ask to efficiently support their sector research or campaign strategy.

Limited by a single source

While there are a vast number of company databases to choose from, the majority of them started from the same baseline. Certainly, that is the case for the very large general business databases, which started life from official government sources that maintain lists of active businesses in their jurisdiction. Particular company databases may extend this core in different ways, maybe adding other official data (e.g. revenues), links to websites or social media profiles, finding relevant descriptions, or enriching with people and contact information. Unfortunately, without appropriate technology, the entity disambiguation and reconciliation needed to accurately match these attributes from different sources can leave a lot to be desired. We’ve seen cases where 1 in 5 records have been poorly matched, leading to many inaccuracies and misrepresentations. Even in cases where the matches have been performed well, keeping the data in line with changes is a challenge. The official source can be refreshed, but any additional attributes are poorly maintained and quickly become out of date. As time goes on, many company databases become clogged up with dead and inactive businesses. Apart from any issues with the data, because these databases are all drawn from the same source, everyone has access to and is using the same data. The data is not a differentiator for either the company selling the data or any of the organisations that use it.
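To make the matching problem concrete, here is a minimal sketch of a naive way to reconcile a company name from one source against candidate names from another. It is an illustration only, not a description of how any particular database or our own technology performs matching; the function names, the list of legal suffixes and the 0.85 threshold are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "inc", "llc", "corp", "corporation"}


def normalise(name: str) -> str:
    """Lower-case a name, strip punctuation and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate name, or None if none is close enough.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    target = normalise(name)
    scored = [
        (SequenceMatcher(None, target, normalise(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


if __name__ == "__main__":
    registry = ["ACME Widgets Limited", "Zenith Holdings PLC"]
    print(best_match("Acme Widgets Ltd.", registry))
    print(best_match("Northern Bakery Inc", registry))
```

String similarity on names alone can still pair up different businesses that happen to have similar names, which is one way poorly matched records arise. Reliable reconciliation generally needs further evidence, such as addresses, websites or descriptions of what the business does.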

Using AI technology to build bespoke up-to-date datasets from the open web can provide the unique insights organisations need. For example, a traditional company database might contain a list of addresses that a business operates from, but maybe you want to find countries that it is likely to expand into. This is a challenge that we were set by a client who wanted to uncover potential inward investment opportunities. News articles that talk about expansion might be a source for this, but by the time these are published key decisions have probably already been made. We considered expansion news, but we were also able to uncover countries in which businesses had projects, customers, partnerships or other parts of their supply chain. These connections make it more likely that they would expand into the region with a physical presence, and so present a better inward investment opportunity.

Big isn’t better

It seems like a contradiction to talk about sourcing business research directly from the web and then say big isn’t better, but here we’re talking about the output, not the input! Most company databases suffer from poor categorisation. This can be due to inaccuracies or, more likely, to very coarse-grained or fixed sets of definitions. This means that when looking for a specific type of business, the query will need to be mapped onto broader and ill-matched definitions, resulting in many results outside the scope of the research. The worst offenders here are those dataset providers based on official taxonomies, such as NAICS codes in the US or SIC codes in the UK. Even databases that have more refined taxonomies often fail to reflect the precise nature of the types of businesses that need to be targeted. For example, suppose you want to map the solar panel manufacturing sector. There’s no code in official taxonomies for that; the closest you might get is electrical equipment manufacturers. Even in company databases with finer-grained categories, where you might get down to renewable energy companies or even the solar power industry, the category will be filled with all the other components of the industry: installers, generators, specialist recruiters, etc. Targeting businesses with a high level of relevance is challenging.

Fine-grained categorisation is only possible in general if you are defining the taxonomy on the fly, gathering the necessary information to inform the categorisation, and have the technology necessary to interpret the content to match the target requirements. Using traditional categorisations, a client of ours was unable to break down its customers below the level of the AEC (Architecture, Engineering and Construction) industry. They needed to understand which segment of the industry their customers operated in. Did they design roads or bridges, were they building contractors or property developers, did they sell or rent properties? By deep-reading content from each of their customers’ websites, looking at project pages, service offerings and news articles, our technology was able to build a comprehensive categorisation of each business, giving our client a much better understanding of their customer base to improve retention and upsell opportunities.

Company descriptions don’t tell the full story (part 1)

Company databases often incorporate a description of the company, either drawn from some external source or provided by the company owner. This can support search and keyword matching and more advanced semantic search methods for identifying relevant results. However, without more sophisticated language understanding these methods can hit false positive rates of up to 50% in fairly straightforward cases. For example, if you are looking for developers of immersive technologies then your go-to keywords are going to start with “virtual reality” and “augmented reality” (which are good distinctive search criteria). However, without building large panels of positive and negative context words you are going to get creators of immersive content, VR gaming venues, sellers of VR headsets and many other businesses where immersive technology is key to what they do but they are not developers. Even if you can get to the perfect match on the descriptions, you will still be missing large numbers of significant businesses. Large companies like Google and Infosys undertake many immersive development projects but it is not something that appears in high-level descriptions of their businesses.
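The keyword approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any real database or our own technology works: the topic terms come from the example above, while the context word panels and the `classify` function are hypothetical and far smaller than anything usable in practice.

```python
# Topic terms from the immersive technology example.
TOPIC_TERMS = ["virtual reality", "augmented reality"]

# Hypothetical context panels, invented purely for illustration.
POSITIVE_CONTEXT = ["develop", "build", "engineer", "platform", "sdk"]
NEGATIVE_CONTEXT = ["venue", "arcade", "retailer", "headsets for sale", "content studio"]


def classify(description: str) -> str:
    """Label a description as 'match', 'rejected' or 'no_topic'.

    A description is a 'match' only if it mentions a topic term and
    contains more positive than negative context words.
    """
    text = description.lower()
    if not any(term in text for term in TOPIC_TERMS):
        return "no_topic"
    positives = sum(word in text for word in POSITIVE_CONTEXT)
    negatives = sum(word in text for word in NEGATIVE_CONTEXT)
    return "match" if positives > negatives else "rejected"


if __name__ == "__main__":
    examples = [
        "We develop augmented reality software for industrial training.",
        "A virtual reality gaming venue and arcade in Leeds.",
        "We build marketing campaigns using virtual reality content.",
        "A global technology company offering search and advertising.",
    ]
    for description in examples:
        print(classify(description), "|", description)
```

In this sketch the third example is labelled a match even though it describes a content creator rather than a developer, and the fourth is never considered because the high-level description does not mention the topic at all. Both failure modes are the ones discussed above.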

A description cannot tell the full story; we need the capability to intelligently crawl for other significant content. One sector mapping exercise that we’ve undertaken was to map the artificial intelligence sector. The client was interested in discovering which companies had a significant reliance on AI. As with the immersive technology example above, it is challenging but not impossible to find specialist AI organisations. However, AI is a cross-cutting technology. Not only is there the problem presented by larger companies, but smaller businesses will also have AI developers who help deliver their products or services even though AI is not core to their mission. As an illustration of both these problems, BT is a well-known telecommunications company but has several hundred employees involved in artificial intelligence-related activities. The economic impact of this is likely larger than that of the majority of specialist AI firms, which are going to have smaller development teams. Getting a deeper understanding of what activities a company is involved in requires looking more broadly across the available open web content; otherwise only a partial picture of a sector or set of targets is going to be possible.

Company descriptions don’t tell the full story (part 2)

Using descriptions to identify businesses of interest assumes that you’re interested in the company's core activities. We’ve already seen that this is not going to be the case when you need to consider all the activities that the company is involved in. However, maybe you’re not interested in what they do, or at least that is only one part of the story. For example, you might want to know if they’ve adopted a specific technology (e.g. EasyJet using augmented reality to train pilots) or if a particular event has occurred (e.g. Dyson opening a new factory). This involves regularly tracking press releases, case studies, blogs and other sources to identify relevant articles. Companies will often talk about positive news on their website; however, negative news, e.g. redundancies or legal issues, can be harder to find. Think back to the earlier inward investment example: finding negative signals might be a good reason to ignore some of the opportunities identified. In these cases, you have to start looking beyond the business's website and into third-party news sources. This becomes more challenging because, as well as identifying relevant articles, you now have to resolve any mentions of companies within the article to make sure they are matched to the correct entity. While small datasets might be able to be monitored manually or with low-tech tools like Google Alerts, doing this for any moderate-sized list requires technology.

Monitoring the activities of companies in terms of the actions they take can be critical to choosing the right time to approach them. For example, a global financial services company wanted to improve their sales success rate. They had a size of business and set of industries in mind, but they were getting very low success rates using traditional lead generation tools. We proposed monitoring a broad range of growth signals across a large set of their target companies — in the end over 150,000 different businesses. The growth signals included employee growth, job listings, opening new facilities, product launches, awards, funding activity, supply chain information and many other areas. This involved monitoring all mentions of the companies on the web and interpreting the context in which they were mentioned to see if they matched any of the growth signals. This required a deep understanding of the content to ensure false positives weren’t going to impact the results. When the client tested these signals against traditional methods they saw a significant increase in the sales success rate for businesses indicating growth through these signals.

What about you?

All of the above has focused on the businesses that you are trying to identify, whether part of a sales or marketing campaign, sector research exercise, economic analysis or policy development. You need a capability that can take your specific unique research requirements and match them to the unique set of businesses that you want to research or target. No two datasets are going to be the same, even when they seem to be targeting the same companies. We’ve done several projects where the client has been targeting the same sector but, because of their view of how the sector should be composed and the characteristics of their own business or the specifics of how the data would be used, the datasets produced have contained overlapping but different outputs, both in terms of the companies in the list and the attributes they wanted to target.

For example, we’ve mapped the Medical Devices sector as part of regional economic studies we have performed. However, we had a client who also needed to map the Medical Devices sector as they had a product that would help these companies manage all the necessary compliance requirements and documentation. As you can imagine, there is quite a commitment needed on the compliance side when developing solutions in healthcare. However, not all medical devices have the same level of requirements. Our client was only interested in the companies whose products had to follow the most stringent rules and so were more likely to benefit from the products that were on offer. By building language models of descriptions of these classes of medical devices and cross-referencing to open sources such as the FDA, we were able to build a medical device company dataset that had a better chance of success when offered these compliance management tools.

As all our examples have shown, the devil is in the details. Using current company databases delivers generic, poorly targeted datasets which need a lot of manual attention to support a policy study or economic research, or will lead to poor success rates in sales and marketing situations. Switching to live datasets, built using AI to exploit the deep, rich content available on the web, allows precise targeting of your requirements to build your unique data and deliver results. Traditional company databases are not dead yet, but their days are numbered. AI is already disrupting this industry.

The Days of Company Databases Are Numbered.
