Datacie, the leading database creation company, has partnered with IEX Cloud, one of the fastest-growing financial data infrastructures across the industry. IEX Cloud's tens of thousands of financial data users can now effortlessly access historical employee headcount data covering IEX Cloud's U.S. symbols universe.

"To win in the marketplace, you must first win in the workplace."

This quote from Doug Conant, the former President and CEO of the Campbell Soup Company (number of employees: 14,500 as of August 2, 2020), is no secret to anyone. Theory at the micro and macro levels predicts that companies that attract and retain the best talent achieve significantly better firm-level performance. The theory is backed by many empirical studies [1, 2, 3, 4] that consistently show that human capital relates strongly to company performance in the stock market.

According to the OECD [5], employee count is one of the most significant indicators of a company's health and growth. Steady year-over-year growth in employee count is a good sign that business is on the upswing. Employee count is also a good proxy for a company's bandwidth: how many products can it develop and maintain? How many customers can it acquire and serve? How many big contracts and partnerships can it land? The answers depend strongly on the company's human capacity.

Given its numerous usage possibilities, employee count data is highly prized in the financial community. For instance, it can be used for:

  • Alpha Generation: Investors use employee count growth to identify the fastest-growing companies and generate alpha.
  • Company Valuation: Investors estimate a company's value based on factors such as the company's number of employees, free cash flows, or dividends.
  • Data Comparability: Employee count is a well-known metric for normalizing financial data and making it more comparable across time and entities, as in revenue-per-employee or assets-per-employee. It also plays a central role in quantitative modeling: scaling or normalizing inputs often improves the numerical stability and interpretability of linear models.
  • Market Analysis: Investors use employee count data to filter or segment small and medium-sized from large companies and compare/analyze them accordingly.
  • Portfolio and Risk Management: Employee count data is commonly used to hedge or concentrate a portfolio's holdings with regard to human capital.

This data is also frequently used in sales for targeting purposes, in human resources for headcount demand/supply forecasting, and in many other applications.
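The growth and normalization use cases listed above can be sketched in a few lines of Python. This is a minimal illustration, not IEX Cloud's or Datacie's implementation: the `YearRecord` type, its field names, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class YearRecord:
    """One fiscal year of data for a company (hypothetical record layout)."""
    year: int
    employees: int
    revenue: float  # in a single currency, e.g. USD


def headcount_growth(records: List[YearRecord]) -> Dict[int, float]:
    """Return year-over-year headcount growth, keyed by the later year.

    Only consecutive years are compared, and a year is skipped when the
    previous headcount is zero, because growth is undefined in that case.
    """
    ordered = sorted(records, key=lambda r: r.year)
    growth = {}
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.year == prev.year + 1 and prev.employees > 0:
            growth[curr.year] = curr.employees / prev.employees - 1.0
    return growth


def revenue_per_employee(record: YearRecord) -> Optional[float]:
    """Normalize revenue by headcount; None when no employees are reported."""
    if record.employees <= 0:
        return None
    return record.revenue / record.employees
```

Calling `headcount_growth` on a company's yearly records gives a series that can be ranked across companies, while `revenue_per_employee` puts companies of very different sizes on a comparable scale.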

Despite its many use cases, employee count data is not widely available, and for good reasons: it is challenging to acquire, normalize, and keep track of. It is challenging to acquire because finding an employee count in corporate disclosures that run to hundreds of pages is like looking for a needle in a haystack. It is difficult to normalize because it is not subject to any reporting standard: companies can disclose this information in whatever format and layout they want. Lastly, it is work-intensive to keep track of because headcount is dynamic and changes rapidly over time.

For these reasons, some data providers take the convenient route of capturing employee count data from LinkedIn corporate pages. Although that data is relatively easy to scrape, the resulting employee counts are inaccurate and inconsistent: for instance, Apple officially reported approximately 147,000 full-time equivalent employees as of September 26, 2020, whereas the company's LinkedIn page indicated that over 250,000 people were working for the company on that same day.

Committed to delivering the industry's highest data quality to its clients, IEX Cloud decided to enter into a strategic partnership with Datacie to provide accurate and timely employee count information in both its Core Data and Premium Data offerings. The rest of this article details how Datacie leveraged its database creation technology to create one of the world's largest employee count databases in just a few weeks.

Extracting employee count from tens of thousands of corporate filings

Acquiring Unstructured Data

The journey to build the employee count database starts with acquiring all the documents and other unstructured data sources that contain relevant information.

For U.S. companies, the employee count information can be found in SEC filings thanks to Regulation S-K, prescribed under the U.S. Securities Act of 1933. For international companies, the employee count information is usually found in annual reports, CSR documents, press releases, and other documents that public companies disclose to investors.

To acquire these documents, Datacie has developed proprietary scraping and website monitoring capabilities to track and acquire the raw data from the web minutes after such documents are made publicly available by reporting entities or regulatory agencies.

Automated Data Extraction

After the raw data is acquired, the data extraction process begins. Datacie leverages state-of-the-art technologies to identify and extract precise data points among terabytes of unstructured content. The data extraction process essentially boils down to a series of interdependent deep learning models rooted in Computer Vision and Natural Language Processing. Each algorithm in the data extraction pipeline is trained to perform one precise task (for example, language detection, layout segmentation, tabular data detection, or information retrieval), and each step's output is the input of the following step. The last stage of the extraction pipeline consists of encoder-decoder neural networks trained to tag and classify employee-related data points from relevant sentences, data tables, and figures.
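The chaining described above, where each step's output feeds the next step, can be sketched as a simple function pipeline. This sketch only illustrates the control flow: the stage names are hypothetical, and trivial rules (a fixed language tag, blank-line splitting, a keyword filter) stand in for the trained models, whose details are not published here.

```python
from functools import reduce
from typing import Callable, Dict, List

# A stage takes the document state and returns an enriched copy of it.
Stage = Callable[[Dict], Dict]


def detect_language(doc: Dict) -> Dict:
    # Placeholder for a language-detection model: always tags "en".
    return {**doc, "language": "en"}


def segment_layout(doc: Dict) -> Dict:
    # Placeholder for layout segmentation: split on blank lines.
    blocks = [b.strip() for b in doc["text"].split("\n\n") if b.strip()]
    return {**doc, "blocks": blocks}


def retrieve_candidates(doc: Dict) -> Dict:
    # Placeholder for information retrieval: keep blocks mentioning employees.
    hits = [b for b in doc["blocks"] if "employee" in b.lower()]
    return {**doc, "candidates": hits}


def run_pipeline(doc: Dict, stages: List[Stage]) -> Dict:
    """Apply the stages in order, passing each output to the next stage."""
    return reduce(lambda state, stage: stage(state), stages, doc)


if __name__ == "__main__":
    raw = {"text": "We sell soup.\n\nWe had 120 employees at year end."}
    result = run_pipeline(raw, [detect_language, segment_layout, retrieve_candidates])
    print(result["candidates"])
```

Keeping every stage behind the same signature means a stage can be retrained or replaced without touching the rest of the chain.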

Human-in-the-loop Quality Assurance

What makes Datacie's technology truly unique is its human-in-the-loop architecture: every prediction made at any step of the data extraction pipeline is associated with various quality indicators. Whenever these confidence scores fall below certain thresholds, the observations are automatically sent to qualified human annotators who validate or reject the algorithms' predictions. This feedback loop allows Datacie to improve its algorithms' performance continuously and, most importantly, supports the company's commitment to creating error-free data products.

Datacie's technological goal is not to achieve 100% automation; instead, it is to detect and differentiate where human intelligence is needed from where it is not.
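The routing rule behind this idea can be sketched as follows. This is a simplified illustration under stated assumptions: a single confidence score per prediction and a single threshold of 0.9, both hypothetical, whereas the pipeline described above uses several quality indicators and thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """A model output with its confidence (hypothetical record layout)."""
    document_id: str
    employee_count: int
    confidence: float  # between 0.0 and 1.0


def route(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (auto_accepted, needs_human_review).

    Predictions whose confidence is below the threshold go to annotators;
    the others continue through the pipeline automatically.
    """
    accepted: List[Prediction] = []
    review: List[Prediction] = []
    for p in predictions:
        if p.confidence < threshold:
            review.append(p)
        else:
            accepted.append(p)
    return accepted, review
```

Raising the threshold sends more observations to annotators and lowers the risk of errors slipping through; lowering it does the opposite, which is the trade-off between automation and human review described above.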

Employee count extraction is particularly challenging to automate because of the lack of standardization in the reporting process. Around 60 to 70% of companies report their total number of employees, their total number of full-time/part-time employees, or their total number of full-time equivalents; for these companies, deriving the average full-time equivalents figure is straightforward. However, the remaining 30 to 40% present a large variety of edge cases that need additional consideration. To name a few:

  • How to account for contractors, seasonal, temporary, at-will, hourly, or leased employees?
  • How to detect companies that report their employee count under each of their reporting segments?
  • How to handle companies that report having no employees?
  • How to consider subsidiaries, affiliates, parents, or joint venture-related personnel?

Datacie's team keeps tight quality assurance thresholds, so tens of thousands of documents had to be reviewed by human annotators, who were asked to follow a strict annotation guideline to the letter. To achieve consistent data entries across annotators, the team ensured that each employee count data point was reviewed by at least two different people before continuing its journey through the data extraction pipeline.
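A two-reviewer rule like the one above can be sketched as a small reconciliation step. The rule used here, which approves a data point only when at least two distinct annotators entered the same value and escalates everything else, is an assumption for illustration; the article does not describe how disagreements are actually resolved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (data_point_id, annotator_id, annotated_employee_count)
Annotation = Tuple[str, str, int]


def reconcile(
    annotations: List[Annotation], min_reviews: int = 2
) -> Tuple[Dict[str, int], List[str]]:
    """Return (approved values by data point, data points to escalate)."""
    by_point: Dict[str, Dict[str, int]] = defaultdict(dict)
    for point_id, annotator_id, value in annotations:
        # Keyed by annotator, so the same person cannot count twice.
        by_point[point_id][annotator_id] = value

    approved: Dict[str, int] = {}
    escalated: List[str] = []
    for point_id, votes in by_point.items():
        distinct_values = set(votes.values())
        if len(votes) >= min_reviews and len(distinct_values) == 1:
            approved[point_id] = distinct_values.pop()
        else:
            # Too few reviewers, or reviewers disagree.
            escalated.append(point_id)
    return approved, sorted(escalated)
```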

Full Database Audit

After its initial creation, the entire dataset was audited and evaluated for trustworthiness. Every employee count passed through hundreds of quality checks that automatically identify outliers and potentially misreported observations. Additional manual checks were performed on low-confidence data points to ensure that the final employee count database is free from poor-quality observations.
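One kind of automated check that fits this description flags implausible jumps between consecutive reported values, which often point to a unit error (for example, a count reported in thousands). The sketch below is a single hypothetical check, with an assumed maximum ratio of 3x; it is not one of Datacie's actual quality checks.

```python
from typing import List, Tuple


def flag_suspicious_jumps(
    series: List[Tuple[int, int]], max_ratio: float = 3.0
) -> List[int]:
    """Return the years whose headcount looks inconsistent with the prior one.

    `series` holds (year, employee_count) pairs. A year is flagged when the
    count changed by more than `max_ratio` in either direction, or when either
    of the two compared counts is not positive (no ratio can be computed, so
    the observation is left to a manual check).
    """
    ordered = sorted(series)
    flagged: List[int] = []
    for (_, prev), (year, curr) in zip(ordered, ordered[1:]):
        if prev <= 0 or curr <= 0:
            flagged.append(year)
            continue
        ratio = max(curr / prev, prev / curr)
        if ratio > max_ratio:
            flagged.append(year)
    return flagged
```

Note that a single bad value flags two years, the drop and the rebound, so flagged observations are candidates for manual review and not confirmed errors.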

Today, the complete employee count database is available to all IEX Cloud users in two places:

- Most Recently Reported Employee Headcount (Updated Monthly): This provides the most recently reported data on employee headcounts for U.S. companies. Available with free IEX Cloud plans on the Company endpoint, via the “Employees” field.

- Most Recently Reported Employee Headcount (Updated Daily) Plus 10 Years History: Get historical employee headcount data for the past 10 years – ideal for tracking trajectories and trends with the most up-to-date information. Available as an add-on with paid IEX Cloud plans as a Premium Dataset.

Let us know how this data brings you value. The IEX Cloud and Datacie teams are also looking to expand the scope of their partnership with new data projects; send us a note at info@datacie.com with any questions, requests, or comments!

About IEX Cloud

IEX Cloud is the data delivery platform owned by IEX Group, the financial technology company that also separately operates the Investors’ Exchange LLC (“IEX Exchange”), a U.S. securities exchange committed to serving all market participants. Since 2019, IEX Cloud has been setting new standards for easy delivery and use of financial data. It offers a flexible, accessible model for connecting developers with curated financial data and provides a high-performance API and custom-built services to help users build, launch, and scale their models, products, and businesses. Learn more at iexcloud.io.