Extracting data tables from HTML & PDF documents is like using a rock as a hammer and a screw as a nail.
The above analogy sums up what we went through while building some of our data products. Equipped with sedimentary and metamorphic rocks as well as millions of slotted and cruciform screws, we set out a year ago, a bit naïvely, to extract and structure each and every data table available in the world’s public companies’ disclosures.
After all, billions of people keep producing and consuming tables because they arrange data in a familiar way that conveys information well. What could be problematic about building a technology that automatically identifies, extracts and structures the content of these tables from their original documents?
It turns out that the PDF and HTML file formats, while being the de facto standards for communicating business and scientific information, were primarily designed to display rich textual data for human readers. They allow for extreme layout flexibility, and their underlying content can be specified in countless ways, making it cumbersome to build robust algorithms that handle the large number of edge cases.
We documented our journey through the world of PDF and HTML data extraction. Below are visual examples highlighting some of the issues we encountered on the way to creating one of the world’s most comprehensive databases of numeric facts on publicly listed companies.
Extracting tables from HTML documents
In its mission to protect investors, maintain fair, orderly, and efficient markets, and facilitate capital formation, the U.S. Securities and Exchange Commission has historically established strict guidelines for public corporations’ financial disclosures, covering everything from content to document layout.
For instance, the Filer Manual – Volume II is a 917-page document that outlines the acceptable HTML tags a company can use along with their respective acceptable attributes. It also includes file naming conventions and thousands of other rules aimed at standardizing the financial reporting process and its output.
Theoretically, this is an awesome framework (along with the XBRL standard which we will discuss in another article) that has the potential to truly facilitate developers’ lives - providing them with an actual hammer and sharp nails. We therefore happily started building a custom library to parse HTML efficiently in order to extract the <table> elements from millions of filings in a reasonable time.
Practically, this ride was not as smooth as we planned. We encountered large bumps in the road, rocks in our shoes and screws in our tires. For example:
Rather than relying on <ul> and <li> tags, the list shown above builds each bullet point as an individual table containing 1 row and 4 columns, with the bullet located in the second column and the text content in the fourth.
Some developers (or more probably the HTML generation software used by the people preparing these filings) treat the <table> tag as a styling element rather than as an ordered arrangement of data in rows and columns. This is no rare occurrence: the pattern has been detected in almost all SEC filings.
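To make the bullet-point pattern described above concrete, here is a minimal rule-based sketch that flags single-row tables whose only non-empty cells are a bullet glyph followed by text. It is an illustration only, not the classifier used in our pipeline; the `TableRows` and `looks_like_bullet_table` names and the `BULLET_GLYPHS` set are hypothetical choices.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of glyphs used as bullets.
BULLET_GLYPHS = {"•", "·", "●", "○", "▪", "■", "-", "*"}


class TableRows(HTMLParser):
    """Collect the text of every cell, grouped by row, for one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # strip() also removes non-breaking spaces used as padding
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def looks_like_bullet_table(html):
    """Return True for a one-row table holding only a bullet glyph and text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if len(parser.rows) != 1:
        return False
    non_empty = [cell for cell in parser.rows[0] if cell]
    return len(non_empty) == 2 and non_empty[0] in BULLET_GLYPHS


bullet = ("<table><tr><td>&nbsp;</td><td>&#8226;</td>"
          "<td>&nbsp;</td><td>Risk factors</td></tr></table>")
print(looks_like_bullet_table(bullet))  # True
```

A rule like this only covers the one layout shown here, which is exactly why a hand-written check does not scale to the variety found in real filings.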
We thought, “Let’s solve this issue by building a robust machine learning binary classifier to detect and discard these unfaithful data tables.” (NB: we were excited to achieve 99.7% classification accuracy on a 50k sample after a few days).
Alas, the SEC filings had more surprises for us:
The right-hand side of the picture above is a screenshot of the code behind the cell highlighted in blue on the left-hand side.
You are reading that right: all the visually discernible rows are nested within a single row! The fix was easy: simply split on the <br> tags to obtain a faithful representation of the visual table, and this edge case was handled.
Until we ran into another exception. And another one. And another one again. The list goes on. We found hundreds, if not thousands, of exceptions that we had to handle properly if we wanted to be really serious about our data extraction process.
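The <br> fix mentioned above can be sketched as follows, using only Python's standard `html.parser` module. This is a simplified illustration under the assumption that every cell of the offending row holds the same number of <br>-separated lines; the `BrAwareCells` and `expand_br_rows` names are hypothetical, not taken from our library.

```python
from html.parser import HTMLParser
from itertools import zip_longest


class BrAwareCells(HTMLParser):
    """Collect each table cell as a list of <br>-separated lines."""

    def __init__(self):
        super().__init__()
        self.rows = []       # rows -> cells -> lines
        self._lines = None   # lines of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._lines = [""]
        elif tag == "br" and self._lines is not None:
            self._lines.append("")   # a <br> starts a new visual line

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._lines is not None:
            self.rows[-1].append([line.strip() for line in self._lines])
            self._lines = None

    def handle_data(self, data):
        if self._lines is not None:
            self._lines[-1] += data


def expand_br_rows(html):
    """Turn each coded row into one row per visual line."""
    parser = BrAwareCells()
    parser.feed(html)
    parser.close()
    expanded = []
    for cells in parser.rows:
        # Transpose the per-cell line lists; pad short cells with "".
        for visual_row in zip_longest(*cells, fillvalue=""):
            expanded.append(list(visual_row))
    return expanded


html = ("<table><tr><td>Revenue<br>Costs<br>Profit</td>"
        "<td>100<br>60<br>40</td></tr></table>")
print(expand_br_rows(html))
# [['Revenue', '100'], ['Costs', '60'], ['Profit', '40']]
```

When cells carry different numbers of lines, the padding silently misaligns values, which is one of the ways a fix like this breeds further exceptions.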
For example, below is another difficulty that one can expect to run into regularly with these HTML documents. The human eye sees this table as one, yet it was coded as two individual tables with no relationship to one another. As a result, the bottom table highlighted in blue cannot be structured properly without the header information of the top table.
We also encountered nested <table> tags even though the SEC explicitly indicates on page 69 of its 917-page manual that “No nested <TABLE> tags” are allowed.
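Checking a filing for nested <table> tags is straightforward. The sketch below tracks the nesting depth with Python's standard `html.parser`; the `NestedTableFinder` and `has_nested_tables` names are hypothetical, and the check assumes that <table> tags are properly closed.

```python
from html.parser import HTMLParser


class NestedTableFinder(HTMLParser):
    """Track how deeply <table> elements are nested in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of currently open <table> tags
        self.max_depth = 0   # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> is matched as well.
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def has_nested_tables(html):
    """Return True if any <table> is opened inside another <table>."""
    finder = NestedTableFinder()
    finder.feed(html)
    finder.close()
    return finder.max_depth > 1


nested = "<TABLE><TR><TD><TABLE><TR><TD>1</TD></TR></TABLE></TD></TR></TABLE>"
print(has_nested_tables(nested))  # True
```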
In the end, our team decided to shift the tabular data extraction process completely towards an end-to-end deep learning approach. We leveraged computer vision to identify relevant tables and Natural Language Processing to structure them. Our work on parsing the HTML trees still proved useful, however, as it allowed us to generate the labels that fueled our networks in a semi-supervised fashion.
Note that an end-to-end deep learning approach is not necessarily better at handling edge cases. We simply found it more maintainable, and it had a huge advantage over an advanced HTML parsing approach: it gave us much more flexibility, as we could apply our AI models to multiple file formats, which leads us to the second part of this article.
Extracting tables from PDF documents
The good thing about SEC filings is their high degree of standardization, which brings our data science pipelines’ generalization error close to zero. When it comes to PDF documents, the story is completely different.
98% of the biggest European public companies choose the PDF standard to share their annual reports, Corporate Social Responsibility disclosures, investor presentations and other documents with the public.
More than providing business, societal and financial information to researchers and investors, these documents also have another purpose: to seduce them. As a matter of fact, while SEC filings are mostly composed of monochromatic text and tables, these documents include beautifully designed graphs, charts, tables, backgrounds and pictures.
By definition, every company has a different marketing strategy and positioning to differentiate itself from its peers. This translates directly into widely different marketing materials with a near-endless variety of layouts, typesetting and formatting options, challenging the generalization power of our table identification algorithms.
For instance, it is not uncommon to encounter data tables such as the one above that embed all sorts of graphs within certain rows and columns to improve human readability at the expense of machine readability. In case you are wondering, our solution here was to detect the presence of images within a table’s bounding box coordinates and filter these regions out.
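The image-filtering idea above can be illustrated with a few lines of bounding-box arithmetic. This is a minimal sketch, assuming that the table, the images and the words on a page are already available as axis-aligned boxes in the same coordinate system; the `Box` type and the function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box with x0 < x1 and y0 < y1."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a, b):
    """Return True if the two boxes share some area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def contains_point(box, x, y):
    """Return True if the point (x, y) lies inside the box."""
    return box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1


def drop_image_regions(table, images, words):
    """Keep the words of a table whose center is not inside an embedded image.

    `words` is a list of (text, Box) pairs.
    """
    embedded = [img for img in images if overlaps(table, img)]
    kept = []
    for text, box in words:
        cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
        if not contains_point(table, cx, cy):
            continue  # word lies outside the table
        if any(contains_point(img, cx, cy) for img in embedded):
            continue  # word belongs to a chart or picture region
        kept.append(text)
    return kept


# Made-up coordinates, for illustration only.
table = Box(0, 0, 100, 100)
images = [Box(60, 10, 90, 30)]
words = [("Revenue", Box(5, 12, 40, 22)), ("axis label", Box(65, 15, 85, 25))]
print(drop_image_regions(table, images, words))  # ['Revenue']
```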
Overall, the most effective process we found to achieve outstanding results in table detection within PDF documents is iterative: label lots of data, train an instance segmentation algorithm, find ways to detect false positives and false negatives and measure their prevalence, fix the most frequent ones, and start the loop over and over again… until you get a model that is able to identify tables with a similar degree of precision:
Table or not table – how would you have labeled this one?
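The error-hunting step of the iterative loop described above can be sketched by matching predicted table regions against labeled ones. The example below is an assumption-laden simplification: it compares axis-aligned boxes given as (x0, y0, x1, y1) tuples rather than segmentation masks, uses greedy matching, and picks an intersection-over-union threshold of 0.5 purely for illustration. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_errors(predicted, labeled, threshold=0.5):
    """Return (false_positives, false_negatives) for one page.

    Each prediction is greedily matched to the unmatched label it
    overlaps most; it counts as correct when the IoU reaches the threshold.
    """
    unmatched = list(labeled)
    false_positives = 0
    for pred in predicted:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
        else:
            false_positives += 1
    # Labels left unmatched are tables the model missed.
    return false_positives, len(unmatched)


# Made-up boxes, for illustration only.
predicted = [(0, 0, 10, 10), (50, 50, 60, 60)]
labeled = [(0, 0, 10, 9), (100, 100, 120, 120)]
print(count_errors(predicted, labeled))  # (1, 1)
```

Aggregating these counts per document layout is one way to see which failure modes occur most often and deserve the next round of labeling.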
Even if our computer vision models achieved 100% accuracy in detecting data tables, the PDF file format would still make the extraction and structuring steps the most complex part, given its flexible nature. PDFs consist of objects (mostly dictionaries) and instruction streams that result in a haphazard soup of floating characters on each page.
Sometimes the soup is not that beautiful, as illustrated below:
The green bars in the above table are highlighted text blocks that are visually hidden from the reader but programmatically present within the table. Their content is the following:
“All of the biggest technological inventions created by man - the airplane, the automobile, the computer - says little about his intelligence, but speaks volumes about his laziness. - Mark Kennedy”
... Although our team is always excited to discover inspirational quotes, catch secret information or crack encrypted messages, these hidden text blocks mostly contained things like:
“of the Group. The Company intends to lodge its NGER Report for the Group for the period FY2015 in October 2015. An energy target has been set for the first time for period FY2016, at a 2.5% reduction of total carbon dioxide equivalents (tCO2-e)”
We noticed that these hidden elements were always present on other pages of the documents. After quite some research and trial and error, we found that the most effective way to deal with this issue was to run the problematic pages through Optical Character Recognition (OCR) algorithms.
To finish this article on a high note, below are two screenshots of one table that bears not one, but two complexities:
First of all, the above table is split vertically over two pages, making the reconciliation process surprisingly complex. What’s more, the blue bounding box in the bottom image showcases another difficulty: a white-space character is present between the words “shares” and “Julius”, so they are rendered as a single text block even though they belong to different columns. This is problematic because our initial method for reconstructing a table into cells merged words separated by a white-space character into the same cell. Applying this heuristic to the above table would therefore merge two distinct columns into one.
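To illustrate why the white-space heuristic described above breaks down, here is a minimal sketch of such a merging rule. It assumes each word on a line comes with its left and right x-coordinates and treats any gap up to `max_gap` as an ordinary space; the function name, the threshold and all coordinates are made up for the example.

```python
def merge_words_into_cells(words, max_gap=3.0):
    """Merge words on one text line into cells.

    `words` is a list of (text, x0, x1) tuples. Two neighbouring words are
    put in the same cell when the horizontal gap between them is at most
    `max_gap`, i.e. when they look like they are separated by one space.
    """
    cells = []
    prev_x1 = None
    for text, x0, x1 in sorted(words, key=lambda w: w[1]):
        if prev_x1 is not None and x0 - prev_x1 <= max_gap:
            cells[-1] += " " + text   # same cell as the previous word
        else:
            cells.append(text)        # gap is wide enough: new cell
        prev_x1 = x1
    return cells


# Two columns separated by a wide gap are split correctly...
print(merge_words_into_cells([("shares", 63, 95), ("Julius", 140, 167)]))
# ['shares', 'Julius']

# ...but a single space between the columns merges them into one cell.
print(merge_words_into_cells([("shares", 63, 95), ("Julius", 98, 125)]))
# ['shares Julius']
```

No value of `max_gap` fixes the second case, because the gap between the two columns is identical to the gap between two words of the same cell.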
We learned the hard way that extracting data from PDF and HTML documents requires discarding any assumptions about their structure. Ultimately, we shifted to Natural Language Processing models to recognize the structural body cells in detected tables, achieving stronger results than heuristic-based methods.
We encountered many more issues related to PDF data extraction, but we will not go over all of them here. If you liked our content, let us know and we will share some techniques and other insights on the subject in the coming weeks.