Hierarchy Detection

Hierarchy Detection identifies every paragraph, table, and section header and places each one in the appropriate position in the document’s hierarchical structure. The result is a highly detailed, table-of-contents-like tree that represents not just a document’s main topics but also each section’s subsections, sub-subsections, and so on.

What tools are used for Hierarchy Detection?

Hierarchy Detection uses a pipeline of machine learning models and hand-crafted algorithms that combines state-of-the-art Computer Vision, Natural Language Processing, and information retrieval techniques.

[Figure: Semantic Extract - Financial Statement]

[Figure: Semantic Extract - Hierarchy Detection]

What is document hierarchy?

Document Hierarchy is the structure we apply to the atomic sections of a document so that they are organised hierarchically: a tree structure that corresponds to our intuitions about how the document is structured.

The tree is ordered by reading order and begins with a “Root” node. It is essentially a much more detailed version of the table-of-contents navigation pane that some PDFs provide.
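As a rough illustration of the idea, the tree described here can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the Node class, its fields, and the walk helper are hypothetical names chosen for this example and are not the schema or API used by Semantic Extract.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One node in a hypothetical hierarchy tree (illustrative only)."""
    text: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        # Children are appended in reading order, so the list order
        # of `children` is the order in which a reader meets them.
        self.children.append(child)
        return child


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Pre-order traversal: yields (depth, text) pairs in reading order."""
    yield depth, node.text
    for child in node.children:
        yield from walk(child, depth + 1)


# Build a tiny tree that starts at "Root", as the text describes.
root = Node("Root")
intro = root.add(Node("1. Introduction"))
intro.add(Node("This report covers the year."))
root.add(Node("2. Results"))

# Print an indented outline, similar to a table of contents.
for depth, text in walk(root):
    print("  " * depth + text)
```

A pre-order walk is used because it visits a parent before its children and siblings from first to last, which matches reading order when children are stored in the order they appear on the page.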

An example of hierarchy in Semantic Extract

There are three types of sections you will notice in the hierarchy UI:

Tables: shown in blue in the UI. All atomic sections inside the bounds of a table are considered part of the same unit for the purposes of hierarchy and are tagged as a whole.
Titles: shown in bold in the UI. They denote sections that describe or summarise the sections that follow. A more precise definition is discussed in the next section.
Ordinary Sections: anything that is not a table or a title. These make up the majority of sections.
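The three section types above could be modelled roughly as follows. This is a hedged sketch, not the product's implementation: the bounding-box layout (x0, y0, x1, y1), the is_title flag (assumed to come from some upstream title detector), and the names SectionKind, AtomicSection, inside, and classify are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

# Assumed bounding-box layout: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


class SectionKind(Enum):
    TABLE = "table"        # shown in blue in the UI
    TITLE = "title"        # shown in bold in the UI
    ORDINARY = "ordinary"  # everything else


@dataclass
class AtomicSection:
    text: str
    box: Box
    is_title: bool = False  # hypothetical flag from an upstream detector


def inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def classify(section: AtomicSection, table_boxes: List[Box]) -> SectionKind:
    # Anything inside a table's bounds is treated as part of that table,
    # even if it would otherwise look like a title.
    if any(inside(section.box, table) for table in table_boxes):
        return SectionKind.TABLE
    if section.is_title:
        return SectionKind.TITLE
    return SectionKind.ORDINARY


tables = [(0.0, 100.0, 500.0, 300.0)]
heading = AtomicSection("Balance Sheet", (0.0, 50.0, 200.0, 70.0), is_title=True)
cell = AtomicSection("Total assets", (10.0, 120.0, 150.0, 140.0), is_title=True)
body = AtomicSection("Figures are in thousands.", (0.0, 320.0, 400.0, 340.0))

kinds = [classify(s, tables) for s in (heading, cell, body)]
```

Checking table containment first reflects the rule that atomic sections inside a table are tagged as a whole, so a bold header cell ends up as part of the table rather than as a title.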

[Figure: Example of hierarchy in Semantic Extract]