Components
Detection of blank pages
Classifies a document page as blank if it has no content, and as having content otherwise. Content showing through from the reverse side of the page, or minor content elements such as page numbers and pencil markings, may lead to incorrect classification.
Detailed description –> Description of the blank image detection component
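To make the failure modes above concrete, here is a minimal sketch of a naive blank-page heuristic. It is not the component's method, which is covered in the detailed description. It only assumes a grayscale page given as rows of pixel values (0 = black, 255 = white), and the names `ink_fraction`, `classify_page`, `dark_threshold` and `max_ink_fraction` are hypothetical.

```python
from typing import Sequence


def ink_fraction(page: Sequence[Sequence[int]], dark_threshold: int = 128) -> float:
    """Return the share of pixels darker than dark_threshold (0 = black, 255 = white)."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    dark = sum(1 for row in page for value in row if value < dark_threshold)
    return dark / total


def classify_page(
    page: Sequence[Sequence[int]],
    dark_threshold: int = 128,
    max_ink_fraction: float = 0.005,
) -> str:
    """Label a page "blank" when at most max_ink_fraction of its pixels are dark.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    if ink_fraction(page, dark_threshold) <= max_ink_fraction:
        return "blank"
    return "content"
```

With such a rule, a page whose only content is a small page number falls below the ink threshold and is labelled "blank", while an empty page with strong show-through from the reverse side exceeds it and is labelled "content". These are the same kinds of errors the description warns about.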
Identification of post-it notes
Classifies the document page into two classes based on whether or not the page contains one or more post-it notes. Content elements that resemble post-it notes, such as coloured rectangular text fields, may result in incorrect classification.
Detailed description –> Description of the post-it note identification component
Folded corner detection
Classifies the document page into two categories based on whether or not the page contains folded or torn sections. If a document corner resembles a fold or tear due to its colour or shape, this may lead to an incorrect classification.
Detailed description –> Description of the Folded Corner Detection component
Metadata identification
Identifies named entities in the output of the optical character recognition (OCR) performed on the document page, produces descriptive keywords for the text, and determines the language of the text content. Depending on the identified language, either Finnish or English models are used for metadata identification and keywording. Automated keywording uses the Annif software developed at the National Library of Finland (www.annif.org). Results depend on OCR quality, so factors such as poor image quality, text layout, and table-like content elements may affect the quality of metadata identification.
Detailed description –> Metadata component description
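The language-dependent choice of keywording model can be sketched as follows. This is only an illustration under stated assumptions, not the component's implementation: the base URL (a local Annif server) and the project IDs `example-fi` and `example-en` are hypothetical, and the language code is assumed to come from an earlier language identification step. The request targets the `suggest` method of the Annif REST API.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: the base URL and project IDs depend on the Annif installation.
ANNIF_API_BASE = "http://localhost:5000/v1"
PROJECT_BY_LANGUAGE = {"fi": "example-fi", "en": "example-en"}


def build_suggest_request(text: str, language: str, limit: int = 10) -> Request:
    """Build (but do not send) a keyword suggestion request for OCR text.

    The Annif project is chosen by the identified language of the text.
    Raising an error for other languages is an assumption of this sketch.
    """
    try:
        project_id = PROJECT_BY_LANGUAGE[language]
    except KeyError:
        raise ValueError(f"no keywording model for language {language!r}") from None
    url = f"{ANNIF_API_BASE}/projects/{project_id}/suggest"
    body = urlencode({"text": text, "limit": limit}).encode("utf-8")
    return Request(url, data=body, method="POST")
```

Passing the returned request to `urllib.request.urlopen` would send it to the server. The JSON response is expected to carry a `results` list of suggested subjects with labels and scores, but check the API documentation of the installed Annif version before relying on specific fields.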
Typeface recognition
Classifies the document page into one of three classes on the basis of its typefaces: 1) “typed” if the page contains only typed text, 2) “handwritten” if the page contains only handwritten text, and 3) “combined” if the page contains both. The combined class is the most difficult of the three for the model to identify, especially when the page contains little handwritten text.
Detailed description –> Description of the typeface recognition component
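The relationship between the three classes can be illustrated with a small sketch. It assumes, purely for illustration, that each text region on the page has already been labelled as typed or handwritten. The component may well classify the whole page directly, and the function name `page_class` is hypothetical.

```python
from typing import Iterable


def page_class(region_labels: Iterable[str]) -> str:
    """Derive the page class from per-region labels ("typed" or "handwritten")."""
    labels = set(region_labels)
    unknown = labels - {"typed", "handwritten"}
    if unknown:
        raise ValueError(f"unexpected region labels: {sorted(unknown)}")
    if not labels:
        # How pages without any text are handled is not specified in the source.
        raise ValueError("page has no text regions to classify")
    if labels == {"typed"}:
        return "typed"
    if labels == {"handwritten"}:
        return "handwritten"
    return "combined"
```

A single handwritten region is enough to move a page from "typed" to "combined". Overlooking one short handwritten note therefore yields the wrong class, which is consistent with the combined class being the hardest to identify.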