Blossary – computerized document analysis


computerized document analysis

Digital imaging is a mature technology that involves the acquisition, indexing, compression, storage, authentication, transmission, retrieval, and display of entire pages of documents by means of computers. Document image analysis attempts to go a step further: to extract the content of the digitized documents and convert it into a form suitable for digital processing. Document image analysis includes the processing of graphic documents such as engineering drawings, circuit schematics, organization charts, and maps, but this discussion is restricted to documents that consist mostly of text.

The extraction of content from the digital image requires two major phases. The first phase, identification of the logical or functional components of a document, is called layout analysis. The second phase, encoding of the glyphs (letters, numerals, mathematical symbols, and punctuation) into a computer representation such as the American Standard Code for Information Interchange (ASCII), Rich Text Format (RTF), or Unicode (an international character-encoding standard), is called optical character recognition. Both phases require prior knowledge of the underlying script. Furthermore, the accuracy of text recognition can be improved by postprocessing that applies linguistic constraints.
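
As a rough illustration of postprocessing with linguistic constraints, the sketch below snaps each recognized word to the closest entry in a word list. It is a minimal example under stated assumptions, not a description of any particular recognition system: the lexicon, the function names `correct_word` and `correct_text`, and the similarity cutoff of 0.75 are all hypothetical, and the matching relies on Python's standard `difflib` module.

```python
import difflib

# Hypothetical toy lexicon (an assumption for this sketch); a real system
# would use a full dictionary for the script and language of the document.
LEXICON = ["document", "image", "analysis", "layout", "character", "recognition"]


def correct_word(word, lexicon=LEXICON, cutoff=0.75):
    """Return the closest lexicon entry, or the word unchanged if none is close enough."""
    lowered = word.lower()
    if lowered in lexicon:
        return word
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and drops any candidate scoring below the cutoff.
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(token) for token in text.split())


if __name__ == "__main__":
    # "rn" misread for "m" and "1" misread for "l" are typical recognition confusions.
    print(correct_text("docurnent 1ayout analysis"))
```

With this toy lexicon, the misrecognized input is mapped to "document layout analysis". A word list is only the simplest kind of linguistic constraint; words with no close lexicon entry are left unchanged so that names and numbers are not overwritten.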
