The US Patent and Trademark Office (USPTO), among other government agencies, has an interest in representing complex documents; see the Concept of Operations for the Distributed Object Computation Testbed (DOCT), sponsored by DARPA and USPTO. Among the "key DOCT benefits": "Capabilities include the ability to support information in a wide variety of formats (such as ASCII text, VRML, CGM, JPEG, MPEG, WAV, chemical expressions, mathematical equations, and biological sequences)." The work by SAIC on recognizing CWUs (complex work units) consisting of mathematics is restricted to 300 dpi monochrome images that are clean (unbroken, non-touching characters), accurate, and minimally skewed. The OCR is trained on about 350 "symbols", some of which are actually parts of a multiply-connected symbol. The OCR provides alternative recognition results in case of uncertainty. (An aside: after some exploration, SAIC determined that neural-network-based approaches were unsuitable, and so a typical ad hoc pattern-matching approach was used to recognize symbols.) The results are tree-like, with some recognition of the most common expression types (e.g. quotients). Further information has appeared in various similar memos online at SDSC; see, for example, the June 1997 research report on Task A2/A3, Automated SGML Tagging. It is not clear whether this is the final report or an intermediate progress report.