UIMA — Hublog

UIMA stands for Unstructured Information Management Architecture.

UIMA lets a processing pipeline pass a document (a text document, for example) to multiple UIMA-compliant services and receive back a collection of annotations for that document, such as parts of speech or named entities.
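As a rough illustration of that pipeline idea, here is a minimal Python sketch. It does not use the UIMA SDK: the `Annotation` class, the two annotator functions and `run_pipeline` are all hypothetical names made up for this example, and the "services" are plain local functions rather than remote components.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    begin: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    type: str   # label such as "Token" or "NamedEntity"


def token_annotator(text):
    """Hypothetical service: marks every run of word characters as a Token."""
    return [Annotation(m.start(), m.end(), "Token")
            for m in re.finditer(r"\w+", text)]


def capitalised_annotator(text):
    """Hypothetical service: a naive stand-in for a named-entity recogniser."""
    return [Annotation(m.start(), m.end(), "NamedEntity")
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def run_pipeline(text, annotators):
    """Pass the same text to each annotator and collect all annotations."""
    annotations = []
    for annotate in annotators:
        annotations.extend(annotate(text))
    return annotations


document = "Apache hosts the UIMA project."
result = run_pipeline(document, [token_annotator, capitalised_annotator])
for a in result:
    print(a.type, repr(document[a.begin:a.end]))
```

Each annotator only reads the text and returns offsets into it, so annotators can be added, removed or reordered without changing the document itself.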

UIMA is in the final stages of being approved as a standard by OASIS. The working group has published the latest draft of the UIMA specification [PDF].

A UIMA SDK for Java/C++ was developed by IBM and open-sourced as an Apache Incubator project in 2006.

The University of Tokyo and the National Centre for Text Mining (NaCTeM, UK) have collaborated to produce U-Compare, a cluster-hosted repository of UIMA components, due to launch at the end of this month.

Other UIMA components are available.

UIMA uses SOAP for interaction between services.

From the draft specification:

In UIMA the original content is not affected in the analysis process. Rather, an object graph is produced that stands off from and annotates the content. Stand-off annotations in UIMA allow for multiple content interpretations of graph complexity to be produced, co-exist, overlap and be retracted without affecting the original content representation.
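To make the stand-off idea concrete, here is a small Python sketch under simplifying assumptions: annotations are plain (begin, end, type) records kept in a separate list, and the `Annotation` class and `covered_text` helper are hypothetical names, not part of UIMA. It shows two overlapping interpretations coexisting and one being retracted while the text stays unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    begin: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    type: str


text = "New York Times"
original = text  # kept only to show the content is never modified

# Two overlapping interpretations of the same content coexist.
annotations = [
    Annotation(0, 8, "Location"),       # covers "New York"
    Annotation(0, 14, "Organisation"),  # covers "New York Times"
]


def covered_text(annotation):
    """Look up the span an annotation points at; the text is only read."""
    return text[annotation.begin:annotation.end]


spans_before = [covered_text(a) for a in annotations]

# Retract one interpretation: only the annotation list changes.
annotations.remove(Annotation(0, 8, "Location"))

print(spans_before)
print([a.type for a in annotations])
print(text == original)
```

Because the annotations only hold offsets, nothing is ever written into the text, which is what allows conflicting analyses to sit side by side.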

CAS (Common Analysis Structure) objects are used to represent documents and annotations; these can be passed between components in a standardised XML serialisation called XMI (XML Metadata Interchange), which replaces the earlier XCAS format.
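The sketch below shows the general shape of passing a document and its annotations between components as XML. It is deliberately not real XMI: the element and attribute names (`cas`, `document`, `annotation`, `begin`, `end`, `type`) and the `to_xml` and `from_xml` functions are assumptions invented for illustration, and the actual schema is defined by the specification.

```python
import xml.etree.ElementTree as ET


def to_xml(text, annotations):
    """Serialise text plus (begin, end, type) tuples to a simplified XML string."""
    root = ET.Element("cas")  # hypothetical element names throughout
    ET.SubElement(root, "document", {"text": text})
    for begin, end, type_name in annotations:
        ET.SubElement(root, "annotation", {
            "type": type_name,
            "begin": str(begin),
            "end": str(end),
        })
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_string):
    """Rebuild the text and annotation tuples from the simplified XML."""
    root = ET.fromstring(xml_string)
    text = root.find("document").get("text")
    annotations = [
        (int(e.get("begin")), int(e.get("end")), e.get("type"))
        for e in root.findall("annotation")
    ]
    return text, annotations


text = "New York Times"
annotations = [(0, 8, "Location"), (0, 14, "Organisation")]

serialised = to_xml(text, annotations)
restored_text, restored_annotations = from_xml(serialised)
print(serialised)
```

A receiving component can parse the XML, add its own annotations and serialise the result again, so the document and everything known about it travel together.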