Information Extraction

Sarah M. Taylor
Lockheed Martin Corporation

What is Information Extraction?
Information Extraction (IE) is the process of automatically identifying and labeling entities, relationships and events in natural language text.  IE is not a list matching process.  It requires finding items of certain types, such as person names, without knowing ahead of time what specific names will be found. The output of IE can be text with markup included within the text or a set of database entries.
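
To make the contrast with list matching concrete, here is a minimal sketch of a contextual rule that finds person names it has never seen before, and that produces both kinds of output described above (inline markup and database-style entries). The cue words, the tag name PERSON, and the function names are assumptions chosen for illustration; an operational system would rely on far richer context than a single pattern.

```python
import re

# Hypothetical cue words: a title signals that a person name follows.
TITLE = r"(?:Mr|Mrs|Ms|Dr|Gen|Col|President|Minister)\.?"
# One or more capitalized words; no list of known names is consulted.
NAME = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"
PERSON = re.compile(rf"\b{TITLE}\s(?P<name>{NAME})")


def extract_persons(text):
    """Return database-style entries, one per person name found."""
    return [
        {
            "type": "PERSON",
            "text": m.group("name"),
            "start": m.start("name"),
            "end": m.end("name"),
        }
        for m in PERSON.finditer(text)
    ]


def mark_up(text):
    """Return the text with inline markup around each person name."""
    def tag(m):
        # Keep the title as written and wrap only the name itself.
        prefix = m.group(0)[: m.start("name") - m.start()]
        return f"{prefix}<PERSON>{m.group('name')}</PERSON>"

    return PERSON.sub(tag, text)


sample = "Dr. Amal Haddad met Minister Ruiz in Geneva."
print(extract_persons(sample))
print(mark_up(sample))
```

Note that "Geneva" is capitalized but is not tagged, because the rule keys on context rather than on capitalization alone.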

IE may be best understood as having several levels of difficulty.  The simplest is entity extraction – the identification of person names, place names, organization names, time expressions, equipment names, facilities, currency amounts, and so forth.  Relationship extraction is the next level of difficulty and includes the identification of links between two entities, such as the fact that a certain person works for a certain company, or that a certain type of equipment is produced at a certain facility.  There are two levels of event extraction.  At the easier level lies the extraction of basic events, as in a subject-verb-complement triple – for example, “The President arrived in Baghdad.”  Complex event extraction means the identification of all the elements of a complete event – such as the perpetrator, victims, date, location, weapon, sponsoring organization, and damage related to a terrorist incident.
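
The four levels can be pictured as increasingly rich record types. The sketch below is one possible representation, not a standard schema: the class names, field names, and slot names are assumptions drawn from the examples in the paragraph above, and the filled-in values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Level 1: a typed span such as a person, place, or organization."""
    type: str
    text: str


@dataclass(frozen=True)
class Relation:
    """Level 2: a typed link between two entities."""
    type: str
    arg1: Entity
    arg2: Entity


@dataclass(frozen=True)
class BasicEvent:
    """Level 3: a subject-verb-complement triple."""
    subject: Entity
    verb: str
    complement: Entity


@dataclass
class ComplexEvent:
    """Level 4: a template whose slots together describe a complete event."""
    type: str
    slots: dict = field(default_factory=dict)

    def missing(self, required):
        """List the required slots that have not been filled yet."""
        return [name for name in required if name not in self.slots]


# Slot names taken from the terrorist-incident example in the text.
INCIDENT_SLOTS = [
    "perpetrator", "victims", "date", "location",
    "weapon", "sponsoring_organization", "damage",
]

arrival = BasicEvent(Entity("PERSON", "The President"), "arrived in",
                     Entity("PLACE", "Baghdad"))

# Placeholder values only; a partially filled template is the common case.
incident = ComplexEvent("TERRORIST_INCIDENT", {
    "perpetrator": Entity("ORGANIZATION", "Group X"),
    "date": "date placeholder",
    "location": Entity("PLACE", "City Y"),
})
print(incident.missing(INCIDENT_SLOTS))
```

The `missing` helper hints at why complex event extraction is the hardest level: every slot must be found and tied to the same event, often across several sentences.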

Information Extraction technology can employ human-developed rules, machine-learned patterns, or some combination of the two.  In every case, systems are developed using examples: text of the type that the system is expected to process is tagged by hand in the way that the end-user would like to see the output of the IE application.  For example, government analysts interested in non-proliferation issues might mark open source newspaper text for every business transaction that suggests the transfer of key equipment or know-how to countries of concern.  Machine learning systems then learn from the patterns of this tagging.  For these systems, a large sample of marked-up text is required, and the system must be retrained for any new type of text or new subject area.  For human-developed rules, the rule developers, also called Knowledge Engineers, learn from the hand-tagged text and from discussions with the end-users what kinds of information the application needs to discover.  They then develop rules to find that type of information in the text.  While human rule development requires skilled Knowledge Engineers (and machine learning systems do not), typically less text must be marked initially, and rules can be changed fairly quickly to adapt to new texts and subject areas.  There is some data to suggest that the difference in initial and continuing investment between the two types of systems is not large.  Today, only human-developed, rule-based systems appear to be reaching useful levels of accuracy for complex event extraction.

Why is IE important?
Fundamentally, IE brings predictable structure to the information in natural language text.  Meaning is encoded with a clearly defined set of tags, or mark-ups, that is manageable by computer systems.  By focusing on specific subject domains and types of text, IE systems are able to tame the enormous variety of human language expression into something that is predictable and useful in a large number of database and data mining applications.  Text by itself, without this additional structure, cannot be used for link analysis, data mining, time lines, geographic analysis, target tracking and so forth.  IE is the linchpin that enables us to automate these follow-on analytic processes.  Thus good IE is imperative for any automation of analysis tasks.

What is the current state of the art in IE?
The basic concept of IE was developed in the 1980s.  In the middle of that decade, DARPA (Defense Advanced Research Projects Agency) began funding the Message Understanding Conference (MUC) in which a number of IE research organizations joined together to test the performance of their systems against a common data set.  The initial tests were against short, tactical naval messages, concerning movements of vessels and like activities.  The MUC continued, supported by the DARPA Tipster Program, through the late 1990s and has since been replaced by ACE (Automatic Content Extraction), which is managed by NIST.

A number of reputable commercial products exist for IE processing.  Research in this area also continues to be funded by various Intelligence Community and DoD organizations.  The simplest IE, entity extraction, is gradually becoming more widespread, operating at an enterprise level in both DoD and some IC agencies.  Entity, event and relationship extraction of a generic type – not heavily tuned to any particular subject domain and thus not highly accurate – is now incorporated in any number of commercial products for text handling, such as Information Retrieval, categorization, data mining, and geographic systems.  There are a small number of operational applications using highly tuned complex event extraction in the Intelligence Community.  However, they are not as widespread as one would guess they should be, given the amounts of time and effort they can save users.  We have anecdotal evidence of one analysis task being reduced from about 40 days to about 2 days; and a recorded experiment in which a task was reduced from about 40 days to about 6 days.  Despite these examples, there seems to be continued widespread skepticism that the technology is mature enough to support analysis.

The accuracy of IE for entity extraction on open source text is about 85% without particular tuning.  With tuning to text and domain, entity extraction accuracy can rise until it is nearly indistinguishable from expert human performance (e.g., higher than 95%).  This performance includes entity extraction on upper case US Government message traffic.  Testing in ACE, however, is not showing particularly high figures (roughly 50% correct) for more difficult problems, such as relationship extraction and co-reference issues.  Since we know of at least one operational system that does much better than that (roughly 85%-90% correct) on an equally difficult problem, there is some question as to why IE performance is not better in the open tests.  I speculate that three factors may be at work:

  • ACE may be defined to test features that are not actually of great operational significance, and therefore don’t show up in operational systems;
  • Research funding today heavily favors machine learning and statistical approaches, which are less able to deal with the long range phenomena required for complex extraction;
  • Researchers may simply not be addressing IE with the same persistence as it was addressed a decade ago due to a plethora of other issues on their plates.
IE systems are available for many languages, although English is most widely used in the US Government.  Current key research areas in IE are:
  • Within document co-reference – the ability to track all the references to one entity throughout a document and understand they are all references to the same thing
  • Across source co-reference – the ability to understand that an entity cited in one document is the same, or not, as one cited in another
  • Time normalization – the ability to understand time references as related to GMT or some other standard
  • Place normalization – the ability to tag any place reference with geo-coordinates.
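
As an illustration of the time normalization item above, the following sketch converts a local time reference, already extracted from text, to GMT/UTC. The offset table and the function name are assumptions for illustration only; the fixed offsets ignore daylight-saving rules, which a real system would take from a full time zone database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets from GMT, for illustration only.
# They ignore daylight-saving changes and historical rule changes.
OFFSETS = {
    "Baghdad": timedelta(hours=3),
    "Washington": timedelta(hours=-5),
}


def normalize_time(local_stamp, place):
    """Convert a local 'YYYY-MM-DD HH:MM' reference to an ISO string in UTC."""
    local = datetime.strptime(local_stamp, "%Y-%m-%d %H:%M")
    # Attach the place's offset, then shift the value to UTC.
    aware = local.replace(tzinfo=timezone(OFFSETS[place]))
    return aware.astimezone(timezone.utc).isoformat()


print(normalize_time("2005-03-14 09:30", "Baghdad"))
print(normalize_time("2005-01-10 22:00", "Washington"))
```

The second call shows why normalization matters for time lines: a late-evening local event falls on the following calendar day once it is expressed in GMT.
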
Research into IE issues is being funded by at least DARPA, ARDA, AFRL, NSA, and CIA.  However, in a number of these cases IE is only an adjunct to other, higher-profile research efforts.  Overall, the level of support for IE research does not appear to be as focused or as substantial as it was in the 1990s.

Where should IE be going in the future?
I believe we know how to solve the issues currently under research in IE.  Not everything has been done yet, but there exist known methods that can be applied to each of the current problems listed above.  Beyond addressing these current known issues, we must also promote the acceptance and use of IE systems, given the linchpin role that IE plays in enabling automatic support to analysis.  While accuracy is most often cited as the problem with IE, I believe that is not the real issue.  The problem is better defined as the effort required to get to useful levels of accuracy, and that level of effort appears to be dropping more rapidly than people are aware.  The perception of high effort persists for at least these reasons:

  • The difficulty of estimating costs before a system is put in place.  At the word “tuning,” purchasers may think they hear the sound of years and years of rule development.  However, our experience suggests this is not the case.  Applications have remained stable over a number of years (3-4) as long as the task itself has remained stable.
  • IE applications require some investment of time from the analysts, to carefully define the task that is being automated.  Like the design of a relational database, this is not a natural approach for most people to take, and some of the doubts about IE may be a resistance to carefully defining and making systematic a process that in the past has often been more ad hoc and intuitive.
These are fundamentally engineering issues.  For research, the future is the pursuit of a “deeper” understanding of meaning in text.  The IE we know today is aimed primarily at factual understanding – who did what to whom and when.  However, the nuances of text are left untouched by this approach.  In the end, what we frequently want to know is more subtle.  What is a certain person’s attitude toward the people, countries and policies he is discussing?  Is she for or against more Sunni power in the emerging Iraqi government?  What are the interactions between various speakers or writers in a chat session or email chain?  Who is leading the dialog and pushing what opinions?  Finally, what are the clues to people’s intentions in their speech and writing?  Does certain text suggest the use of coded or deliberately obscure language?  Do certain kinds of writing or speech suggest greater proclivity for violence?  These types of questions are just surfacing now in the research of language understanding, and IE technology will have a significant role to play in exploring them more fully.

