Sarah M. Taylor
Lockheed Martin Corporation
What is Information Extraction?
Information Extraction (IE) is the process of automatically
identifying and labeling entities, relationships and events in
natural language text. IE is not a list-matching
process: it requires finding items of certain types, such as
person names, without knowing ahead of time what specific names will
be found. The output of IE can be text with markup included
within the text or a set of database entries.
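The difference between list matching and extraction by type can be sketched in a few lines. The rule below is a hypothetical toy, not one taken from any fielded system: it treats an honorific followed by capitalized words as a person name, so it finds names it has never seen before. The honorific set, the function names, and the sample sentence are all assumptions made for illustration. The sketch produces both output forms mentioned above, inline markup and database-style records.

```python
import re

# Hypothetical rule: an honorific followed by one or more capitalized words
# is treated as a person name. No list of known names is consulted.
PERSON_RULE = re.compile(
    r"\b(?:Mr|Ms|Mrs|Dr|Gen|Col)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*"
)

def extract_persons(text):
    """Return one database-style record per matched person name."""
    return [
        {"type": "PERSON", "text": m.group(0), "start": m.start(), "end": m.end()}
        for m in PERSON_RULE.finditer(text)
    ]

def mark_up(text):
    """Return the text with inline markup around each matched person name."""
    return PERSON_RULE.sub(lambda m: f"<PERSON>{m.group(0)}</PERSON>", text)

# Invented sample sentence, for illustration only.
sample = "Dr. Alice Moreno met Col. Brandt in Geneva."
records = extract_persons(sample)
tagged = mark_up(sample)
```

A real entity extractor would use far richer context than a single pattern, but the shape of the output is the same: typed spans that downstream tools can consume.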
IE may be best understood as having several levels of
difficulty. The simplest is entity extraction – the
identification of person names, place names, organization names, time
expressions, equipment names, facilities, currency amounts, and so
forth. Relationship extraction is the next level of difficulty
and includes the identification of links between two entities, such
as the fact that a certain person works for a certain company, or that
a certain type of equipment is produced at a certain facility.
There are two levels of event extraction. At the easier level
lies the extraction of basic events, as in a subject-verb-complement
triple – for example, “The President arrived in
Baghdad.” Complex event extraction means the identification
of all the elements of a complete event – such as the
perpetrator, victims, date, location, weapon, sponsoring organization,
and damage related to a terrorist incident.
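One way to picture these levels is as progressively richer record types. The sketch below is an assumption about how such output might be represented, not a description of any particular product: the class names, field names, and the unfilled_slots helper are all hypothetical. The incident slots mirror the elements listed above.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

@dataclass
class Entity:
    """Level 1: a typed span such as a person, place, or organization."""
    type: str
    text: str

@dataclass
class Relationship:
    """Level 2: a typed link between two entities."""
    type: str
    source: Entity
    target: Entity

@dataclass
class BasicEvent:
    """Level 3a: a subject-verb-complement triple."""
    subject: Entity
    verb: str
    complement: Entity

@dataclass
class IncidentEvent:
    """Level 3b: a complex event template; any slot may remain unfilled."""
    perpetrator: Optional[Entity] = None
    victims: List[Entity] = field(default_factory=list)
    date: Optional[str] = None
    location: Optional[Entity] = None
    weapon: Optional[str] = None
    sponsoring_organization: Optional[Entity] = None
    damage: Optional[str] = None

def unfilled_slots(event):
    """Return the names of template slots that are still empty."""
    return [f.name for f in fields(event) if not getattr(event, f.name)]

# "The President arrived in Baghdad." as a basic event.
president = Entity("PERSON", "The President")
baghdad = Entity("PLACE", "Baghdad")
arrival = BasicEvent(president, "arrived in", baghdad)

# A complex event with only one slot filled (illustrative values only).
incident = IncidentEvent(location=baghdad)
missing = unfilled_slots(incident)
```

The basic event needs only three fields, while the complex template has seven slots that usually must be assembled from several sentences, which is one reason it is the hardest level.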
Information Extraction technology can employ human-developed
rules, machine-learned patterns, or some combination of the
two. In either case, systems are developed using examples:
text of the type that the system is expected to process is tagged by hand
in the way that the end-user would like to see the output of the IE
application. For example, government analysts interested in
non-proliferation issues might mark open source newspaper text for
every business transaction that suggested the transfer of key
equipment or know-how to countries of concern. Machine learning
systems then learn from the patterns of this tagging. For
these systems, a large sample of marked up text is required, and the
system must be retrained for any new type of text or new subject area.
For human-developed rules, the rule developers, also called
Knowledge Engineers, learn from the hand-tagged text and from
discussions with the end-users what kinds of information the
application needs to discover. They then develop rules to find
that type of information in the text. While human rule
development requires skilled Knowledge Engineers (and Machine
Learning systems do not), typically less text must be marked
initially, and rules can be changed fairly quickly to adapt to new
texts and subject areas. There is some data to suggest that
the difference in initial and continuing investment between the two
types of systems is not large. Today, only human-developed, rule-based
systems appear to be reaching useful levels of accuracy for
complex event extraction.
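The development loop for hand-written rules can be sketched as follows, under stated assumptions. The rule, the sentences, and the scoring function are all invented for illustration; real hand-tagged corpora and rule languages are much larger and richer. The idea is only that a rule is checked against hand-tagged examples, and the misses show the Knowledge Engineer what to write next.

```python
import re

# Hypothetical hand-written rule for the "person works for company" link.
WORKS_FOR = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) (?:works for|is employed by) "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)

def extract_employment(sentence):
    """Return the set of (person, organization) links the rule finds."""
    return {(m.group("person"), m.group("org")) for m in WORKS_FOR.finditer(sentence)}

# Hand-tagged examples (invented): each sentence with the links a human marked.
HAND_TAGGED = [
    ("Jane Roe works for Acme Corp.", {("Jane Roe", "Acme Corp")}),
    ("John Doe is employed by Globex.", {("John Doe", "Globex")}),
    ("Acme Corp hired Mary Major last year.", {("Mary Major", "Acme Corp")}),
]

def score(rule_fn, examples):
    """Fraction of sentences where the rule output equals the hand tagging."""
    correct = sum(1 for sentence, gold in examples if rule_fn(sentence) == gold)
    return correct / len(examples)

def misses(rule_fn, examples):
    """Sentences the current rules get wrong; these guide the next rule."""
    return [sentence for sentence, gold in examples if rule_fn(sentence) != gold]

accuracy = score(extract_employment, HAND_TAGGED)
to_fix = misses(extract_employment, HAND_TAGGED)
```

In this toy set the rule handles the first two sentences and misses the third, which expresses the same link with "hired". Adding a rule for that phrasing is a quick change, which illustrates why rule sets can be adapted to new texts without retagging a large corpus.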
Why is IE important?
Fundamentally, IE brings predictable structure to the information in
natural language text. Meaning is encoded with a clearly
defined set of tags, or mark-ups, that is manageable by computer
systems. By focusing on specific subject domains and types of
text, IE systems are able to tame the enormous variety of human
language expression into something that is predictable and useful in
a large number of database and data mining applications. Text,
by itself, without this additional structure, cannot be used for link
analysis, data mining, time lines, geographic analysis, target
tracking and so forth. IE is the linchpin that enables us to
automate these follow-on analytic processes. Thus good IE is
imperative for any automation of analysis tasks.
What is the current state of the art in IE?
The basic concept of IE was developed in the 1980s. In the
middle of that decade, DARPA (Defense Advanced Research Projects
Agency) began funding the Message Understanding Conference (MUC) in
which a number of IE research organizations joined together to test
the performance of their systems against a common data set.
The initial tests were against short, tactical naval messages,
concerning movements of vessels and like activities. The MUC
continued, supported by the DARPA Tipster Program, through the late
1990s and has since been replaced by ACE (Automatic Content Extraction),
which is managed by NIST.
A number of reputable commercial products exist for IE
processing. Research in this area also continues to be funded
by various Intelligence Community and DoD organizations. The
simplest IE, entity extraction, is gradually becoming more
widespread, operating at an enterprise level in both DoD and some IC
agencies. Entity, event and relationship extraction of a
generic type – not heavily tuned to any particular
subject domain and thus not highly accurate – is now
incorporated in any number of commercial products for text handling,
such as Information Retrieval, categorization, data mining, and
geographic systems. There are a small number of operational
applications using highly tuned complex event extraction in the
Intelligence Community. However, they are not as widespread as
one would guess they should be, given the amounts of time and effort
they can save users. We have anecdotal evidence of one
analysis task being reduced from about 40 days to about 2 days; and
a recorded experiment in which a task was reduced from about 40 days
to about 6 days. Despite these examples, there seems to be
continued widespread skepticism that the technology is mature enough
to support analysis.
The accuracy of IE for entity extraction on open source text is
about 85% without particular tuning. With tuning to text and
domain, entity extraction accuracy can become nearly
indistinguishable from expert human performance (higher than
95%). This performance includes entity extraction on upper
case US Government message traffic. Testing in ACE, however,
is not showing particularly high figures (roughly 50% correct) for
more difficult problems, such as relationship extraction and co-reference
issues. Since we know of at least one operational system that
does much better than that (roughly 85%-90% correct) on an equally
difficult problem, there is some question as to why IE performance
is not better in the open tests. I can only speculate, but three
factors seem possible:
- ACE may be defined to test features that are not actually of great
operational significance, and therefore don’t show up in operational
systems;
- Research funding today heavily favors machine learning and statistical
approaches, which are less able to deal with the long range phenomena
required for complex extraction;
- Researchers may simply not be addressing IE with the same persistence
as it was addressed a decade ago due to a plethora of other issues
on their plates.
IE systems are available for many languages, although English is
most widely used in the US Government. Current key research
areas in IE are:
- Within-document co-reference – the ability to track
all the references to one entity throughout a document and
understand they are all references to the same thing
- Cross-source co-reference – the ability to
understand that an entity cited in one document is the same, or not,
as one cited in another
- Time normalization – the ability to understand time
references as related to GMT or some other standard
- Place normalization – the ability to tag any place
reference with geo-coordinates.
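As an illustration of time normalization, the sketch below resolves a relative expression such as "yesterday at 01:30" against the timestamp of the document it appears in and converts the result to UTC (GMT). It is a minimal sketch under assumptions: the function name, the three supported day words, and the sample document time are hypothetical, and it assumes the expression has already been recognized and split into a day word and a clock time.

```python
from datetime import datetime, timedelta, timezone

# Relative day words this sketch understands (an assumption, not a full grammar).
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def normalize_time(day_word, clock, document_time):
    """Resolve a relative day word and an HH:MM clock time to a UTC datetime.

    document_time must be timezone-aware; the expression is interpreted in
    the document's own local time zone.
    """
    key = day_word.strip().lower()
    if key not in DAY_OFFSETS:
        raise ValueError(f"unsupported time expression: {day_word!r}")
    hour, minute = (int(part) for part in clock.split(":"))
    local = (document_time + timedelta(days=DAY_OFFSETS[key])).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    return local.astimezone(timezone.utc)

# Hypothetical document written at 09:00 local time in a UTC+3 zone.
doc_time = datetime(2005, 3, 14, 9, 0, tzinfo=timezone(timedelta(hours=3)))
resolved = normalize_time("yesterday", "01:30", doc_time)
```

Here the local reference falls on 13 March, but the normalized value falls on 12 March at 22:30 UTC. That shift in calendar date is the kind of detail that matters when events from sources in different time zones are placed on a single timeline.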
Research into IE issues is being funded by at least DARPA, ARDA, AFRL,
NSA, and CIA. However, in a number of these cases IE is
only an adjunct to other, higher-profile research efforts.
Overall, the level of support for IE research does not appear to be
as focused or as substantial as it was in the 1990s.
Where should IE be going in the future?
I believe we know how to solve the issues currently under research
in IE. Not everything has been done yet, but there exist known
methods that can be applied to each of the current problems listed
above. Beyond addressing these current known issues, we must
also promote the acceptance and use of IE systems, given the linchpin
role that IE plays in enabling automatic support to
analysis. While accuracy is most often cited as the problem
with IE, in fact, I believe that is not the issue. The problem
is better defined as the effort required to get to useful levels of
accuracy, and that level of effort appears to be dropping more
rapidly than people realize. At least two factors obscure this:
- The difficulty of estimating costs before a system is put in
place. At the word “tuning,” purchasers may think
they hear the sound of years and years of rule development.
However, our experience suggests this is not the case.
Applications have remained stable over a number of years (3-4) as
long as the task itself has remained stable.
- IE applications require some investment of time from the
analysts, to carefully define the task that is being
automated. Like the design of a relational database,
this is not a natural approach for most people to take, and some of
the doubts about IE may be a resistance to carefully defining and
making systematic a process that in the past has often been more ad
hoc and intuitive.
These are fundamentally engineering issues. For research, the
future is the pursuit of a “deeper” understanding of
meaning in text. The IE we know today is aimed primarily at
factual understanding – who did what to whom and
when. However, the nuances of text are left undisturbed by
this approach. And in the end what we frequently want to know
is more subtle. What is a certain person’s attitude
toward the people, countries and policies that person is discussing?
Is he or she for or against more Sunni power in the emerging Iraqi
government? What are the interactions between various speakers
or writers in a chat session or an email chain? Who is leading
the dialog and pushing what opinions? Finally, what are the
clues to people’s intentions in their speech and
writing? Does certain text suggest the use of coded or
deliberately obscure language? Do certain kinds of writing or
speech suggest greater proclivity for violence? These types of
questions are just surfacing now in the research of language
understanding and IE technology will have a significant role to play
in exploring them more fully.