Haven’t I Met You Before? Cross-Document Coreference Resolution

The Dude: Nobody calls me Lebowski. You got the wrong guy. I’m the Dude, man.
Blond Treehorn Thug: Your name’s Lebowski, Lebowski. Your wife is Bunny.
The Dude: My… my wi-, my wife, Bunny? Do you see a wedding ring on my finger? Does this place look like I’m married?

The example above highlights the importance of coreference resolution, the task of sorting out names in text that appear to refer to the same real-world thing but don’t (as in the example above), or that appear to refer to different things but actually refer to the same one (“Arnold Schwarzenegger” and “the Governator”). This blog post talks about our work on a coreference-resolution system: why we’re building it, and how we’re going about it.

A search for “Michael Jackson” in our demonstration program distinguishes between documents about the famous King of Pop and those about the chairman of AutoNation.

Coreference systems come in two flavors: in-document (indoc) or cross-document. Indoc systems worry about reference chains like “President Barack Obama,” “Obama,” “The President,” and even “him” when they all appear in the same document. These are referred to as “coreference chains” because all the mentions referring to Obama are chained together. Cross-document coreference (CDCR) systems worry about names that corefer across whole documents. For example, two documents containing “George Bush” might be referring to one president, two presidents, or other people altogether.

CDCR is essential to large-scale exploitation of information in documents. Names are critical components of document content, and if you can’t group together name references that refer to the same real thing, then you can’t really figure out what the documents are telling you.

CDCR involves disambiguating names (e.g., Elizabeth Taylor the actress and Elizabeth Taylor the journalist) and coping with disparate documents: different sources and domains, different writing styles, different languages, and so on. CDCR relies heavily on document similarity, so it’s a natural fit for clustering algorithms. Most CDCR systems build on the approach proposed by Bagga and Baldwin in 1998, which in general consists of (1) linking the related mentions within each document into chains (a process called “creating coreference chains”); (2) extracting contextual features for each chain; and (3) clustering the chains across documents. In most systems, the last step also includes generating likely clustering candidates for every given name.
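To make those steps concrete, here is a minimal sketch in Python (not our actual system), assuming step (1), in-document coreference, has already produced one chain per document. Each chain’s surrounding text becomes a TF-IDF vector, and the vectors are clustered with scikit-learn; the Chain class and the distance threshold are illustrative assumptions.

    # Minimal sketch of steps (2) and (3) of a Bagga-and-Baldwin-style pipeline.
    # Assumes in-document coreference (step 1) already produced one Chain per document.
    from dataclasses import dataclass
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    @dataclass
    class Chain:
        doc_id: str
        name: str      # e.g., "George Bush"
        context: str   # text surrounding the in-document chain

    def cluster_chains(chains, distance_threshold=0.8):
        """Group in-document chains that likely refer to the same real-world entity."""
        # Step (2): contextual features -- one TF-IDF vector per chain's context.
        vectors = TfidfVectorizer(stop_words="english").fit_transform(
            c.context for c in chains).toarray()
        # Step (3): cluster the chains; chains whose contexts fall within the
        # cosine-distance threshold end up in the same cross-document entity.
        # (On scikit-learn versions before 1.2, use affinity= instead of metric=.)
        labels = AgglomerativeClustering(
            n_clusters=None, metric="cosine", linkage="average",
            distance_threshold=distance_threshold).fit_predict(vectors)
        entities = {}
        for chain, label in zip(chains, labels):
            entities.setdefault(label, []).append(chain)
        return list(entities.values())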

While it sounds trivial, the task is actually quite challenging. Like many natural language processing (NLP) problems, CDCR depends on other components such as named entity recognition (NER), in-document coreference, and matching of similar names. These are not easy tasks when dealing with documents from different sources and writers, and if any of these three components performs poorly, the CDCR system will too. When it comes to CDCR, two problems in particular make it so interesting:

Disambiguation

Disambiguation is the task of distinguishing between the two people named “George Bush,” or even the many people named “Joe Smith.” A system does this by extracting contextual hints for every name. Humans look at the “gestalt” of an entire document (or at least a paragraph or so) and get an immediate gut sense of “what it’s about” and whether two documents are talking about the same thing or two different things. Translating that into an automatic procedure is difficult: the critical information is sparse, and it is often more implied than explicit.

Extracting context from a small window might miss important hints, while a large window can introduce too much noise. One of the top-performing systems at the WePS-3 Person Name Disambiguation Task addressed this by incorporating the distance of a term from a mention into its weighting scheme, so that closer terms are weighted higher (Long and Shi, CLEF 2010). While this is a good idea, it’s also worth investing in linguistic features. For example, a system may extract appositives as an indication of important content; in the sentence “Mark Zuckerberg, the founder of Facebook, …”, “the founder of Facebook” is an appositive phrase.
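To illustrate the distance-weighting idea, here is a small Python sketch that collects context terms around a mention and weights nearer terms more heavily; the 1/(1 + distance) weight is just an assumption for illustration, not the exact scheme from the WePS-3 system.

    from collections import Counter

    def weighted_context(tokens, mention_index, window=50):
        """Collect context terms around a name mention, weighting each term by its
        distance from the mention so that closer terms count more. The 1/(1 + distance)
        weight is an illustrative choice only."""
        weights = Counter()
        for i, token in enumerate(tokens):
            distance = abs(i - mention_index)
            if token.isalpha() and 0 < distance <= window:
                weights[token.lower()] += 1.0 / (1.0 + distance)
        return weights

    # Terms right next to "Zuckerberg" (index 1) outweigh more distant ones.
    tokens = "Mark Zuckerberg , the founder of Facebook , spoke at a conference".split()
    print(weighted_context(tokens, mention_index=1).most_common(3))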

And there are other approaches, too. Entity linking is a hot topic in NLP these days. Entity linking involves linking name mentions in text to appropriate entries in a knowledge base such as Wikipedia. Once the system successfully links a mention, the knowledge base becomes another source of information to help in disambiguation. Entity linking is, in a sense, a more realistic framing of the CDCR problem: in the real world, there’s usually some sort of knowledge base to work with for at least some of the names in a body of text.
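As a toy illustration of how a linked knowledge-base entry supplies extra disambiguating context, the sketch below ranks hypothetical candidate entries for the “Michael Jackson” example from earlier by how well their descriptions overlap the mention’s context. Real entity linkers use much richer signals (popularity priors, document-level coherence, embeddings), and the entries here are made up.

    def link_mention(context_terms, kb_candidates):
        """Pick the knowledge-base entry whose description best overlaps the mention's
        context terms; return None if nothing overlaps at all. A toy ranking only."""
        def overlap(entry):
            return len(context_terms & set(entry["description"].lower().split()))
        best = max(kb_candidates, key=overlap)
        return best if overlap(best) > 0 else None

    # Hypothetical knowledge-base entries (ids and descriptions are invented).
    kb = [
        {"id": "mj_singer",     "description": "American singer known as the King of Pop"},
        {"id": "mj_autonation", "description": "American businessman and chairman of AutoNation"},
    ]
    context = {"autonation", "chairman", "dealership", "cars"}
    print(link_mention(context, kb))  # picks the AutoNation entry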

Not All Names Are Created Equal

The “buttered-side-down” principle applies to CDCR as it does to all NLP problems: any system you build will make some mistakes that seem incomprehensible to people. Consider the Hawaiian musician Israel Kamakawiwo’ole on the one hand, and Joe Smith on the other. If the system pays too much attention to context (good contextual hints can be “singer” or “Hawaii”), it might mistakenly decide that there are two of those Israels when in fact there is only one (recall errors). This error might happen if Israel is mentioned in one document only as a musician and in another only as a Hawaiian, and the documents discuss different topics. A person will find it ridiculous that the system even remotely considered the possibility that two people have that name. On the other hand, a system that takes context less seriously is likely to link many examples of “Joe Smith” to each other, risking derision from a human looking at the data and seeing that the “Joe Smith”s are different (precision errors).

Names have a skewed distribution: some names are more famous or common than others; some mentions contain initials and middle names, and some just a single name. Here at Basis Technology, we have had a long-running debate about how to handle single-token name mentions. Single-token mentions are tricky: they can be unambiguous when referring to famous people (e.g., Messi, Adele, Jobs), or highly ambiguous (e.g., Alex, Wilkinson). The latter usage seems strange, but it’s actually quite common in news, especially when the person mentioned is not the main topic of the article and the full name is only mentioned implicitly. For example, from a Gigaword article about George Best:

“Best was accompanied by his 29-year-old wife, Alex, when he checked into the hospital…”

The full name George Best is mentioned earlier in the article, but this is the only occurrence of his wife Alex. Very strong contextual agreement should be required before chaining this Alex to some other Alex in a different document.
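One simple way to encode that caution, shown below with purely illustrative thresholds, is to demand a much higher contextual-similarity score before merging when either mention is a single token:

    def should_merge(name_a, name_b, context_similarity):
        """Decide whether two cross-document mentions may be chained together.
        Single-token names like "Alex" require much stronger contextual agreement
        than full names; the thresholds are illustrative assumptions only."""
        single_token = len(name_a.split()) == 1 or len(name_b.split()) == 1
        threshold = 0.9 if single_token else 0.5
        return context_similarity >= threshold

    print(should_merge("Alex", "Alex Best", 0.6))            # False: needs stronger evidence
    print(should_merge("George Best", "George Best", 0.6))   # True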

Finally, an important aspect is evaluating the performance of the system, and as always this boils down to the quality of the annotated data. Annotated data with a low level of ambiguity and name variation, without single-name mentions and other difficult cases, will not reflect the actual challenges that exist in many domains. On such data it is actually hard to beat a naive system that chains together any two names that match exactly, ignoring case. This issue is nothing new or unique to CDCR, of course; in most NLP tasks the evaluation is highly data-dependent. There are a few useful English benchmarks for CDCR, such as TAC 2011 (Low-Medium Ambiguity), WePS (High Ambiguity), and ACE 2008 (Medium Ambiguity). Here at Basis Technology we also annotated over 100,000 names in Gigaword to better evaluate the accuracy of CDCR at scale.
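One commonly used way to score system output against such annotated data is B-cubed precision and recall, also introduced by Bagga and Baldwin in 1998. The sketch below computes both for toy clusterings of (invented) mention identifiers.

    def b_cubed(gold_clusters, system_clusters):
        """B-cubed precision/recall: for every mention, compare the system's cluster
        for it against the gold cluster, then average over all mentions."""
        gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
        sys_of = {m: frozenset(c) for c in system_clusters for m in c}
        mentions = list(gold_of)
        precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
        recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
        return precision, recall

    # Toy example: the system wrongly splits one "Joe Smith" entity in two,
    # so precision stays at 1.0 while recall drops.
    gold = [{"doc1:Joe Smith", "doc2:Joe Smith"}, {"doc3:Joe Smith"}]
    system = [{"doc1:Joe Smith"}, {"doc2:Joe Smith"}, {"doc3:Joe Smith"}]
    print(b_cubed(gold, system))  # (1.0, 0.666...)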

This post is just a high-level overview of coreference. As with most NLP problems, the devil is in the details…