Interview: The Future of Human Language Technology

David Murgatroyd

Our VP of Engineering, David Murgatroyd, was recently invited to provide input to a US Government effort to chart the future of Human Language Technology (HLT). Input was collected in a question-and-answer format, with questions bolded. He was asked to respond with respect to one task area: Triage, Translation Support, or Knowledge Discovery. He chose Knowledge Discovery, but his comments apply to the other two as well. If you have your own questions, please get in touch!

What approaches do you think are particularly promising for the selected goal? For example, do you see anything revolutionary on the horizon?

A promising approach to Knowledge Discovery is to allow users to work in the space of things, not strings. A user of a knowledge discovery system has traditionally crafted keyword queries or tried to make sense of extracted name strings without much help in connecting those bits of text to her mental model of the domain of interest. That mental model deals with key players rather than keywords and is where the real value of HLT is manifested.

Advances in machine learning and human-computer collaboration are enabling knowledge discovery systems that are accurate and adaptable enough to make these mental model connections without a user constantly having to refer back to the source text to double-check. And when that provenance is checked, or some other interaction occurs, systems can learn from it so that they are more accurate in the future.

I see these approaches extending earlier in the analysis process than Knowledge Discovery to Triage and later to Translation Support, both of which can benefit from being more centered on real-world entities. In fact, the lines between these three goals may continue to blur as analytic components allow for triage based on semi-automated translation or translation that includes implicit annotation for entities and relationships.

The entity-centric approach also lends itself to tackling the increasingly mixed media present in the source data: documents now contain not only tables and headers but also videos and social media quotes. A way to integrate analysis tools across these media is for each tool to deal in the realm of “things”; image software can recognize a specific kind of tank in a picture while HLT recognizes it described in the accompanying text.

What do you have an intuition about? What bright idea is nagging at you?

I continue to be concerned by the apparent number of HLT systems that take a “translation for integration” approach; that is, they translate content into English early in the life of that data for consumption by English-only systems. This exacerbates the imperfections of both translation and the downstream English-only systems. I instead advocate a “translation for presentation” architecture, where translation is delayed until the moment that content needs to be presented to an end user who cannot read it otherwise.
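
As a rough illustration of the “translation for presentation” idea, the sketch below keeps each document in its original language and translates only at display time, and only when the reader cannot read the source. Everything in it is a hypothetical stand-in rather than any product's actual API: the names Document, present, and fake_translate are invented for this example, and fake_translate merely tags the text where a real machine translation call would go.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass(frozen=True)
class Document:
    """A stored document; the text is kept in its original language."""
    text: str
    language: str  # language code, e.g. "es"


def present(
    doc: Document,
    reader_languages: Set[str],
    translate: Callable[[str, str, str], str],
    target: str = "en",
) -> str:
    """Return text for display, translating only if the reader needs it."""
    if doc.language in reader_languages:
        # The reader can read the source, so show it untouched.
        return doc.text
    # Translation happens here, at presentation time, not at ingest.
    return translate(doc.text, doc.language, target)


def fake_translate(text: str, source: str, target: str) -> str:
    """Hypothetical placeholder for a real machine translation call."""
    return f"[{source}->{target}] {text}"


doc = Document(text="hola", language="es")
print(present(doc, {"es", "en"}, fake_translate))  # shown in the original
print(present(doc, {"en"}, fake_translate))        # translated for display
```

Because the stored document is never overwritten, upstream analysis can still run on the original text, and an improved translator can be swapped in later without reprocessing the data.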

How has your work changed in the last 5 years?

The move toward entity-centric systems has heightened the need for tools that easily adapt to the domain of the problem of the moment. This is because different problem domains often have different views of what the important ‘things’ are. We increasingly use ‘field training toolkits’ to help our users adapt our systems to their domains. We try to build them so that they adapt mainly based on data naturally produced as part of existing workflows rather than through specialized annotation or rule writing.

What is/are the most significant technological and policy roadblocks that exist in your work? What impact do the roadblocks have?

Necessary policies around clearance, accreditation, and need-to-know add significant friction to the feedback loop when building commercial products targeted in part at government use cases. Broad-based feedback on the same products is often easier to obtain from the private sector because these frictions are less significant there. This means that government may have fewer opportunities to enjoy the economy of scale that can come from acquiring common products and may instead rely on more specialized solutions than might otherwise be necessary. Perhaps with the advent of private clouds within the government there will be more opportunity to observe and measure the use of products deployed in the cloud, thereby giving implicit feedback.

What would be the extreme performance breakthrough whether from emerging revolutionary technological capabilities or fundamental discoveries in science, technology or mathematics?

The use of GPUs for general-purpose processing may lead to significant speed improvements over the next several years. Improved machine learning centered around deep neural nets may lead to significant accuracy improvements.

If you had unlimited funds, what would you work on?

Probably the same things, since I must presume that the funds that are available are targeted at the most operationally important problems. Our team is primarily motivated by solving those problems and secondarily by cleanly crafting the technology that makes up those solutions.

David Murgatroyd is the VP, Engineering at Basis Technology. He joined the company in 2005. He leads the engineering team responsible for text analytics, including existing products and new technology initiatives. He has been building natural language processing systems since 1998, including in positions at Unveil Technologies, Zoesis, Wildfire Communications, and iConverse. He has a B.S. in computer science and a B.A. in computational and applied mathematics from Rice University and a Master’s degree in computer speech and language processing from Cambridge University, U.K.