An insider critique of the Jeopardy! QA System

Wed, 27 Jul 2022 15:43:52 -0700

Tags: retrospective, business, technical


Last year marked the 10-year anniversary of the IBM/Jeopardy! show, and this year marks 12 years since I left the company and immigrated to Canada.

Now that the dust has clearly settled on the show, it is a good time to share my thoughts on the matter. The show gathered a lot of press when it aired and has faced continuous academic backlash ever since. Here is a piece from the perspective of an individual contributor to the project.

This post echoes some of the criticism but highlights many positive lessons that might have been overlooked. Stick around to the end if you want the full perspective.

Some history

I joined IBM Research in 2005 as a research scientist working on Question Answering (QA) in a research group managed by Dr. Chu-Carroll. We were 4 people, and the other researchers had been working on the group's QA system (PiQuAnt) for more than five years, some for more than a decade. When the opportunity to build a Grand Challenge system centered on QA came around, our second-line manager (Dr. Ferrucci) championed the idea with the executives, and we joined forces with a sibling group (managed by Dr. Brown) to make it happen.

In the beginning, I was very skeptical of the Grand Challenge. First, I personally never liked trivia, and I found solving trivia questions a silly use of our collective efforts. Entertainment is a big business, but I was fresh out of my PhD and wanted to work on more substantive contributions to the world. Were we to become some sort of clowns? At the time, I pushed for an alternative concept: instead of answering questions, generate the game automatically. That resulted in an (overly general and ultimately abandoned) patent application for question generation, which ended up being my most cited publication.

Even though I was critical of the project (more like a PITA, really), I worked on it day and night (as did the rest of the team) and we managed to push through. Interestingly, once the architectural and ML issues were solved, most of the remaining problems on the way to the finish line were completely outside the realm of research and the control of research contributors. I'm not privy to the negotiations between the two large corporations involved (Sony and IBM), but it was a massive undertaking, and I credit Dr. Ferrucci for his energy and efforts to make it happen.

To understand the stakes, it is important to realize what failure might have looked like: a show that was boring or felt like an advertisement would tarnish the Sony/Jeopardy! brands, while calling for a Grand Challenge with a truly subpar system would in turn tarnish the IBM brand.

While we started with a ten-person team, as the quality of the answers kept improving, the buy-in from the executives grew, and the number of people assigned to the project grew with it, particularly with scientists from other IBM Research labs. My estimate[1] is that the project took about 80 person-years (PYs) in total to complete.

The last year of the project was profoundly difficult for me at a personal level. I had by then decided my future was not in the US and wanted to immigrate to Canada. My wife was already away doing her PhD at McGill University. There was also a lingering realization on my part that the technology put together for the show was too tailored to the show itself and did not hold enough promise (to me) of revolutionizing AI applications to justify staying physically separated from my wife. So, as soon as development concluded in August 2010, I packed what was left of our apartment and moved to the True North strong and free. I did not attend the show in person, but I watched it at the Montreal hackerspace with newfound friends.

Criticisms

Now, for the criticism of the system and some of its (potentially overlooked) breakthroughs:

  • By far, the most popular criticism is that it was a one-trick pony that didn't help answer AI research questions.

    I can agree with that claim; as I mentioned, it was a big part of why I left when I left and why I opposed the project at the beginning. However, in 1996, the Communications of the ACM ran a special issue on Natural Language Processing (NLP). The editorial article in that issue (by Yorick Wilks!) noted that it was unclear whether more work and engineering ("bags of tricks") would be enough to solve NLP problems, or whether we were waiting for a scientific breakthrough, potentially from outside the NLP realm. That is a research question, and the fact that DeepQA (the name given to the IBM Jeopardy! system) succeeded with tons and tons of great engineering and every trick in the book is a testament that, for this particular problem, the answer is "yes, engineering is enough." 80 PYs is a ton of money, but it is peanuts compared with other large engineering endeavors (a hydropower plant? the LHC?).

    People in academia focus on solving problems with very little personnel and as much leverage from data, models and hardware as possible. That is not the only way. Humans have brains, and if you give them resources, they will buy enough glucose to keep those brains running and work wonders.

  • Another criticism people in academia have told me to my face: "if I had the resources of IBM, I would have done something better {insert brilliant idea}."

    This way of thinking conflates academic funding (which is earmarked for a particular project) with the intricacies of corporate funding. There are no "resources of IBM": a corporation is not a monolith but a collection of departments and divisions, each with its own budget-allocated resources and goals. Rallying so many resources from across the company is by far the achievement of Dr. Ferrucci's I'm most grateful for. For that, I would call David the patron saint of intrapreneurs everywhere.

  • The game was not fair due to {a variety of reasons}.

    I do find the issue of buzzing in (an electronic signal for the computer vs. a red light for the human participants) to be a major point of contention. But how to make a "fair" game between a human and 90 POWER7 8-CPU computers[2] is a hard question. I was not involved in the game-mechanics part (I contributed to the data sources and the ML, particularly at the feature-engineering level), but I saw changes over time driven by the concept of fairness and by Sony's and IBM's ever-evolving understanding of the system. The game show itself was balanced; the humans put up a fight and could have won if the system had made more mistakes. An unfair game is not fun to watch, so I'd use the fact that the humans watching at home enjoyed it as a proxy for fairness.

    Game design is a topic I care about. Except for very specific games and very specific (grandmaster) humans, computers by now can beat any of us. Therefore, when a human plays a game against a computer, the difficulty is usually scaled dynamically to make sure the human is not too disappointed and quits. In the graduate-level game design course I took, it was said that in a car racing game, the computer-controlled pack of cars will never truly overtake the player (it will never gain a full-lap advantage in a circuit race); if the player drives slowly, the pack will slow down. Playing "against" the computer is actually being babysat by the computer. Is that fair?

  • A marketing gimmick shouldn't have gotten such great exposure.

    This is a rewording of, and similar to, the point that DeepQA did not help solve the AI question, but it is different because here the complaint is about the amount of exposure. There is a common fallacy among technical people that better ideas deserve more recognition. These are the same people who complain that OS/2 lost to MS-DOS or that Betamax tapes are no longer around.[3] Alongside the research team, there were a number of very talented communication professionals (whom I never met, so I don't know how many there were) who used pre-existing channels to shape a narrative around the show. That work is neither research nor technical work, but it is a ton of work by any measure, and it is their success researchers are seeing when they gauge the show's impact on society.

    Now, I'd argue that this small blip in the collective consciousness regarding AI technology did help the field as a whole. The larger interest in AI among students, investors and users that followed might have a tiny bit to thank the Jeopardy! show for. At a personal level, I credit the Deep Blue match with Kasparov as an inspiration for me to go into the field. As luck would have it, I later worked with some of the original contributors to Deep Blue. It is a source of personal satisfaction to think that somebody else on the planet watched the show and decided to go into AI thanks to our work.
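As an aside, the racing-game rubber-banding described in the fairness point can be sketched in a few lines. This is my own toy illustration, not code from any actual game, and the `softness` parameter is invented: the AI car's target speed shrinks as its lead over the player grows, so it never pulls far ahead.

```python
# Toy "rubber-band" difficulty scaling (my illustration, not from any
# real game engine): scale the AI's speed down as its lead grows.
def ai_target_speed(base_speed, ai_position, player_position, softness=100.0):
    gap = ai_position - player_position  # positive when the AI leads
    if gap <= 0:
        return base_speed  # behind or even with the player: full speed
    # Ahead of the player: smoothly reduce speed as the gap grows,
    # so the AI slows down whenever the player falls behind.
    return base_speed / (1.0 + gap / softness)

print(ai_target_speed(50.0, 0.0, 0.0))    # even with the player: 50.0
print(ai_target_speed(50.0, 100.0, 0.0))  # AI leads by 100 units: 25.0
```

With this kind of rule, the computer opponent's strength is a function of the player's performance, which is exactly why "playing against the computer" is really being babysat by it.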

Some overlooked breakthroughs

Now, here are some of the things I loved about that system that I think are easily overlooked due to the above:

  • DeepQA was the peak of feature-based, statistical ML. Now that the field has been overtaken by deep learning approaches, the system stands as a testament to what "the other approaches" can achieve.

  • Tons of human ingenuity went into the system. Great software technology allowed us to channel a great deal of concurrent human work. Building such a complex system with people on different continents was made possible by the underlying framework, UIMA, which IBM made open source and which now lives at the Apache Foundation. Nothing I have seen in the open source NLP arena comes close to the level of interoperability UIMA provides. Most systems are a solution to their own problems; UIMA solves the problem of using multiple systems together, and that was key to DeepQA.

  • We built an end-to-end, no-early-commitment system long before that was the norm in NLP. These days it is clear that end-to-end approaches are key to NLP success; we arrived at that realization much earlier than the rest of the field. The traditional way of doing QA back then was to assign an answer type to the question and work from there, so any error in answer-type detection would doom the system early on. Some of my early internal experiments on system combination informed the team about this, and the architecture we put together reflected it. A feature-based, statistical-ML system without early commitment is rare to see even today.

  • If you look at the published diagram of the ML architecture, you will see multiple layers of logistic regression. We basically built a deep network by hand, fine-tuning features and doing transfer learning for some of them. That is why I say we were the peak of feature-based systems.

    Figure 1. DeepQA ML, from [Gondek et al, 2012]


  • Each feature was carefully engineered based on a human intuition about its utility. Those intuitions were built in many, many group-wide, multi-day error analysis sessions that were dull, difficult and never-ending. Sitting through them was not fun, and management pushed for them because they worked. We collectively built intuitions about the problem and transferred them into beautiful little annotators in the UIMA framework. I feel the field spends too much time on hyperparameter tuning and too little on error analysis, a point I argue strongly in my book, based on my learnings from the project.
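To make the last few bullets concrete, here is a toy sketch of the general shape. This is my own illustration with synthetic data and invented feature names, not IBM's code: candidate answers are scored by per-evidence logistic-regression models, their confidences feed a second logistic-regression layer (the hand-built "deep network"), and every candidate is ranked with no early commitment to an answer type.

```python
# Toy sketch of stacked logistic-regression answer ranking
# (my illustration, not DeepQA's actual code; features are invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data: each row is one candidate answer, each
# column an invented engineered feature (e.g. type match, passage
# support); y marks whether the candidate was the correct answer.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Layer 1: one logistic-regression model per evidence feature.
layer1 = [LogisticRegression().fit(X[:, [i]], y) for i in range(X.shape[1])]

def layer1_confidences(feats):
    """Stack the per-evidence confidences into a new feature matrix."""
    return np.column_stack(
        [m.predict_proba(feats[:, [i]])[:, 1] for i, m in enumerate(layer1)]
    )

# Layer 2: a final logistic regression over the layer-1 confidences,
# i.e. a small "deep network" assembled by hand.
layer2 = LogisticRegression().fit(layer1_confidences(X), y)

def rank_candidates(feats):
    """Score every candidate answer; there is no early commitment to
    an answer type -- the best overall confidence simply wins."""
    return layer2.predict_proba(layer1_confidences(feats))[:, 1]

candidates = rng.normal(size=(5, 3))  # five hypothetical candidate answers
scores = rank_candidates(candidates)
best = int(np.argmax(scores))  # index of the top-ranked candidate
```

The point of the stacking is that no single evidence source gets to veto a candidate early; weak evidence in one channel can be compensated by strong evidence in another, which is the opposite of the answer-type-first pipelines that were standard at the time.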

So that is it. Do not let the inspiring aspects of the project be drowned out by the naysayers. DeepQA is a great success story for intrapreneurs and tech innovators worldwide. And as with all great stories, not everybody is bound to love it.



[1] This might be off by a factor of 2; it is just a guesstimate, as I was not involved in any management work.

[2] That is 2,880 hardware threads with 16 TB of RAM, still impressive even a decade later.

[3] And I say this while typing this blog post in DocBook XML using Emacs on a Debian GNU/Linux desktop.