Last week five other VCs and I were invited by IBM’s Venture Capital Group to a private viewing of Jeopardy’s final round where Watson was competing. Following the show, and Watson’s win, we engaged in a very interesting conversation with Eric Brown, Watson’s technical lead, and Anant Jhingran, CTO of IBM’s Information Management Division, about the potential applications of the technologies that made Watson’s win possible.
Let’s first talk about the overall system. Watson is a question-answering system that represents 100 person-years of effort by a team of 25 IBM scientists. It runs on a cluster of 90 IBM Power 750 servers running Linux, with 2,880 processor cores and 15TB of RAM, delivering 80 teraflops of computing power. Its software architecture is based on UIMA and integrates a variety of machine learning algorithms that were used to develop approximately 3,000 predictive models. These models run in parallel every time Watson attempts to answer a question. The predictive models were tuned over several months using Watson’s results from a series of 134 “sparring matches” with past Jeopardy winners. Watson’s “knowledge base” was seeded with 70GB of curated, i.e., noise-free, full-text documents and eventually grew to 500GB, consisting of additional curated documents and information derived from the seed database. The final knowledge base represented approximately 200M pages of textual content. The text data was preprocessed using Hadoop; however, Hadoop was not used while Watson was competing on Jeopardy. Watson’s knowledge base also included various sets of handcrafted rules, including rules that provide clues on what to look for and others that describe strategies on how players select within a category so that they can find Jeopardy’s Daily Double.
Watson operates through a cycle of hypothesizing answers, gathering evidence that supports each hypothesized answer, evaluating the probabilities associated with the collected evidence, and proposing the final answer. As audiences discovered over three days of play, all this is done at “warp” speed.
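To make that cycle concrete, here is a toy sketch in Python. Everything in it is a hypothetical stand-in, not Watson’s actual code: DeepQA uses many candidate generators and thousands of learned evidence scorers where this sketch hard-codes one of each.

```python
# Toy sketch of Watson's hypothesize / gather-evidence / answer cycle.
# All functions here are hypothetical stand-ins for DeepQA components.

def generate_candidates(clue):
    # In Watson, many search strategies propose candidate answers in parallel.
    return ["answer A", "answer B", "answer C"]

def score_evidence(clue, candidate):
    # Watson scores each (clue, candidate) pair with many evidence models;
    # here we fake a single confidence score per candidate.
    return {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}[candidate]

def answer(clue, threshold=0.6):
    # Evaluate every hypothesis, then answer only if confident enough --
    # much like deciding whether to buzz in on Jeopardy.
    scored = {c: score_evidence(clue, c) for c in generate_candidates(clue)}
    best, confidence = max(scored.items(), key=lambda kv: kv[1])
    return (best, confidence) if confidence >= threshold else (None, confidence)

print(answer("This machine won 'Jeopardy!' in 2011"))
```

The key structural point the sketch preserves is that the system never commits to a single answer early; it carries multiple hypotheses forward and lets the evidence, plus a confidence threshold, decide whether to respond at all.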
There are several reasons why we should care about Watson as a Big Data analytics system.
Interaction with Watson through natural language will be particularly important for the broader use of analytic applications by business users, whose adoption of such applications is often limited by intimidating, complex interfaces. Watson demonstrated convincingly that natural language interaction with computers is no longer science fiction or the product of Hollywood special effects teams. As I wrote in prior posts, Watson exhibited superior natural-language-understanding skills, and this is a really big deal. It was able to address the inherent ambiguity of natural language and understand word meaning and context. Jeopardy’s clues represent much harder problem descriptions than the queries we typically pose to search engines: they use puns, double meanings and misspellings to convey meaning. While for a human this may be second nature, for a computer it is a very hard task.
Watson makes decisions and responds to questions by utilizing a knowledge base of textual, unstructured information, rather than a knowledge base of well-structured concepts and rules that has previously been digested and represented in a knowledge-representation language, as in systems like Cyc. In this respect Watson is significantly different from other question-answering systems such as Wolfram|Alpha. Stephen Wolfram wrote an interesting post regarding the different approaches taken by Alpha and Watson.
While the data sets Watson was analyzing during Jeopardy were relatively small by Big Data standards, being able to quickly and effectively analyze unstructured data is representative of many Big Data analytics situations, where you don’t always know what data you will need to analyze, where it will come from, how large each data set will be, how clean it will be, and how long you will have to provide an answer.
Watson concurrently utilizes a large number of predictive models to analyze big data and come up with answers in real time. This is significant because it provides another important approach to analyzing big data: rather than parallelizing a single analysis algorithm and then using MapReduce to apply it to a big data set, as is typically done in various Hadoop/MapReduce implementations, Watson applies several different predictive and scoring algorithms concurrently. Some of these algorithms may themselves be parallelizable and thus able to take advantage of MapReduce. The application of these algorithms to text data is particularly important to IBM, since the majority of its customers possess a lot of such data. Admittedly, the data analyzed by Watson was curated and clean. As mentioned above, the majority of big data is not of such quality. IBM will need to test the system’s performance with noisier data, since corporate data is rather noisy, as well as with voice and video data. Moreover, incorporating online data will introduce additional noise, which will undoubtedly impact Watson’s performance. Watson also cannot deal with incremental additions to its knowledge base, regardless of the form these additions take, i.e., text documents or rules; such additions necessitate re-tuning the predictive models.
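The “many models run concurrently, then their scores are merged” pattern can be sketched as follows. The three scorers and the simple averaging are my own illustrative assumptions; Watson instead learns how to weight thousands of such evidence features.

```python
# Hypothetical sketch: run several independent scoring models concurrently
# on the same candidate answer and merge their outputs, in the spirit of
# Watson's parallel evidence scoring. The scorers below are made up.
from concurrent.futures import ThreadPoolExecutor

def type_coercion_score(candidate):
    # Does the candidate match the answer type the clue asks for?
    return 0.8

def passage_support_score(candidate):
    # Is the candidate supported by passages retrieved from the corpus?
    return 0.6

def popularity_score(candidate):
    # How prominent is the candidate in the knowledge base?
    return 0.7

SCORERS = [type_coercion_score, passage_support_score, popularity_score]

def merged_confidence(candidate):
    # Run all scorers in parallel and merge their scores. A plain average
    # stands in for Watson's learned combination of model outputs.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda scorer: scorer(candidate), SCORERS))
    return sum(scores) / len(scores)

print(round(merged_confidence("Toronto"), 2))
```

Note the contrast with the MapReduce pattern described above: here the parallelism is across independent models applied to the same item, not across partitions of a single data set.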
According to IBM, the majority of the software used in Watson is open source. That we can build such a sophisticated system from open source components is a feat in itself.
Of course IBM is very interested in business applications of the technologies that made Watson so successful. Promising application areas involve complex language, both in the questions posed and in the data to be analyzed; the need to reach high-precision responses, decisions, or actions from ambiguous and noisy data; and the need to provide answers, with associated confidence, in real time. Some initial application ideas we discussed following the show included:
Medical diagnosis, including telemedicine. The body of medical knowledge is already large and growing very fast. Being able to work directly from text, rather than having to first represent medical knowledge in some intermediate language, as was the case with expert systems in the past, could represent a big breakthrough.
Technical support. Help desks use mostly text data to provide answers to product issues. Increasing the accuracy of these responses while improving the overall user experience is important.
Insurance claims analysis. This is another area where the captured data is in text form and a system like Watson can be used to analyze it and provide a better user experience when consumers interact with their insurance provider.
Other application areas we discussed included online advertising, air traffic control and financial portfolio creation where Watson’s real-time analytics can be the central component of an overall solution.
The business model under which to offer Watson-like technology is another issue IBM is thinking about. Should such a system be offered as a service or as on-premises software? Of course it will depend on the application. One of the ideas we discussed was to run Watson as a service and charge by the difficulty of the question asked or the importance of the answer provided. This may be something that IBM can do for medical applications, for example, in collaboration with its traditional partners the Mayo Clinic and the Cleveland Clinic.
The Watson team deserves all the kudos they have received: not only for their win in Jeopardy! but, more importantly, for their technical accomplishments, for the water-cooler discussions the day after each broadcast, which will hopefully make more people (particularly young people) interested in technology and engineering, and for the technical conversations they motivated among engineers and scientists about what is possible in big data analytics and human/machine interaction.