Last week five other VCs and I were invited by IBM’s Venture Capital Group to a private viewing of Jeopardy’s final round where Watson was competing. Following the show, and Watson’s win, we engaged in a very interesting conversation with Eric Brown, Watson’s technical lead, and Anant Jhingran, CTO of IBM’s Information Management Division, about the potential applications of the technologies that made Watson’s win possible.
Let’s first talk about the overall system. Watson is a question-answering system that represents 100 person-years of effort by a team of 25 IBM scientists. Watson runs on a cluster of IBM Power 750 servers running Linux, with 2,880 cores and 15TB of RAM, delivering 80 teraflops of computing. Its software architecture is based on UIMA and integrates a variety of machine learning algorithms that have been used to develop approximately 3,000 predictive models. These models run in parallel every time Watson tries to answer a question. The predictive models were tuned over several months using Watson’s results from a series of 134 “sparring matches” with past Jeopardy winners. Watson’s “knowledge base” was seeded with 70GB of curated, i.e., noise-free, full-text documents and eventually grew to 500GB, consisting of additional curated documents and information derived from the seed database. The final knowledge base represented approximately 200M pages of textual content. The text data was preprocessed using Hadoop; however, Hadoop was not used while Watson was competing in Jeopardy. Watson’s knowledge base also included various sets of handcrafted rules, including rules that provide clues on what to look for and others that describe strategies for how players select clues within a category in order to find Jeopardy’s Daily Double.
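To illustrate the kind of offline text preprocessing a Hadoop job performs, here is a minimal sketch in the style of a Hadoop Streaming mapper and reducer, written in Python; the “doc_id<TAB>text” record layout and the term-counting task are assumptions for illustration, not Watson’s actual pipeline.

```python
#!/usr/bin/env python3
"""Toy Hadoop-Streaming-style preprocessing: tokenize curated documents and
count term frequencies. The "doc_id<TAB>text" record layout is a hypothetical
example, not Watson's actual corpus format."""
import re
import sys
from collections import defaultdict

def mapper(lines):
    # Emit (term, 1) for every token in every document.
    for line in lines:
        try:
            _doc_id, text = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue  # skip malformed records
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            yield term, 1

def reducer(pairs):
    # Aggregate counts per term; in a real job Hadoop performs the
    # shuffle/sort between the map and reduce stages.
    counts = defaultdict(int)
    for term, n in pairs:
        counts[term] += n
    return counts

if __name__ == "__main__":
    for term, n in sorted(reducer(mapper(sys.stdin)).items()):
        print(f"{term}\t{n}")
```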
Watson operates through a cycle of hypothesizing an answer, gathering evidence that supports each hypothesized answer, evaluating the probabilities associated with the collected evidence, and proposing a final answer. As audiences discovered over three days of play, all of this is done at “warp” speed.
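A highly simplified sketch of that cycle, in Python, looks roughly as follows; the candidate generation, evidence retrieval, and scoring functions are hypothetical stand-ins, not Watson’s actual DeepQA components.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str
    confidence: float = 0.0

def generate_candidates(question):
    # Hypothetical stand-in for hypothesis generation, which in Watson
    # draws candidate answers from its text corpus.
    return [Candidate("answer A"), Candidate("answer B")]

def gather_evidence(question, candidate):
    # Hypothetical evidence retrieval: passages that mention the candidate.
    return ["supporting passage 1", "supporting passage 2"]

def score_evidence(question, candidate, evidence):
    # Hypothetical scorers; Watson combines many such scores with trained models.
    return [0.4, 0.7]

def answer(question, threshold=0.5):
    candidates = generate_candidates(question)
    for c in candidates:
        evidence = gather_evidence(question, c)
        scores = score_evidence(question, c, evidence)
        c.confidence = sum(scores) / len(scores)  # stand-in for the learned combination
    best = max(candidates, key=lambda c: c.confidence)
    # Only "buzz in" when the confidence clears a threshold.
    return best if best.confidence >= threshold else None
```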
There are several reasons why we should care about Watson as a Big Data analytics system.
The interaction with Watson through spoken language will be particularly important for the broader use of analytic applications by business users. The use of such applications is often limited because business users are intimidated by their complex interfaces. Watson demonstrated convincingly that natural spoken-language interaction with computers is no longer science fiction or the result of Hollywood special-effects teams. As I wrote in prior posts, Watson exhibited superior natural-language-understanding skills, and this is a really big deal. It was able to address the inherent ambiguity of natural language and understand word meaning and context. Jeopardy’s clues represent much harder problem descriptions than the queries we typically pose to search engines. They use puns, double meanings and misspellings to convey meaning. While for a human this may be second nature, for a computer it is a very hard task.
Watson makes decisions and responds to questions using a knowledge base of textual, unstructured information, rather than a knowledge base of well-structured concepts and rules that has previously been digested and represented in a knowledge representation language such as CYC. In this respect Watson is significantly different from other question-answering systems such as Wolfram|Alpha. Stephen Wolfram wrote an interesting post about the different approaches taken by Alpha and Watson.
While the data sets Watson analyzed during Jeopardy were relatively small by Big Data standards, being able to quickly and effectively analyze unstructured data is representative of many big data analytics situations, where you don’t always know what data you will need to analyze, where it will come from, how large each data set will be, how clean it will be, or how long you will have to provide an answer.
Watson concurrently utilizes a large number of predictive models to analyze big data and come up with answers in real time. This is significant because it provides another important approach to analyzing big data: rather than parallelizing a single analysis algorithm and using MapReduce to apply it to a big data set, as is typically done in Hadoop/MapReduce implementations, Watson applies several different predictive and scoring algorithms concurrently. Some of these algorithms may themselves be parallelizable and thus able to take advantage of MapReduce. The application of these algorithms to text data is particularly important to IBM since the majority of its customers possess a lot of such data. Admittedly, the data analyzed by Watson was curated and clean. As mentioned above, most big data is not of such quality. IBM will need to test the system’s performance with noisier data, since corporate data is rather noisy, as well as with voice and video data. Moreover, incorporating online data will introduce additional noise, which will undoubtedly impact Watson’s performance. Watson also cannot deal with incremental additions to its knowledge base, regardless of the form these additions take, i.e., text documents or rules; such additions necessitate re-tuning the predictive models it uses.
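To make that contrast concrete, here is a minimal sketch of running several independent scoring models concurrently on the same candidate answer and merging their outputs with fixed weights; the scorers and weights are made up for illustration, stand-ins for the trained models described above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, independent scoring models, each capturing a different
# kind of evidence about a candidate answer.
def type_match_score(question, candidate):
    return 0.6

def passage_support_score(question, candidate):
    return 0.8

def popularity_score(question, candidate):
    return 0.3

SCORERS = [type_match_score, passage_support_score, popularity_score]
WEIGHTS = [0.5, 0.3, 0.2]  # stand-ins for weights a trained model would learn

def combined_confidence(question, candidate):
    # Run every scorer concurrently on the same (question, candidate) pair,
    # then merge the scores into a single confidence value.
    with ThreadPoolExecutor(max_workers=len(SCORERS)) as pool:
        scores = list(pool.map(lambda score: score(question, candidate), SCORERS))
    return sum(w * s for w, s in zip(WEIGHTS, scores))
```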
According to IBM, the majority of the software used in Watson is open source. That we can build such a sophisticated system from open source components is a feat in itself.
Of course IBM is very interested in business applications of the technologies that made Watson so successful. These application areas must involve complex language, including in the data to be analyzed; the need to reach high-precision responses, decisions, or actions from ambiguous and noisy data; and the need to provide responses, with associated confidence, in real time. Some initial application ideas we discussed following the show included:
Medical diagnosis, including telemedicine. The body of medical knowledge is large and growing very fast. Being able to work directly from text, rather than having to first represent medical knowledge in some intermediate language, as was the case with expert systems in the past, could represent a big breakthrough.
Technical support. Help desks use mostly text data to provide answers to product issues. Increasing the accuracy of these responses while improving the overall user experience is important.
Insurance claims analysis. This is another area where the captured data is in text form and a system like Watson can be used to analyze it and provide a better user experience when consumers interact with their insurance provider.
Other application areas we discussed included online advertising, air traffic control and financial portfolio creation where Watson’s real-time analytics can be the central component of an overall solution.
The business model under which to offer Watson-like technology is another issue IBM is thinking about. Should such a system be offered as a service or as on-premise software? Of course it will depend on the application. One of the ideas we discussed was to run Watson as a service and charge by the difficulty of the question asked or the importance of the answer provided. This may be something that IBM can do for medical applications, for example, in collaboration with its traditional partners Mayo Clinic and Cleveland Clinic.
The Watson team deserves all the kudos they have received: not only for their win in Jeopardy! but, more importantly, for their technical accomplishments; for the water-cooler discussions that followed each broadcast, which will hopefully get more people (particularly young people) interested in technology and engineering; and for the technical conversations they motivated among engineers and scientists about what is possible in big data analytics and human/machine interaction.
Last week I attended Strata, a conference organized by O’Reilly and devoted to big data. It was a large conference (790 attendees) whose content included both technical talks and tutorials about the new generation of big data tools, e.g., Hadoop, Cassandra, and visualization, as well as presentations on big data business applications. The diversity and size of the audience and the reported business successes provided a strong indication of how important and popular the area of big data has become.
Big data is pervasive in many of the companies Trident has funded over the last few years. We have invested in companies that generate and/or process big data, e.g., eXelate, Extole, HomeAway, Sojern, Turn, Xata, as well as companies that provide platforms for storing, managing and analyzing big data, e.g., Acteea, Host Analytics, Pivotlink. We recognize that many of the companies we invest in going forward will need to have competence in big data.
There is a big difference between big data and data warehousing, stemming primarily from the nature of the data. Data warehousing was all about analyzing transactional data captured from enterprise applications such as an ERP or POS system. In addition to the actual transactions, big data is about capturing, storing, managing and analyzing data about the behavior around transactions, i.e., what happens before and after a transaction. This has several implications. First, it means that the captured data is less structured. It is easier to analyze a collection of purchasing transactions in order to identify a pattern than to analyze a series of selections made across a set of web pages to establish a pattern of behavior. Second, it implies that meaning must be extracted from events, e.g., the browsing activity prior to buying an item. To be effective in this more open-ended, exploratory data analysis, one has to break through the data silos typically found in enterprises and bring all available data to bear. It also means that one must collect all available data rather than trying to decide a priori which data to collect and keep.
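As a toy illustration of extracting meaning from events rather than from transactions alone, here is a short Python sketch, with a made-up clickstream format, that reconstructs the pages a user browsed before each purchase.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, timestamp, event_type, detail).
events = [
    ("u1", 1, "view", "/home"),
    ("u1", 2, "view", "/product/42"),
    ("u1", 3, "purchase", "42"),
    ("u2", 1, "view", "/search?q=shoes"),
    ("u2", 2, "view", "/product/7"),
]

def paths_before_purchase(events):
    """For each purchase, return the pages the user viewed before buying."""
    by_user = defaultdict(list)
    for user, ts, etype, detail in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((etype, detail))

    paths = []
    for user, stream in by_user.items():
        trail = []
        for etype, detail in stream:
            if etype == "view":
                trail.append(detail)
            elif etype == "purchase":
                paths.append((user, detail, list(trail)))
                trail = []
    return paths

print(paths_before_purchase(events))
# -> [('u1', '42', ['/home', '/product/42'])]
```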
Data science is becoming a field in its own right. Big data is eliminating the segregation between the people who manage the data, the people who analyze the data, and the people who present/visualize the data. A good data scientist must be able to do all three, though, as I wrote last week, translating business requirements into a data problem and translating the resulting insights into business actions and value remain largely missing skills among data scientists. Good data scientists are in high demand, as indicated by the jobs being advertised at the conference and as reported there by LinkedIn. They are expected to play a significant role in how their companies evolve. That’s not something we were used to hearing about data analysts, who were always considered fixtures of the back office. I know because I started my career in data analysis.
Corporations have a lot to learn about big data from consumer-oriented companies that generate, manage and analyze big data, e.g., Amazon, eBay, Facebook, Twitter, and LinkedIn, to name a few. This is a reversal of sorts. In the mid-90s, when I was with IBM, I ran an organization devoted to building data warehouses and providing analytical tools and services to Global 1000 companies. At that time various companies, including many of the then-nascent Internet companies, were trying to learn from the data warehousing and business intelligence practices of Walmart, Citibank, and First Data. Today such companies would do well to understand and apply the big data techniques being developed by many internet and social media companies. One big difference is how such companies approach data stores. Traditional businesses see the enterprise data warehouse as storing the “single version of truth” about their data. Big data stores, by contrast, are viewed as containing multiple perspectives; their contents must be analyzed with the right set of tools in order to gain a perspective on the problem at hand.
Talking to the conference’s attendees I got the impression that more companies than ever before are starting to view data as an invaluable asset and a potential key to their success. They are no longer intimidated by data volumes and are using the new generation of big data management and analysis tools to bring more data under their control.
Strata was a great conference that brought under one roof the leaders in big data thinking and doing. It also showed that, though increasingly important, this is still a small community; in many respects its overall size has not changed since the time I was an analyst myself. We all need to find ways to accelerate the education of new data scientists and their introduction to the market. The ability of many companies to continuously innovate, become leaders, and remain in that position could largely depend on their ability to recruit data scientists who can effectively exploit their big data assets.