In my last blog I tried to define the concept of insight. In this post I discuss insight generation.
Insights are generated by systematically and exhaustively examining a) the output of the various analytic models (predictive, benchmarking, outlier-detection, etc.) generated from a body of data, and b) the content and structure of the models themselves. Insight generation takes place together with model generation, but it is separate from the decisioning process, during which the generated models, as well as the insights and their associated action plans, are applied to new data.
Insight generation depends on our ability to a) collect, organize and
retain data, b) generate a variety of analytic models from that data,
and c) analyze the generated models themselves. Therefore, in order to
generate insights, we must have the ability to generate models. And in
order to do that we must have data. Insights can be generated from the collected data, from data derived from it, and from the metadata of the collected data. This means that we need to think not only about the data collection, management and archiving processes, but also about how to post-process the collected data: what attributes to derive and what metadata to collect.
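To make this concrete, here is a minimal sketch, in Python, of what such post-processing might look like for a hypothetical set of purchase records; the column names, derived attributes and metadata fields are illustrative assumptions, not a prescription.

```python
# A minimal, hypothetical sketch of post-processing collected data:
# deriving new attributes and recording metadata alongside the raw records.
import pandas as pd

# Collected data: hypothetical raw purchase records.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "purchase_ts": pd.to_datetime(
        ["2013-01-05", "2013-02-11", "2013-01-20", "2013-01-22", "2013-03-02"]),
    "amount": [20.0, 35.0, 12.5, 80.0, 15.0],
})

# Derived data: attributes computed from the collected data, one row per customer.
derived = raw.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    purchase_count=("amount", "size"),
    days_since_last=("purchase_ts",
                     lambda ts: (pd.Timestamp("2013-03-31") - ts.max()).days),
)

# Metadata: information about the collection itself, not about individual records.
metadata = {
    "source": "point-of-sale feed",  # hypothetical source name
    "collected_rows": len(raw),
    "collection_window": (raw["purchase_ts"].min(), raw["purchase_ts"].max()),
}

print(derived)
print(metadata)
```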
In some cases data is collected by conducting reproducible
experiments or simulations (synthetic data). In other cases there is
only one shot at collecting a particular data set. Regardless, insight
generation is highly dependent on how an environment is "instrumented."
For example, consumer marketers have gone from measuring a few attributes per consumer (think of the early consumer panels run by companies such as Nielsen) to measuring thousands of attributes, including consumer web behavior and, most recently, consumer interactions in social networks. The "right" instrumentation is not always immediately obvious, i.e., it is not obvious which of the data that could be captured actually needs to be captured.
Oftentimes, it may not even be immediately possible to capture
particular types of data. For example, it took some time after the advent of the web before browsing activity could be captured through cookies. But obviously, the better the instrumentation, the better the analytic models, and thus the higher the likelihood that insights can be
generated. Knowing how to instrument an environment and ultimately how
to use the instrumentation to measure and gather data can be thought of
as an experiment-design process and frequently requires domain
knowledge.
Insight generation also involves the ability to organize murky data,
which is typically the situation with environments involving big data,
and focus on the data that makes "sense," given a specific context and
state of domain knowledge. Focusing on specific data in a particular context doesn't mean that the rest of the collected data is unimportant.
It's just that one cannot make sense of it at that point in time.
It is important to not only collect and organize data, but also to
properly archive it, since insight generation may only become possible
when a body of archived data is combined with a set of newly collected
data under a particular context. The combination of archived and new data may also lead to insights beyond those generated in the past. As the body of domain knowledge increases and new data is collected, it may become possible to extract new insights even from data collected in the past. Having inexpensive and scalable big data infrastructures is what enables this capability.
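As a toy illustration of this point (entirely synthetic data and a made-up effect size), a relation that is too weak to detect in a small, newly collected data set can become detectable once that data is combined with a larger archived body of data:

```python
# Hypothetical sketch: a weak relation may only surface once archived data
# is combined with newly collected data. All data here is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sample(n):
    # Hypothetical attribute x with a weak effect on outcome y.
    x = rng.normal(size=n)
    y = 0.1 * x + rng.normal(size=n)
    return x, y

x_new, y_new = sample(200)      # newly collected data
x_old, y_old = sample(2000)     # archived data

# Correlation on the new data alone vs. on the combined body of data.
r_new, p_new = stats.pearsonr(x_new, y_new)
r_all, p_all = stats.pearsonr(np.concatenate([x_old, x_new]),
                              np.concatenate([y_old, y_new]))
print(f"new data only:  r={r_new:+.3f}, p={p_new:.3f}")
print(f"archived + new: r={r_all:+.3f}, p={p_all:.3f}")
```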
Insight generation is serendipitous in nature. For this reason,
insights are more likely to be generated from the examination of several
analytic models that have been created from the same body of data
because each model-creation approach considers different characteristics
of the data to identify relations. We maintain that model analysis,
and therefore insight generation, is facilitated when models can be
expressed declaratively. A good example of the advocated approach is IBM's Watson system, which uses ensemble learning to create many expert analytic models, each providing a different perspective on a specific topic. Watson's ensemble-learning approach draws on techniques such as optimization, outlier identification and analysis, and benchmarking in the process of trying to generate insights.
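To illustrate the general idea (this is not Watson's implementation), the sketch below fits a few different model families to the same hypothetical body of data and then examines the content of each model, its coefficients, split importances and outlier flags, rather than only its predictions; the attribute names and data are made up.

```python
# A minimal sketch: create several analytic models from the same body of data
# and inspect each model's content, not just its output. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # hypothetical consumer attributes
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)
features = ["recency", "frequency", "web_visits", "social_mentions"]

# Model 1: a linear model -- its coefficients are a declarative summary of relations.
linear = LogisticRegression().fit(X, y)
for name, coef in zip(features, linear.coef_[0]):
    print(f"linear relation:  {name} -> {coef:+.2f}")

# Model 2: a shallow tree -- its split rules expose structure the linear model hides.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
for name, imp in zip(features, tree.feature_importances_):
    print(f"tree importance:  {name} -> {imp:.2f}")

# Model 3: an outlier detector -- flags records no supervised model would highlight.
outliers = IsolationForest(random_state=0).fit_predict(X)
print(f"flagged outliers: {(outliers == -1).sum()} of {len(X)} records")

# Examining all three views of the same data together is where an analyst
# (or, eventually, an automated system) looks for candidate insights.
```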
While we are able to describe data collection and model creation in
quite detailed ways, and have been able to largely automate them, this
is still not the case with insight generation. This is in fact the most compelling reason for offering insight as a service: we have not been able to broadly automate the generation of insights. What we characterize as insight today has to be generated manually through the analysis of each analytic model derived from a body of data, even though there is academic research starting to point to approaches for the automatic generation of insights. The analysis of
the derived analytic models will enable us to understand which of the
relations comprising a model are simply correlations supported by
the analyzed data set (but don't constitute insights because they don't
satisfy the other characteristics an insight must possess), and which
are actually meaningful, important and satisfy all the characteristics
we outlined before.
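As a rough illustration of one such screening step (hypothetical data; statistical stability is only one of the characteristics an insight must satisfy, alongside being meaningful, important and actionable), the sketch below checks whether the relations encoded in a model's coefficients remain stable across bootstrap resamples of the analyzed data set:

```python
# Hypothetical sketch: a relation (here, a model coefficient) that does not
# hold up across resamples of the data is more likely a correlation peculiar
# to the analyzed data set than a candidate insight.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
y = 1.0 * X[:, 0] + rng.normal(size=n)  # only attribute 0 truly drives the outcome

# Refit the model on 200 bootstrap resamples and collect the coefficients.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

for j, name in enumerate(["attr_0", "attr_1", "attr_2"]):
    low, high = np.percentile(coefs[:, j], [2.5, 97.5])
    verdict = "stable relation" if low * high > 0 else "likely just noise"
    print(f"{name}: coefficient interval [{low:+.2f}, {high:+.2f}] -> {verdict}")
```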
As I mentioned, in most cases today, utilizing insights that are generated manually by experts and offered in the form of a service may be the only way organizations can fully benefit from the big
data they collect. The best examples are companies like FICO, Exelate,
Opera Solutions, Gaininsight and a few others. However, there are
additional advantages to offering insights as a service:
Certain types of insights, e.g., benchmarking, can only be offered as a
service because the provider needs to compare data from a variety of
organizations being benchmarked.
Offering insight as a service could lower the overall cost of
generating and reasoning over insights. This means that even
organizations that can generate insights on their own may ultimately
decide to outsource the insight generation and reasoning processes
because specialized organizations may be able to perform them more
efficiently and cost effectively.
Offering insight as a service enables organizations to benefit from
the expertise the insight generator develops by offering insights to
multiple organizations of the same type. For example, FICO has now
developed tremendous credit insight expertise which no single financial
services organization can replicate.
I wanted to close by making the following point: I have argued that
for an insight to be valid it must have an action associated with it.
This action is applied during a decisioning process. The
characteristics of a particular decisioning process will also need to be
taken into consideration during the insight/action-generation process
because the time (and maybe even other costs) allocated to apply a
particular action during the decisioning process is very important.
Watson's Jeopardy play provided a great illustration of this point, as
the system had a limited amount of time to come up with the correct
response to beat its opponents. Below I provide an initial, rudimentary illustration of the time it takes to apply specific actions in particular domains.
We
are starting to make progress in understanding the difference between
patterns and correlations derived from a data set and insights. This is
becoming particularly important as we deal more frequently with big data, but also because we need to use insights to gain a competitive advantage. Offering manual insight-generation services provides us with some short-term reprieve, but ultimately we need to develop automated systems, because the data is getting bigger and our ability to act on it is not improving proportionately.