Infosys Labs Briefings
VOL 11 NO 1 2013
BIG DATA: CHALLENGES AND OPPORTUNITIES

“At Infosys Labs, we constantly look for opportunities to leverage technology while creating and implementing innovative business solutions for our clients. As part of this quest, we develop engineering methodologies that help Infosys implement these solutions right, first time and every time.”
Subu Goparaju
Senior Vice President and Head of Infosys Labs
Infosys Labs Briefings Advisory Board
Anindya Sircar PhD, Associate Vice President & Head - IP Cell
Gaurav Rastogi, Vice President, Head - Learning Services
Kochikar V P PhD, Associate Vice President, Education & Research Unit
Raj Joshi, Managing Director, Infosys Consulting Inc.
Ranganath M, Vice President & Chief Risk Officer
Simon Towers PhD, Associate Vice President and Head - Center of Innovation for Tomorrow's Enterprise, Infosys Labs
Subu Goparaju, Senior Vice President & Head - Infosys Labs

Big Data: Countering Tomorrow's Challenges

Big data was the watchword of the year 2012. Even before one could understand what it really meant, it began getting tossed about in huge doses in almost every other analyst report. Today, the World Wide Web hosts upwards of 800 million webpages, each trying either to educate or to build a perspective on the concept of Big data. Technology enthusiasts believe that Big data is 'the' next big thing after cloud. Big data is of late being adopted across industries with great fervor. In this issue we explore what the Big data revolution is and how it will likely help enterprises reinvent themselves.

As citizens of this digital world, we generate more than 200 exabytes of information each year, the equivalent of 20 million Libraries of Congress. According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook logins, 204 million email exchanges and more than 2 million search queries fired. At the scale at which data is getting churned, processing it is beyond human capability, and hence there is a need for machine processing of information. There is no dearth of data for today's enterprises. On the contrary, they are mired in data, and quite deeply at that. Today, therefore, the focus is on the discovery, integration, exploitation and analysis of this overwhelming information. Big data may be construed as the technological intervention to undertake this challenge.

Big data systems are expected to help in the analysis of structured and unstructured data, and are hence drawing huge investments. Analysts have estimated that enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, large-scale storage and search technologies. Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In this first issue of 2013, we bring you papers that discuss how Big data analytics can make a significant impact on several industry verticals, like medical, retail and IT, and how enterprises can harness the value of Big data. As always, do let us know your feedback about the issue.

Happy Reading,
Yogesh Dandawate
Deputy Editor

Authors featured in this issue
Aaditya Prakash, Senior Systems Engineer, FNSP unit, Infosys
Abhishek Kumar Sinha, Senior Associate Consultant, FSI business unit, Infosys
Ajay Sadhu, Software Engineer, Big data practice, Cloud unit, Infosys
Anil Radhakrishnan, Senior Associate Consultant, FSI business unit, Infosys
Bill Peer, Principal Technology Architect, Infosys Labs
Gautham Vemuganti, Senior Technology Architect, Corp PPS unit, Infosys
Kiran Kalmadi, Lead Consultant, FSI business unit, Infosys
Mahesh Gudipati, Project Manager, FSI business unit, Infosys
Naju D Mohan, Delivery Manager, RCL business unit, Infosys
Narayanan Chathanur, Senior Technology Architect, Consulting and Systems Integration, FSI business unit, Infosys
Naveen Kumar Gajja, Technical Architect, FSI business unit, Infosys
Perumal Babu, Senior Technology Architect, RCL business unit, Infosys
Prakash Rajbhoj, Principal Technology Architect, Consulting and Systems Integration, Retail, CPG, Logistics and Life Sciences business unit, Infosys
Prasanna Rajaraman, Senior Project Manager, RCL business unit, Infosys
Saravanan Balaraj, Senior Associate Consultant, Retail & Logistics Consulting Group, Infosys
Shanthi Rao, Group Project Manager, FSI business unit, Infosys
Sudheeshchandran Narayanan, Senior Technology Architect, Big data practice, Cloud unit, Infosys
Zhong Li PhD, Principal Architect, Consulting and System Integration unit, Infosys
Opinion: Metadata Management in Big Data
By Gautham Vemuganti
Any enterprise that is in the process of, or considering, a Big data applications deployment has to address the metadata management problem. The author proposes a metadata management framework to realize Big data analytics.

Trend: Optimization Model for Improving Supply Chain Visibility
By Saravanan Balaraj
The paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

Discussion: Retail Industry – Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu
Big data analysis of customers' preferences can help retailers gain a significant competitive advantage, suggest the authors.

Perspective: Harness Big Data Value and Empower Customer Experience Transformation
By Zhong Li PhD
Always-on digital customers continuously create more data of various types. Enterprises are analyzing this heterogeneous data to understand customer behavior, spend and social media patterns.

Framework: Liquidity Risk Management and Big Data: A New Challenge for Banks
By Abhishek Kumar Sinha
Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). The author proposes an iterative framework for effective liquidity risk management.

Model: Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor
By Anil Radhakrishnan and Kiran Kalmadi
The authors describe how Big data analytics can play a significant role in the early detection and diagnosis of fatal diseases, reduction in health care costs and improvement in the quality of health care administration.

Approach: Big Data Powered Extreme Content Hub
By Sudheeshchandran Narayanan and Ajay Sadhu
With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. This paper talks about the need for an Extreme Content Hub to tame the Big data explosion.

Insight: Complex Events Processing: Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
Complex Event Processing along with in-memory data grid technologies can help in pattern detection, matching, analysis, processing and split-second decision making in Big data scenarios, opine the authors.

Practitioner's Perspective: Big Data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
This paper suggests the need for a robust testing approach to validate Big data systems and identify possible defects early in the implementation life cycle.

Research: Nature Inspired Visualization of Unstructured Big Data
By Aaditya Prakash
Classical visualization methods are falling short in accurately representing multidimensional and ever-growing Big data. Taking inspiration from nature, the author proposes a spider-cobweb visualization technique for Big data.
“Robust testing approach needs to be defined for validating structured and unstructured data to identify possible defects early in the implementation life cycle.”
Naju D. Mohan, Delivery Manager, RCL Business Unit, Infosys Ltd.

“Big Data augmented with Complex Event Processing capabilities can provide solutions in utilizing memory data grids for analyzing trends, patterns and events in real time.”
Bill Peer, Principal Technology Architect, Infosys Labs, Infosys Ltd.
Metadata Management in Big Data
By Gautham Vemuganti
Big data analytics must reckon with the importance and criticality of metadata
Big data, true to its name, deals with large volumes of data characterized by volume, variety and velocity. Any enterprise that is in the process of, or considering, a Big data application deployment has to address the metadata management problem. Traditionally, much of the data that business users use is structured. This, however, is changing with the exponential growth of data, or Big data.

Metadata defining this data, however, is spread across the enterprise in spreadsheets, databases, applications and even in people's minds (the so-called "tribal knowledge"). Most enterprises do not have a formal metadata management process in place because of the misconception that it is an Information Technology (IT) imperative and does not have an impact on the business.

However, the converse is true. It has been proven that a robust metadata management process is not only necessary but required for successful information management. Big data introduces large volumes of unstructured data for analysis. This data could be in the form of a text file or any multimedia file (e.g., audio, video). To bring this data into the fold of an information management solution, its metadata should be correctly defined.

Metadata management solutions provided by various vendors usually have a narrow focus. An ETL vendor will capture metadata for the ETL process. A BI vendor will provide metadata management capabilities for their BI solution. The siloed nature of metadata does not give business users an opportunity to have a say and actively engage in metadata management. A good metadata management solution must provide visibility across multiple solutions and bring business users into the fold for a collaborative, active metadata management process.

METADATA MANAGEMENT CHALLENGES
Metadata, simply defined, is data about data. In the context of analytics, some common examples of metadata are report definitions, table definitions, the meaning of a particular master data entity (sold-to customer, for example), ETL mappings, and formulas and computations. The importance of metadata cannot be overstated.
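To make these examples concrete, here is a minimal sketch of how such metadata entries might be captured in one uniform structure. It is illustrative only: the field names, the in-memory registry and the sample entries are assumptions, not part of any vendor solution discussed here.

```python
from dataclasses import dataclass

@dataclass
class MetadataEntry:
    """Data about data: one uniform record per metadata asset."""
    name: str           # e.g., a report, table or master data entity
    kind: str           # "report", "table", "master-entity", "etl-mapping"
    definition: str     # business meaning agreed between IT and business
    source_system: str  # where the asset lives
    steward: str        # accountable owner

# A toy in-memory registry keyed by asset name; a real solution would persist it.
registry = {}

def register(entry):
    registry[entry.name] = entry

register(MetadataEntry(
    name="sold_to_customer",
    kind="master-entity",
    definition="The customer who places the order (may differ from the payer)",
    source_system="MDM",
    steward="sales-ops"))

register(MetadataEntry(
    name="monthly_revenue_report",
    kind="report",
    definition="Revenue by region, derived from billing ETL mappings",
    source_system="BI",
    steward="finance"))

print(registry["sold_to_customer"].definition)
```

Keeping every kind of metadata asset in one uniform shape is what lets a later governance or distribution layer treat report definitions, table definitions and ETL mappings alike.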
Figure 1: Data Governance Shift with Big Data Analytics. The figure contrasts a single monolithic governance process with multiple concurrent governance processes, each made up of people, rules, metrics and process. Source: Infosys Research
Metadata drives the accuracy of reports, validates data transformations, ensures the accuracy of calculations and enforces consistent definitions of business terms across multiple business users.

In a typical large enterprise which has grown by mergers, acquisitions and divestitures, metadata is scattered across the enterprise in various forms, as noted in the introduction. In large enterprises, there is wide acknowledgement that metadata management is critical, but most of the time there is no enterprise-level sponsorship of a metadata management initiative. Even if there is, it is focused on one specific project sponsored by one specific business.

The impact of good metadata management practices is not consistently understood across the various levels of the enterprise. Conversely, the impact of poorly managed metadata comes to light only after the fact, i.e., after a certain transformation happens, a report or a calculation is run, or two divisional data sources are merged.

Metadata is typically viewed as the exclusive responsibility of the IT organization, with business having little or no input or say in its management. The primary reason is that there are multiple layers of organization between IT and business. This introduces communication barriers between IT and business.

Finally, metadata is not viewed as a very exciting area of opportunity. It is only addressed as an after-thought.

DIFFERENCES BETWEEN TRADITIONAL AND BIG DATA ANALYTICS
In traditional analytics implementations, data is typically stored in a data warehouse. The data warehouse is modeled using one of several techniques developed over time, and is a constantly evolving entity.
Analytics applications developed using the data in a data warehouse are also long-lived. Data governance in traditional analytics is a centralized process, and metadata is managed as part of the data governance process. In traditional analytics, data is discovered, collected, governed, stored and distributed.

Big data introduces large volumes of unstructured data. This data is highly dynamic and therefore needs to be ingested quickly for analysis. Big data analytics applications, however, are characterized by short-lived, quick implementations focused on solving a specific business problem. The emphasis of Big data analytics applications is more on experimentation and speed, as opposed to a long drawn-out modeling exercise.

The need to experiment and derive insights quickly using data changes the way data is governed. In traditional analytics there is usually one central governance team focused on governing the way data is used and distributed in the enterprise. In Big data analytics, there are multiple governance processes in play simultaneously, each geared towards answering a specific business question. Figure 1 illustrates this.

Most of the metadata management challenges referred to in the previous section alluded to typical enterprise data that is highly structured. To analyze unstructured data, additional metadata definitions are necessary.

To illustrate the need to enhance metadata to support Big data analytics, consider sentiment analysis of social media conversations as an example. Say someone posts a message on Facebook: “I do not like my cell-phone reception. My wireless carrier promised wide cell coverage but it is spotty at best. I think I will switch carriers”. To infer the intent of this customer, the inference engine has to rely on metadata as well as the supporting domain ontology. The metadata will define “Wireless Carrier”, “Customer”, “Sentiment” and “Intent”. The inference engine will leverage the ontology dependent on this metadata to infer that this customer wants to switch cell phone carriers.
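The sketch below is a deliberately naive rendering of that idea: a metadata term list plus a tiny churn ontology drive a keyword-based intent check. The term lists, phrase lists and function name are invented for illustration; a real inference engine would rely on natural language processing rather than substring matching.

```python
# Metadata: terms that identify the "Wireless Carrier" entity in free text.
CARRIER_TERMS = ["wireless carrier", "cell-phone", "cell coverage"]

# Ontology: phrases that, for the carrier domain, signal an intent to churn.
CHURN_PHRASES = ["switch carriers", "cancel my plan", "leave this network"]

def infer_intent(post):
    """Return a coarse intent label for one social media post."""
    text = post.lower()
    mentions_carrier = any(term in text for term in CARRIER_TERMS)
    wants_to_switch = any(phrase in text for phrase in CHURN_PHRASES)
    if mentions_carrier and wants_to_switch:
        return "switch cell phone carriers"
    return "unknown"

post = ("I do not like my cell-phone reception. My wireless carrier promised "
        "wide cell coverage but it is spotty at best. I think I will switch carriers")
print(infer_intent(post))  # switch cell phone carriers
```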
Big data is not just restricted to text. It could also contain images, videos and voice files. Understanding, categorizing and creating metadata to analyze this kind of non-traditional content is critical.

It is evident that Big data introduces additional challenges in metadata management. It is clear that there is a need for a robust metadata management process which governs metadata with the same rigor as data, for enterprises to be successful with Big data analytics.

To summarize, a metadata management process specific to Big data should incorporate the context and intent of data, support non-traditional sources of data, and be robust enough to handle the velocity of Big data.

ILLUSTRATIVE EXAMPLE
Consider an existing master data management system in a large enterprise. This master data system has been developed over time and has specific master data entities like product, customer, vendor, employee, etc. The master data system is tightly governed, and data is processed (cleansed, enriched and augmented) before it is loaded into the master data repository.

This specific enterprise is considering bringing in social media data for enhanced customer analytics. This social media data is to be sourced from multiple sources and incorporated into the master data management system.
As noted earlier, social media conversations have context, intent and sentiment. The context refers to the situation in which a customer was mentioned, the intent refers to the action that an individual is likely to take, and the sentiment refers to the “state of being” of the individual.

For example, say an individual sends a tweet or starts a Facebook conversation about a retailer from a football game. The context would then be a sports venue. If the tweet or conversation consisted of positive comments about the retailer, the sentiment would be determined as positive. If the update highlighted a promotion by the retailer, the intent would be to collaborate or share with the individual's network.

If such social media updates have to be incorporated into any solution within the enterprise, the master data management solution has to be enhanced with metadata about “Context”, “Sentiment” and “Intent”. Static lookup information will need to be generated and stored so that an inference engine can leverage this information to provide inputs for analysis. This will also necessitate a change in the back-end: the ETL processes that are responsible for this master data will now have to incorporate the social media data as well. Furthermore, the customer information extracted from these feeds needs to be standardized before being loaded into any transaction system.

FRAMEWORK FOR METADATA MANAGEMENT IN BIG DATA ANALYTICS
We propose that metadata be managed using the five components shown in Figure 2.

Figure 2: Metadata Management Framework for Big Data Analytics. The framework stacks five layers: Metadata Discovery, Metadata Collection, Metadata Governance, Metadata Storage and Metadata Distribution. Source: Infosys Research

Metadata Discovery – Discovering metadata is critical in Big data for the reasons of context and intent noted in the prior section. Social data is typically sourced from multiple sources, and all these sources will have different formats. Once metadata for a certain entity is discovered for one source, it needs to be harmonized across all sources of interest. This process for Big data will need to be formalized using metadata governance.

Metadata Collection – A metadata collection mechanism should be implemented. A robust collection mechanism should aim to minimize or eliminate metadata silos. Once again, a technology or a process for metadata collection should be implemented.

Metadata Governance – Metadata creation and maintenance needs to be governed. Governance should include resources from both the business and IT teams, and a collaborative framework between business and IT should be established to provide this governance. Appropriate processes (manual or technical) should be utilized for this purpose. For example, on-boarding a new Big data source should be a collaborative effort between business users and IT, with IT providing the technology that enables business users to discover metadata.
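To make the harmonization step in Metadata Discovery concrete, here is a minimal sketch that maps differently named fields from two sources onto one canonical entity definition. The source names, field maps and canonical schema are assumptions made for illustration.

```python
# Two sources expose the same logical entity under different field names.
SOURCE_FIELD_MAP = {
    "facebook": {"user": "customer_id", "message": "text", "ts": "posted_at"},
    "twitter":  {"handle": "customer_id", "tweet": "text", "time": "posted_at"},
}

def harmonize(source, record):
    """Rename source-specific fields to the canonical metadata definition."""
    mapping = SOURCE_FIELD_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

fb = {"user": "u42", "message": "spotty coverage", "ts": "2013-01-05"}
tw = {"handle": "@u42", "tweet": "switching carriers", "time": "2013-01-06"}

print(harmonize("facebook", fb))
print(harmonize("twitter", tw))
# Both records now share one schema: customer_id, text, posted_at.
```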
Figure 3: Equal Importance of Metadata and Data Processing for Big Data Analytics. Each metadata layer (discovery, collection, governance, storage, distribution) is mirrored by a corresponding data layer, culminating in Big data distribution. Source: Infosys Research
Metadata Storage – Multiple models for enterprise metadata storage exist; the Common Warehouse Metamodel (CWM) is one example. A similar model, or an extension thereof, can be utilized for this purpose. If no such model fits the requirements of an enterprise, suitable custom models can be developed.

Metadata Distribution – This is the final component. Metadata, once stored, will need to be distributed to consuming applications. A formal distribution model should be put in place to enable this distribution. For example, some applications can integrate directly with the metadata storage layer, while others will need specialized interfaces to be able to leverage this metadata.
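The sketch below wires the collection, governance, storage and distribution components together in miniature, with SQLite standing in for a CWM-style repository and a query function standing in for the distribution interface. The schema and function names are illustrative assumptions, not a prescribed implementation.

```python
import sqlite3

# Storage: a deliberately tiny stand-in for a CWM-style metadata store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE metadata (
    name TEXT PRIMARY KEY, kind TEXT, definition TEXT, approved INTEGER)""")

def collect(name, kind, definition):
    """Collection: land discovered metadata, unapproved by default."""
    db.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, 0)",
               (name, kind, definition))

def govern(name):
    """Governance: a business/IT reviewer approves the entry."""
    db.execute("UPDATE metadata SET approved = 1 WHERE name = ?", (name,))

def distribute(name):
    """Distribution: consuming applications see only approved metadata."""
    row = db.execute("SELECT name, kind, definition FROM metadata "
                     "WHERE name = ? AND approved = 1", (name,)).fetchone()
    return dict(zip(("name", "kind", "definition"), row)) if row else None

collect("sentiment", "derived-attribute", "State of being expressed in a post")
print(distribute("sentiment"))   # None: collected but not yet governed
govern("sentiment")
print(distribute("sentiment"))   # {'name': 'sentiment', ...}
```

The point of the toy governance gate is that consuming applications never see metadata that has not passed a business/IT review, which is the collaborative process argued for above.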
We note that in traditional analytics implementations, a framework similar to the one we propose already exists, but for data. The metadata management framework should be implemented alongside a data management framework to realize Big data analytics.

THE PARADIGM SHIFT
The discussion in this paper brings to light the importance of metadata and the impact it has not only on Big data analytics but on traditional analytics as well. We are of the opinion that if enterprises want to get value out of their data assets and leverage the Big data tidal wave, then the time is right to shift the paradigm from data governance to metadata governance, and to make data management part of the metadata governance process.

A framework is only as good as how it is viewed and implemented within the enterprise. The metadata management framework is successful if there is sponsorship for this effort from the highest levels of management.
This includes both business and IT leadership within the enterprise. The framework can be viewed as being very generic; since change is a constant in any enterprise, the framework can be made flexible to adapt to the changing needs and requirements of the business.

All the participants and personas engaged in the data management function within an enterprise should participate in the process. This will promote and foster collaboration between business and IT. This should be made sustainable and followed diligently by all the participants, until this framework is used to onboard not only new data sources but also new participants in the process.

Metadata and its management is an oft-ignored area in enterprises, with multiple consequences. The absence of robust metadata management processes leads to erroneous results, project delays and multiple interpretations of business data entities. These are all avoidable with a good metadata management framework.

The consequences affect the entire enterprise either directly or indirectly. From the lowest-level employee to the senior-most executive, incorrect or poorly managed metadata not only affects operations but also directly impacts the top-line growth and bottom-line profitability of an enterprise. Big data is viewed as the most important innovation that brings tremendous value to enterprises. Without a proper metadata management framework, this value might not be realized.

CONCLUSION
Big data has created quite a bit of buzz in the marketplace. Pioneers like Yahoo and Google created the foundations of what is today called Hadoop. There are multiple players in the Big data market today, developing everything from technology to manage Big data, to applications needed to analyze Big data, to companies engaged in Big data analysis and selling that content.

In the midst of all the innovation in the Big data space, metadata is often forgotten. It is important for us to recognize and realize the importance of metadata management and the critical impact it has on enterprises.

If enterprises wish to remain competitive, they have to embark on Big data analytics initiatives. In this journey, enterprises cannot afford to ignore the metadata management problem.

REFERENCES
1. Davenport, T., and Harris, J. (2007), Competing on Analytics: The New Science of Winning, Harvard Business School Press.
2. Jennings, M., What role does metadata management play in enterprise information management (EIM)? Available at http://searchbusinessanalytics.techtarget.com/answer/The-importance-of-metadata-management-in-EIM.
3. Metadata Management Foundation Capabilities Component (2011). Available at http://mike2.openmethodology.org/wiki/Metadata_Management_Foundation_Capabilities_Component.
4. Rogers, D. (2010), Database Management: Metadata is more important than you think. Available at http://www.databasejournal.com/sqletc/article.php/3870756/Database-Management-Metadata-is-more-important-than-you-think.htm.
5. Data Governance Institute (2012), The DGI Data Governance Framework. Available at http://datagovernance.com/fw_the_DGI_data_governance_framework.html.
Optimization Model for Improving Supply Chain Visibility
By Saravanan Balaraj

Enterprises need to adopt different Big data analytic tools and technologies to improve their supply chains
In today's competitive 'lead or leave' marketplace, Big data is seen as an oxymoron that offers both challenge and opportunity. Effective and efficient strategies to acquire, manage and analyze data lead to better decision making and competitive advantage. Unlocking potential business value out of this diverse and multi-structured dataset, beyond organizational boundaries, is a mammoth task.

We have stepped into an interconnected and intelligent digital world where the convergence of new technologies is happening fast all around. In this process the underlying data set is growing not only in volume but also in velocity and variety. The resulting data explosion, created by a combination of mobile devices, tweets, social media, blogs, sensors and emails, demands a new kind of data intelligence.

Big data has started creating a lot of buzz across verticals, and Big data in the supply chain is no different. The supply chain is one of the key focus areas undergoing transformational changes in the recent past. Traditional supply chain applications leverage only transactional data to solve operational problems and improve efficiency. Having stepped into the Big data world, existing supply chain applications have become obsolete, as they are unable to cope with tremendously increasing data volumes cutting across multiple sources, the speed with which data is generated, and unprecedented growth in new data forms.

Enterprises are under tremendous pressure to solve new problems emerging out of new forms of data. Handling large volumes of data across multiple sources and deriving value out of this massive chunk for strategy execution is the biggest challenge that enterprises face in today's competitive landscape. Careful analysis and appropriate usage of these data would result in cost reduction and better operational performance.
Competitive pressures and customers' 'more for less' attitudes have left enterprises with no choice other than to re-think their supply chain strategies and create a differentiation.

Enterprises need to adopt appropriate Big data techniques and technologies and build suitable models to derive value out of this unstructured data, and thereby plan, schedule and route in a cost-effective manner. This paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

BIG DATA WAVE
International Data Corporation (IDC) has predicted that the Big data market will grow from $3.2 billion in 2010 to $16.9 billion by 2015, at a compound annual growth rate of 40% [1]. This shows tremendous traction towards Big data tools, technologies and platforms among enterprises. A lot of research and investment is being directed at how to fully tap the potential benefits hidden in Big data and derive financial value out of it. Value derived out of Big data enables enterprises to achieve differentiation by reducing cost, planning efficiently and thereby improving process efficiency.

Big data is an important asset in the supply chain which enterprises are looking to capitalize upon. They adopt different Big data analytic tools and technologies to improve their supply chain, production and customer engagement processes. The path towards operational excellence is facilitated through efficient planning and scheduling of production and logistics processes.

Though supply chain data is really huge, it brings about the biggest opportunity for enterprises to reduce cost and improve their operational performance. The areas in supply chain planning where Big data can create an impact are demand forecasting, inventory management, production planning, vendor management and logistics optimization. Big data can improve the supply chain planning process if appropriate business models are identified, designed, built and then executed. Some of its key benefits are short time-to-market, improved operational excellence, cost reduction and increased profit margins.

CHALLENGES WITH SUPPLY CHAIN PLANNING
Supply chain planning process success depends on how closely demands are forecasted, inventories are managed and logistics are planned. The supply chain is the heart of an industry vertical and, if managed efficiently, drives positive business and enables sustainable advantage. With the emergence of Big data, optimizing supply chain processes has become more complicated than ever before. Handling Big data challenges in the supply chain and transforming them into opportunities is the key to corporate success. The key challenges are:

■ Volume - According to a McKinsey report, the number of RFID tags sold globally is projected to increase from 12 million in 2011 to 209 billion in 2021 [2]. Along with this phenomenal increase in the usage of temperature sensors, QR codes and GPS devices, the underlying supply chain data generated has multiplied manifold beyond our expectations. Data flows across multiple systems and sources, and hence is likely to be error-prone and incomplete. Handling such huge data volumes is a challenge.
Figure 1: Optimization Model for Improving Supply Chain Visibility - I. In the Acquire stage, data sourcing draws structured, unstructured, transactional, social, time-bound and new data types (sensor, RFID, QR, temperature, video, voice, digital image) from launch, promotion, customer, inventory, transportation and channel systems; data extraction and cleansing spans transactional systems (OLTP, DB) and Big data systems (Cascading, Hive, Pig, MapReduce, HDFS, NoSQL); data representation completes the stage. Source: Infosys Research
■ Velocity - Business has become highly dynamic and volatile. Changes arising due to unexpected events must be handled in a timely manner in order to avoid losing out in business. Enterprises are finding it extremely difficult to cope with this data velocity. Optimal decisions must be made quickly, and short processing time is the key to successful operational execution; this is lacking in traditional data management systems.

■ Variety - In the supply chain, data has emerged in different forms which don't fit traditional applications and models. Structured (transactional) data, unstructured (social) data and sensor data (temperature and RFID), along with new data types (video, voice and digital images), have created nightmares for enterprises handling such diverse and heterogeneous data sets.

In today's data explosion in terms of volume, variety and velocity, handling the data alone doesn't suffice. Value creation by analyzing such massive data sets, and the extraction of data intelligence for successful strategy execution, is the key.

BIG DATA IN DEMAND FORECASTING & SUPPLY CHAIN PLANNING
Enterprises use forecasting to determine how much of each product type to produce, and when and where to ship it, thereby improving supply chain visibility. An inaccurate forecast has a detrimental effect on the supply chain. Over-forecasting results in inventory pile-ups and locked working capital. Under-forecasting leads to failure in meeting demand, resulting in loss of customers and sales. Hence, in today's volatile market of unpredictable shifts in customer demand, improving forecast accuracy is of paramount importance.

Data in supply chain planning has mushroomed in terms of volume, velocity and variety. Tesco, for instance, generates more than 1.5 billion new data items every month. Wal-Mart's warehouse handles some 2.5 petabytes of information, which is roughly equivalent to half of all the letters delivered by the US Postal Service in 2010. According to the McKinsey Global Institute report [2], leveraging Big data in demand forecasting and supply chain planning could increase profit margins by 2-3% in the Fast Moving Consumer Goods (FMCG) manufacturing value chain. This unearths tremendous opportunity in forecasting and supply chain planning for enterprises to capitalize on the Big data deluge.

MISSING LINKS IN TRADITIONAL APPROACHES
Enterprises have started realizing the importance of Big data in forecasting and have begun investing in Big data forecasting tools and technologies to improve their supply chain, production and manufacturing planning processes. Traditional forecasting tools aren't adequate for handling huge data volumes, variety and velocity. Moreover, they are missing out on the following key aspects which improve forecast accuracy:

■ Social Media Data As An Input: Social media is a platform that enables enterprises to collect information about potential and prospective customers. Technological advancements have made tracking customer data easier: companies can now track every visit a customer makes to their websites, e-mails exchanged, and comments logged across social media websites. Social media data helps analyze the customer pulse and gain insights for forecasting, planning and scheduling of the supply chain and inventories. Buzz in social networks can be used as an input to demand forecasting, with numerous benefits (a small sketch follows this list). In one use case, an enterprise can launch a new product to online fans to sense customer acceptance; based on the response, inventories and the supply chain can be planned to direct stocks to high-buzz locations during the launch phase.

■ Predict And Respond Approach: Traditional forecasting is done by analyzing historical patterns and considering sales inputs and promotional plans to forecast demand and plan the supply chain. It focuses on 'what happened' and works on a 'sense and respond' strategy. 'History repeats itself' is no longer apt in today's competitive marketplace. Enterprises need to focus on 'what will happen' and require a 'predict and respond' strategy to stay alive in business. This calls for models and systems capable of capturing, handling and analyzing huge volumes of real-time data generated from unexpected competitive events, weather patterns, point-of-sale systems and natural disasters (volcanoes, floods, etc.), and converting them into actionable information for forecasting plans on production, inventory holdings and supply chain distribution.
■ Optimized Decisions with Simulations: Traditional decision support systems lack the flexibility to meet changing data requirements. In real-world scenarios, the supply chain delivery plan changes unexpectedly for various reasons, like a demand change or a revised sales forecast. The model and system should have the ability to factor this in and respond quickly to such unplanned events. A decision should be taken only after careful analysis of the unplanned event's impact on other elements of the supply chain. Traditional approaches lack this capability, and this necessitates a model for performing what-if analysis on all possible decisions and selecting the optimal one in the Big data context.
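As a toy illustration of the social media input described in the first bullet above, the following sketch fits an ordinary least-squares model in which a social buzz index joins last week's sales as a predictor of demand. All numbers, and the idea of a single scalar 'buzz index', are invented for illustration.

```python
import numpy as np

# Invented weekly history: last week's sales, social buzz index, actual demand.
last_sales = np.array([100, 120, 115, 130, 160, 155])
buzz_index = np.array([10, 30, 20, 50, 90, 70])
demand     = np.array([110, 135, 120, 150, 200, 180])

# Design matrix with an intercept column; fit coefficients by least squares.
X = np.column_stack([np.ones_like(last_sales), last_sales, buzz_index])
coef, *_ = np.linalg.lstsq(X, demand, rcond=None)

# Forecast next week: sales of 150 units and an unusually high buzz of 95.
next_week = np.array([1.0, 150.0, 95.0])
print("forecast:", round(float(next_week @ coef), 1))
```

Even in this toy setting, a spike in buzz raises the forecast above what sales history alone would suggest, which is exactly the launch-phase effect the bullet describes.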
IMPROVING SUPPLY CHAIN VISIBILITY USING BIG DATA
The supply chain doesn't lack data – what's missing is a suitable model to convert this huge, diverse raw data into actionable information so that enterprises can make critical business decisions for efficient supply chain planning. A 3-stage optimized value model helps to overcome the challenges posed by Big data in supply chain planning and demand forecasting. It bridges the existing gaps in traditional Big data approaches and offers a perspective to unlock the value from the growing Big data torrent. Designing and building an optimized Big data model for supply chain planning is a complex task, but successful execution leads to significant financial benefits. Let's take a deep dive into each stage of this model and analyze its value-add in an enterprise's supply chain planning process.

Acquire Data: The biggest driver of supply chain planning is data. Acquiring all the relevant data for supply chain planning is the first step in this optimized model. It involves three steps, namely data sourcing, data extraction and cleansing, and data representation, which make data ready for further analysis.

■ Data Sourcing - Data is available in different forms across multiple sources, systems and geographies. It contains extensive details of historical demand and other relevant information, so it is necessary to source the required data for further analysis. Data to be sourced for improving forecast accuracy, in addition to transactional data, includes:
■ Product promotion data - items, prices, sales
■ Launch data - items to be ramped up or down
■ Inventory data - stock in warehouses
■ Customer data - purchase history, social media data
■ Transportation data - GPS and logistics data.
Enterprises should adopt appropriate Big data systems that are capable of handling such huge data volumes, variety and velocity.
■ Data Extraction and Cleansing - Data sources come in different forms, from structured (transactional data) to unstructured (social media, images, sensor data, etc.), and they are not in analysis-friendly formats. Also, due to the large volume of heterogeneous data, there is a high probability of inconsistencies and data errors while sourcing. The sourced data should be expressed in structured form for supply chain planning; moreover, analyzing inaccurate and untimely data leads to erroneous, non-optimal results. High-quality, comprehensive data is a valuable asset, and appropriate data cleansing mechanisms should be in place to maintain the quality of Big data (a minimal cleansing sketch follows this list). The choice of Big data tools for data cleansing and enrichment plays a crucial role in supply chain planning.

■ Data Representation – Database design for such huge data volumes is a herculean task and poses serious performance issues if not executed properly. Data representation plays a key role in Big data analysis. There are numerous ways to store data, and each design has its own set of advantages and drawbacks. Selecting and executing an appropriate database design that favors the business objectives reduces the effort in reaping benefits out of Big data analysis in supply chain planning.
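A minimal sketch of the extraction-and-cleansing step from the list above: heterogeneous raw rows are coerced into one structured form, with unparseable rows dropped and duplicates removed. The record layout and field names are assumptions for illustration.

```python
raw_records = [
    {"sku": "A1", "qty": "5",   "source": "pos"},
    {"sku": "A1", "qty": "5",   "source": "pos"},     # duplicate feed
    {"sku": "B2", "qty": "n/a", "source": "sensor"},  # malformed quantity
    {"sku": "C3", "qty": "12",  "source": "social"},
]

def cleanse(records):
    """Structure, validate and de-duplicate heterogeneous source rows."""
    seen, clean = set(), []
    for r in records:
        try:
            row = {"sku": r["sku"].strip().upper(), "qty": int(r["qty"])}
        except (KeyError, ValueError):
            continue                      # drop rows that cannot be parsed
        key = (row["sku"], row["qty"])
        if key not in seen:
            seen.add(key)
            clean.append(row)
    return clean

print(cleanse(raw_records))  # [{'sku': 'A1', 'qty': 5}, {'sku': 'C3', 'qty': 12}]
```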
Analyze Data: The next stage is analyzing the cleansed data and capturing value for forecasting and supply chain planning. There is a plethora of Big data techniques available in the market for forecasting and supply chain planning, and the selection of a technique depends on the business scenario and enterprise objectives. Incompatible data formats make value creation from Big data a complex task, and this calls for innovation in techniques to unlock business value out of the growing Big data torrent. The proposed model adopts an optimization technique to generate insights out of this voluminous and diverse Big dataset.

■ Optimization in Big data analysis - Manufacturers have started synchronizing forecasting with production cycles, so forecasting accuracy plays a crucial role in their success. Adopting an optimization technique in Big data analysis creates a new perspective and helps improve the accuracy of demand forecasting and supply chain planning. Analyzing the impact of promotions on one specific product for demand forecasting appears to be an easy task, but real-life scenarios comprise a huge army of products, with the factors affecting demand varying for every product and location, making the analysis difficult for traditional techniques.

The optimization technique has several capabilities which make it an ideal choice for data analysis in such scenarios. Firstly, it is designed for analyzing and drawing insights from highly complex systems with huge data volumes and multiple constraints and factors to be accounted for. Secondly, supply chain planning has a number of enterprise objectives associated with it, like cost reduction, demand fulfillment, etc., and the impact of each of these objective measures on enterprise profitability can be easily analyzed using the optimization technique.
Figure 2: Optimization Model for Improving Supply Chain Visibility – II. The Acquire stage (data sourcing, data extraction and cleansing, data representation) feeds an optimization technique in the Analyze stage, with input data, goals (minimize cost, maximize profit, maximize demand coverage), constraints (capacity, demand coverage, route) and outputs (inventory plan, demand plan, logistics plan). The Achieve stage comprises scenario management (build, compare, simulate), multi-user collaboration and performance trackers (KPI dashboards, actual vs. planned). Source: Infosys Research
The flexibility of the optimization technique is another benefit that makes it suitable for Big data analysis, uncovering new data connections and turning them into insights.

The optimization model comprises four components, viz.: (i) input – consistent, real-time, quality data which is sourced, cleansed and integrated becomes the input of the optimization model; (ii) goals – the model should take into consideration all the goals pertaining to forecasting and supply chain planning, like minimizing cost, maximizing demand coverage, maximizing profits, etc.; (iii) constraints – the model should incorporate all the constraints specific to supply chain planning, such as minimum inventory in a warehouse, capacity constraints, route constraints, demand coverage constraints, etc.; and (iv) output – results based on the input, goals and constraints defined in the model, which can be used for strategy execution. The result can be a demand plan, inventory plan, production plan, logistics plan, etc.
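These four components can be made concrete with a small linear program. The sketch below uses scipy to minimize shipping cost (goal) over invented per-route costs (input), subject to warehouse capacity and store demand coverage (constraints), and prints the resulting shipment plan (output). All figures are fabricated for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Input: cost per unit shipped on each route (w1->s1, w1->s2, w2->s1, w2->s2).
cost = np.array([4.0, 6.0, 5.0, 3.0])

# Constraints: each warehouse ships at most its capacity...
A_ub = np.array([[1, 1, 0, 0],    # warehouse 1 capacity
                 [0, 0, 1, 1]])   # warehouse 2 capacity
b_ub = np.array([80, 70])

# ...and each store's demand must be covered exactly.
A_eq = np.array([[1, 0, 1, 0],    # store 1 demand
                 [0, 1, 0, 1]])   # store 2 demand
b_eq = np.array([60, 50])

# Goal: minimize total cost; output: the optimal shipment plan.
plan = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
print("shipments:", plan.x.round(1), "total cost:", round(plan.fun, 1))
```

Here the solver routes each store's demand to its cheapest feasible warehouse; in a real planning model the same structure simply carries many more routes, goals and constraints.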
■ Choice of Algorithm: One of the key differentiators in supply chain planning is the algorithm used in modeling.
Optimization problems have numerous possible solutions, and the algorithm should have the capability to fine-tune itself to achieve optimal solutions.

Achieve Business Objective: The final stage in this model is achieving business objectives through demand forecasting and supply chain planning. It involves three steps which facilitate an enterprise's supply chain decisions.

■ Scenario Management – Business events are difficult to predict and most of the time deviate from their standard paths, resulting in unexpected behaviors and events. This makes planning and optimizing difficult during uncertain times. Scenario management is the approach to overcome such uncertain situations: it facilitates creating business scenarios, comparing multiple scenarios, and analyzing and assessing their impact before making decisions (a what-if sketch follows this list). This capability helps balance conflicting KPIs and arrive at an optimal solution matching business needs.

■ Multi User Collaboration – An optimization model in a real business case comprises highly complex data sets and models, and requires support from an army of analysts to determine its effects on enterprise goals. A combination of technical and domain experts is required to obtain optimal results. To achieve near-accurate forecasts and supply chain optimization, the model should support multi-user collaboration, so that multiple users can collaboratively produce optimal plans and schedules and re-optimize as and when the business changes. This model builds a collaborative system capable of supporting inputs from multiple users and incorporating them in its decision-making process.

■ Performance Tracker – Demand forecasting and supply chain planning do not follow a build-model-execute approach; they require significant continuous effort. Frequent changes in the inputs and business rules necessitate monitoring of data, model and algorithm performance. Actual and planned results are to be compared regularly, and steps taken to minimize deviations in accuracy. KPIs are to be defined, and dashboards should be constantly monitored for model performance.
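A sketch of the scenario-management step above: the same illustrative shipping model is re-solved under alternative demand scenarios, and the cost KPI of each is compared before a plan is committed. The scenario names and numbers are invented.

```python
import numpy as np
from scipy.optimize import linprog

COST = np.array([4.0, 6.0, 5.0, 3.0])          # per-route shipping cost
A_UB = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])  # warehouse capacities
B_UB = np.array([80, 70])
A_EQ = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])  # store demand coverage

def solve(demand):
    """Build-compare-simulate: one optimization run per scenario."""
    res = linprog(COST, A_ub=A_UB, b_ub=B_UB, A_eq=A_EQ, b_eq=np.array(demand))
    return round(res.fun, 1) if res.success else None

scenarios = {"baseline": [60, 50], "promo spike": [75, 65], "slow week": [40, 30]}
for name, demand in scenarios.items():
    print(f"{name:12s} -> total cost KPI: {solve(demand)}")
```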
KEY BENEFITS
Enterprises can accrue a lot of benefits by adopting this 3-stage model for Big data analysis. Some of them are detailed below.

Improves Forecast Accuracy: One of the key objectives of forecasting is profit maximization. This model adopts effective data sourcing, cleansing and integration systems and makes data ready for forecasting. The inclusion of social media data, promotional data, weather predictions and seasonality, in addition to historical demand and sales histories, adds value and improves forecasting accuracy. Moreover, the optimization technique for Big data analysis reduces forecasting errors to a great extent.

Continuous Improvement: The Acquire-Analyze-Achieve model is not hard-wired. It allows the flexibility to fine-tune, and it supports what-if analysis.
Multiple scenarios can be created, compared and simulated to identify the impact of a change on the supply chain and demand forecasting prior to making any decisions. The model also enables the enterprise to define, track and monitor KPIs from time to time, resulting in continuous process improvements.

Better Inventory Management: Inventory data, along with weather predictions, sales history and seasonality, is considered as an input to the model for forecasting and supply chain planning. This approach minimizes incidents of out-of-stock or over-stock across different warehouses. An optimal plan for inventory movement is forecasted, and appropriate stocks are maintained at each warehouse to meet upcoming demand. To a great extent this reduces loss of sales and business due to stock-outs, and leads to better inventory management.

Logistics Optimization: Constant sourcing and continuous analysis of transportation data (GPS and other logistics data), and using it for demand forecasting and supply chain planning through optimization techniques, helps improve distribution management. Moreover, optimization of logistics improves fuel efficiency and the routing of vehicles, resulting in operational excellence and better supply chain visibility.

CONCLUSIONS
As the rapid penetration of information technology in supply chain planning continues, the amount of data that can be captured, stored and analyzed has increased manifold. The challenge is to derive value out of these large volumes of data by unlocking financial benefits congruent with the enterprise's business objectives.

Competitive pressures and customers' 'more for less' attitude have left enterprises with no option other than reducing cost in their operational executions. Adopting effective and efficient supply chain planning and optimization techniques to match customer expectations with the enterprise's offerings is the key to corporate success. To attain operational excellence and sustainable advantage, it is necessary for the enterprise to build innovative models and frameworks leveraging the power of Big data.

The optimized value model on Big data offers a unique way of demand forecasting and supply chain optimization through collaboration, scenario management and performance management. This model of continuous improvement opens doors to big opportunities in the next generation of demand forecasting and supply chain optimization.

REFERENCES
1. IDC Press Release (2012), IDC Releases First Worldwide Big Data Technology and Services Market Forecast, Shows Big Data as the Next Essential Capability and a Foundation for the Intelligent Economy. Available at http://www.idc.com/getdoc.jsp?containerId=prUS23355112.
2. McKinsey Global Institute (2011), Big data: The next frontier for innovation, competition, and productivity. Available at http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx.
3. Furio, S., Andres, C., Lozano, S., Adenso-Diaz, B. (2009), Mathematical model to optimize land empty container movements. Available at http://www.fundacion.valenciaport.com/Articles/doc/presentations/HMS2009_Paperid_27_Furio.aspx.
4. Stojković, G., Soumis, F., Desrosiers, J., Solomon, M. (2001), An optimization model for a real-time flight scheduling problem. Available at http://www.sciencedirect.com/science/article/pii/S0965856401000398.
5. Beck, M., Moore, T., Plank, J., Swany, M. (2000), Logistical Networking. Available at http://loci.cs.utk.edu/ibp/files/pdf/LogisticalNetworking.pdf.
6. Lasschuit, W., Thijssen, N. (2004), Supporting supply chain planning and scheduling decisions in the oil and chemical industry, Computers and Chemical Engineering, issue 28, pp. 863-870. Available at http://www.aimms.com/aimms/download/case_studies/shell_elsevier_article.pdf.
Retail Industry – Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu

Gain better insight into customer dynamics through Big Data analytics
The retail industry is going through a major paradigm shift. The past decade has seen unprecedented churn in the retail industry, virtually changing the landscape. Erstwhile marquee brands from the traditional retailing side have ceded space to start-ups and new business models.

The key driver of this change is a confluence of technological, sociological and customer behavioral trends, creating a strategic inflection point in the retailing ecology. Trends like the emergence of the internet as a major retailing channel, social platforms going mainstream, pervasive retailing and the emergence of the digital customer have presented a major challenge to traditional retailers and retailing models.

On the other hand, these trends have also enabled opportunities for retailers to better understand customer dynamics. For the first time, retailers have access to an unprecedented amount of publicly available information on customer behavior and trends, voluntarily shared by customers. The more effective retailers can tap into these behavioral and social reservoirs of data to model the purchasing behaviors and trends of their current and prospective customers. Such data can also provide retailers with predictive intelligence which, if leveraged effectively, can create enough mindshare that the sale is completed even before the conscious decision to purchase is taken.

This move to a feedback economy, where retailers can have a 360-degree view of the customer thought process across the selling cycle, is a paradigm shift for the retail industry – from the retailer driving sales to the retailer engaging the customer across the sales and support cycle. Every aspect of retailing, from assortment/allocation planning and marketing/promotions to customer interactions, has to take the evolving consumer trends into consideration.

The implication from a business perspective is that retailers have to better understand customer dynamics and align their business processes effectively with these trends.
Implicit Guidance & Control
Analysis and Synthesis
Genetic Heritage
Unfolding Interaction with Enviroment
Forward
Observation New Information
Previous experiences
Feed
Cultural Transactions
Feed
Outside Information
Act
Decision (Hypothesis)
Feed
Implicit Guidance & Control
Decide
Forward
Unfolding Circ*mstances
Orient
Forward
Observe
Action (Test)
Unfolding Interaction with Enviroment
Feedback Feedback Feedback
Figure 1: OODA loop Source: Reference [5]
Source: Reference [5]
In addition, this implies that cycle times will be shorter, and businesses have to be more tactical in their promotions and offerings. Retailers who can ride this wave will be better able to address demand and command higher margins for their products and services. Failing this, retailers will be left in the low-margin pricing/commodity space.

From an information technology perspective, the key challenge is that the nature of this information, with respect to lifecycle, velocity, heterogeneity of sources and volume, is radically different from what traditional systems handle. Also, there are overarching concerns like data privacy, compliance and regulatory changes that need to be internalized within internal processes. The key is to manage the lifecycle of this Big data and effectively integrate it with the organizational systems to derive actionable information.

TOWARDS A FEEDBACK ECONOMY
Customer dynamics refers to the customer-business relationships that describe the ongoing interchange of information and transactions between customers and organizations, going beyond the transactional nature of the interaction to look at emotions, intent and desires. Retailers can create significant competitive differentiation by understanding the customer's true intent in a way that also supports the business' intents [1, 2, 3, 4].

John Boyd, a colonel and military strategist in the US Air Force, developed the OODA loop (Observe, Orient, Decide and Act), which he used for combat operations. Today's business environment is no different: retailers are battling to get customers into their shops (physical or net-front) and convert their visits to sales, and understanding customer dynamics plays a key role in this effort. The OODA loop explains the crux of the feedback economy.
In a feedback economy, there is constant feedback to the system from every phase of its execution. Along with this, the organization should observe the external environment, unfolding circumstances and customer interactions. These inputs are analyzed and action is taken based on them. This cycle of adaptation and optimization makes the organization more efficient and effective on an ongoing basis.

Leveraging this feedback loop is pivotal to having a proper understanding of customer needs and wants and the evolving trends. In today's environment, this means acquiring data from heterogeneous sources, viz., in-store transaction history, web analytics, etc. This creates a huge volume of data that has to be analyzed to get the required actionable insights.

BIG DATA LIFECYCLE: ACQUIRE-ANALYZE-ACTIONIZE
The lifecycle of Big data can be visualized as a three-phased approach resulting in continuous optimization. The first step in moving towards a feedback economy is to acquire data. In this case, the retailer should look into macro and micro environment trends and consumer behavior - their likes, emotions, etc. Data from electronic channels like blogs, social networking sites and Twitter will give the retailer a humongous amount of data regarding the consumer. These feeds help the retailer understand consumer dynamics and give more insight into her buying patterns.

The key advantage of plugging into these disparate sources is the sheer information one can gather about the customer – both individually and in aggregate. On the other hand, Big data is materially different from the data retailers are used to handling. Most of the data is unstructured (from blogs, twitter feeds, etc.) and cannot be directly integrated with traditional analytics tools, leading to challenges in how the data can be assimilated into backend decision-making systems and analyzed.

In the assimilate/analyze phase, the retailer must decide which data is of use and define rules for filtering out the unwanted data. Filtering should be done with utmost care, as there are cases where indirect inferences are possible. The data available to the retailer after the acquisition phase would be in multiple formats, and it has to be cleaned and harmonized with the backend platforms.

Cleaned-up data is then mined for actionable insights. Actionize is the phase where the insights gathered from the analyze phase are converted into actionable business decisions by the retailer.

The response, i.e., the business outcome, is fed back to the system so that the system can self-tune on an ongoing basis, resulting in a self-adaptive system that leverages Big data and feedback loops to offer business insight more customized than what would traditionally be possible. It is imperative to understand that this feedback cycle is an ongoing process and is not to be considered a one-stop solution for the analytics needs of a retailer.
- their likes, emotions, etc. Data from electronic channels like blogs, social networking sites
ACQUIRE: FOLLOWING CUSTOMER
and twitter will give the retailer a humongous
FOOTPRINTS
amount of data regarding the consumer. These
To understand the customer, retailers have to
feeds help the retailer understand consumer
leverage every interaction with the customer
dynamics and give more insights into her
and tap into the source of customer insight.
buying patterns.
Traditionally, retailers have relied primarily on
The key advantage of plugging into these
in-store customer interactions and associated
disparate sources is the sheer information one
transaction data along with specialized campaigns
can gather about customer – both individually
like opinion polls to gain better insight into
and in aggregate. On other hand, Big data is
customer dynamics. While this interaction looks
materially different from the data the retailers
limited, a recent incident shows how powerful
are used to handling. Most of the data is
customer sales history can be leveraged to gain
unstructured (from blogs, twitter feeds, etc.) and
predictive intelligence on customer needs.
21
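As an illustration of the assimilate/analyze phase described above, the following Python sketch shows how records from two feeds might be harmonized into one schema and unwanted entries dropped before reaching the backend platforms. The feed formats and the filter rule are invented for the example and are not taken from any retailer’s actual pipeline.

```python
# Illustrative sketch (not a production implementation): harmonizing
# multi-format feeds into one schema and filtering out unwanted records.

import json

def from_pos(record):          # structured in-store transaction
    return {"customer": record["cust_id"], "channel": "store",
            "signal": record["item"], "value": record["amount"]}

def from_social(raw_json):     # unstructured social feed, JSON-wrapped here
    post = json.loads(raw_json)
    return {"customer": post.get("user"), "channel": "social",
            "signal": post.get("text", ""), "value": None}

def keep(record):
    """Filter rule: drop records with no customer handle; real rules would
    also guard against fields that permit unintended indirect inferences."""
    return record["customer"] is not None

pos_feed = [{"cust_id": "C1", "item": "jacket", "amount": 79.0}]
social_feed = ['{"user": "C1", "text": "love the new jackets"}',
               '{"text": "anonymous post"}']

harmonized = [from_pos(r) for r in pos_feed] + [from_social(r) for r in social_feed]
clean = [r for r in harmonized if keep(r)]
print(clean)   # records now share one schema for backend analytics
```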
“A father of a teenage girl called a major North American retailer to complain that the retailer had mailed coupons for child care products addressed to his underage daughter. A few days later, the same father called again and apologized: his daughter was indeed pregnant, and he had not been aware of it earlier” [6]. Surprisingly, by all indications, only in-store purchase data was mined by the retailer in this scenario to identify the customer need, which in this case was for childcare products.

To exploit the power of the next generation of analytics, retailers must plug into data from non-traditional sources like social sites, Twitter feeds, environment sensor networks, etc., to gain better insight into customer needs. Most major retailers now have multiple channels: brick-and-mortar store, online store, mobile apps, etc. Each of these touch points not only acts as a sales channel but can also generate data on customer needs and wants. By coupling this information with other repositories like Facebook posts and Twitter feeds (i.e., sentiment analysis) and with web analytics, retailers have the opportunity to track customer footprints both inside and outside the store and to customize their offerings and interactions with the customer.

Traditionally, retailers have dealt with voluminous data. For example, Wal-Mart logs more than 2.5 petabytes of information about customer transactions every hour, equivalent to 167 times the books in the Library of Congress [7]. However, the nature of Big data is materially different from traditional transaction data, and this must be considered when data planning is done. Further, while data is readily available, the legal and compliance aspects of gathering and using it are an additional consideration. Integrating information from multiple sources can generate data that is beyond what the user originally consented to, potentially resulting in liability for the retailer. Given that most of this information is accessible globally, retailers should ensure compliance with local regulations (EU data/privacy protection regulations, HIPAA for US medical data, etc.) wherever they operate.

ANALYZE: INSIGHTS (LEADS) TO INNOVATION

Analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources) [9]. The key to acquiring Big data is to handle these dimensions while assimilating the aforementioned external sources of data. To understand how Big data analytics can enrich and enhance a typical retail process, allocation planning, let us look at the allocation planning case study of a major North American apparel retailer.

The forecasting engine used in the planning process uses statistical algorithms to determine allocation quantities. Key inputs to the forecasting engine are sales history and current store performance. In addition, adjustments are made based on parameters like promotional events (including markdowns), current stock levels and back orders to determine the inventory that needs to be shipped to a particular store.

While this is fairly in line with the industry standard for allocation forecasting, Big data can enrich the process by including additional parameters that can impact demand. For example, a news piece on a town’s go-green initiative or no-plastic day can be taken as an additional adjustment parameter for non-green items in that area. Similarly, a weather forecast of a warm front in an area can automatically trigger a reduction in warm-clothing stocks for stores there.
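A minimal sketch of how such external signals could enter the forecast as adjustment parameters is shown below. The stores, categories and adjustment factors are invented for illustration and are not taken from the retailer’s actual engine.

```python
# A toy sketch of external Big data signals entering an allocation
# forecast as adjustment parameters; names and factors are hypothetical.

BASE_FORECAST = {("store_12", "warm_jacket"): 120,
                 ("store_12", "plastic_bottle"): 300}

# Adjustment factors derived (hypothetically) from news and weather feeds.
SIGNALS = [
    {"store": "store_12", "category": "warm_jacket",
     "reason": "warm front forecast", "factor": 0.70},
    {"store": "store_12", "category": "plastic_bottle",
     "reason": "town go-green initiative", "factor": 0.85},
]

def adjusted_allocation(base, signals):
    """Apply each signal's multiplicative factor to the matching base forecast."""
    result = dict(base)
    for s in signals:
        key = (s["store"], s["category"])
        if key in result:
            result[key] = round(result[key] * s["factor"])
    return result

print(adjusted_allocation(BASE_FORECAST, SIGNALS))
# {('store_12', 'warm_jacket'): 84, ('store_12', 'plastic_bottle'): 255}
```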
A high-level logical view of a Big data implementation is described below, to further the understanding of how Big data can be assimilated with traditional data sources. The data feeds for the implementation come from structured sources like forums, feedback forms and rating sites, from unstructured sources like the social web, and from semi-structured data such as emails and word documents. This is a veritable data feast compared to what traditional systems are served, but it is important that we diet on such data and use only those feeds that create optimum value. This is done through the synergy of business knowledge, processes specific to the retailer and the industry segment the retailer operates in, and a set of tools specialized in analyzing huge volumes of data at rapid speed. Once the data is massaged for downstream systems, big analytics tools are used to analyze it. Based on business needs, real-time or offline data processing/analytics can be used; in real-life scenarios, both approaches are applied according to situation and need.

Proper analysis needs data not just from consumer insight sources but also from transactional data history and consumer profiles.

ACTIONIZE: BIG DATA TO BIG IDEAS

This is the key part of the Big data cycle. Even the best data is no substitute for timely action. The technology and functional stack will facilitate the retailer getting proper insight into key customer intent on purchase: what, where, why and at what price. By knowing this, the retailer can customize the 4Ps (product, pricing, promotions and place) to create enough mindshare from the customer’s perspective that sales become inevitable [10].
| Rank | Best Sellers in Tablet PCs | Most Wished For in Tablet PCs |
|------|----------------------------|-------------------------------|
| 1 | Kindle Fire HD 7”, Dolby Audio, Dual-Band Wi-Fi, 32GB | Kindle Fire HD 8.9”, 4G LTE Wireless, Dolby Audio, Dual-Band Wi-Fi, 32GB |
| 2 | Kindle Fire HD 8.9”, Dolby Audio, Dual-Band Wi-Fi, 16GB | Kindle Fire HD 8.9”, Dolby Audio, Dual-Band Wi-Fi, 32GB |
| 3 | Samsung Galaxy Tab 2 (7-Inch, Wi-Fi) | Kindle Fire, Full Color 7” Multi-touch Display, Wi-Fi |
| 4 | Samsung Galaxy Tab 2 (10.1-Inch, Wi-Fi) | Kindle Fire HD 7”, Dolby Audio, Dual-Band Wi-Fi, 32 GB |
| 5 | Kindle Fire HD 8.9”, Dolby Audio, Dual-Band Wi-Fi, 32 GB | Samsung Galaxy Tab 2 (7-Inch, Wi-Fi) |

Figure 2: Correlation between Customer Ratings and Sales. Source: Reference [12]
For example, a cursory look at a random product category (tablets) on an online retailer’s site shows the strong correlation between customer ratings and sales, i.e., 4 out of 6 of the best user-rated products are in the top five in sales, a 60% correlation even when other parameters like brand, price and release date are not taken into consideration [Fig. 2] [12]. Knowing the customer ratings, the retailer can offer promotions that tip the balance between a sale and a lost opportunity. While this example may not be the rule, the key to analyzing and actionizing the data is to correlate user feedback data with concomitant sales.
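The observation above can be reproduced as a simple set-overlap calculation. The sketch below, with item names abbreviated from Figure 2, is merely illustrative of the comparison, not of any production analytics.

```python
# Recomputing the Figure 2 observation: overlap between the best-seller
# and most-wished-for lists (item names abbreviated from the figure).

best_sellers = ["KF HD 7in 32GB", "KF HD 8.9in 16GB", "Tab2 7in",
                "Tab2 10.1in", "KF HD 8.9in 32GB"]
most_wished  = ["KF HD 8.9in LTE 32GB", "KF HD 8.9in 32GB", "KF 7in",
                "KF HD 7in 32GB", "Tab2 7in"]

common = set(best_sellers) & set(most_wished)
overlap = len(common) / len(best_sellers)
print(common, f"{overlap:.0%}")   # 3 shared items -> 60% overlap
```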
BIG DATA OPPORTUNITIES

The implications of Big data analytics for the major retailing processes will be along the following areas.

■■ Identifying the Product Mix: Assortment and allocation will need to take into consideration the evolving user trends identified from Big data analytics, to ensure the offering matches market needs. Allocation planning especially has to be tactical, with shorter lead times.

■■ Promotions and Pricing: Retailers have to move from generic pricing strategies to customized, user-specific ones.

■■ Communication with Customer: Advertising will move from mass media to personalized communication, and from one-way to two-way communication. Retailers will gain more from viral marketing [13] than from traditional advertising channels.

■■ Compliance: Meeting governmental regulations and compliance requirements is mandatory to avoid liability, as co-mingling data from disparate sources can result in the generation of personal data beyond the scope of the original user’s intent. While data is available globally, its use has to comply with the local law of the land and must be done keeping the customer’s sensibilities in mind.

■■ People, Process and Organizational Dynamics: The move to the feedback economy requires a different organizational mindset and processes. Decision making will need to be more bottom-up and collaborative. Retailers need to engage customers to ensure the feedback loop is in place. Further, Big data being cross-functional, it needs active participation and coordination between various departments in the organization; hence managing organizational dynamics is a key consideration.

■■ Better Customer Experience: Organizations can improve the overall customer experience by providing updates on services and thereby eliminating surprises. For instance, Big data solutions can be used to proactively inform customers of expected shipment delays based on traffic data, climate and other external factors.

BIG DATA ADOPTION STRATEGY

Presented below is a perspective on how to adopt a Big data solution within the enterprise.
Define Requirements, Scope and Mandate: Define the mandate and objective in terms of what is required from the Big data solution. A guiding factor in identifying the requirements would be the prioritized list of business strategies. As part of initiation, it is important to also identify the goals and KPIs that justify the usage of Big data.
Key Player: Business

Choosing the Right Data Sources: Once the requirements and scope are defined, the IT department has to identify the various feeds that would fetch the relevant data. These feeds may be structured, semi-structured or unstructured, and the sources could be internal or external. For internal sources, policies and processes should be defined to enable frictionless flow of data.
Key Players: IT and Business

Choosing the Required Tools and Technologies: After deciding upon the sources of data that would feed the system, the right tools and technology should be identified and aligned with business needs. Key areas are capturing the data; tools and rules to clean the data; tools for real-time and offline analytics; and storage and other infrastructure needs.
Key Player: IT

Creating Inferences from Insights: One of the key factors in a successful Big data implementation is having a pool of talented data analysts who can create proper inferences from the insights and facilitate the building and definition of new analytic models. These models help in probing the data and understanding the insights.
Key Player: Data Analyst

Strategy to Actionize the Insights: The business should create processes that take these inferences as inputs to decision making. Stakeholders in decision making should be identified, and actionable inferences have to be communicated at the right time. Speed is critical to the success of Big data.
Key Player: Business

Measuring the Business Benefits: The success of the Big data initiative depends on the value it creates for the organization and its decision-making body. It should also be noted that, unlike other initiatives, Big data initiatives are usually a continuous process in search of the best results, and organizations should be attuned to this. However, it is important that a goal is set and measured to track the initiative and ensure it moves in the right direction.
Key Players: IT and Business

CONCLUSION

The move to the feedback economy presents an inevitable paradigm shift for the retail industry. Big data, as the enabling technology, will play a key role in this transformation. As ever, business needs will continue to drive technology, process and solution. However, given the criticality of Big data, organizations will need to treat Big data as an existential strategy and make the right investments to ensure they can ride the wave.

REFERENCES
1. Customer dynamics. Available at http://en.wikipedia.org/wiki/Customer_dynamics.
2. Davenport, T. and Harris, G. (2007), Competing on Analytics, Harvard Business School Publishing.
3. DeBorde, M. (2006), Do Your Organizational Dynamics Determine Your Operational Success?, The O and P Edge.
4. Lemon, K. N., Barnett White, T. and Winer, R. S., Dynamic Customer Relationship Management: Incorporating Future Considerations into the Service Retention Decision, Journal of Marketing.
5. Boyd, J. (September 3, 1976). OODA loop, in Destruction and Creation. Available at http://en.wikipedia.org/wiki/OODA_loop.
6. Doyne, S. (2012), Should Companies Collect Information About You?, NY Times. Available at http://learning.blogs.nytimes.com/2012/02/21/should-companies-collect-information-about-you/.
7. Data, data everywhere (2010), The Economist. Available at http://www.economist.com/node/15557443.
8. IDC Digital Universe (2011). Available at http://chucksblog.emc.com/chucks_blog/2011/06/2011-idc-digital-universe-study-big-data-is-here-now-what.html.
9. Gartner Says Solving ‘Big data’ Challenge Involves More Than Just Managing Volumes of Data (2011). Available at http://www.gartner.com/it/page.jsp?id=1731916.
10. Gens, F. (2012), IDC Predictions 2012: Competing for 2020. Available at http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf.
11. Bhasin, H., 4Ps of marketing. Available at http://www.marketing91.com/marketing-mix-4-ps-marketing/.
12. Amazon US site / tablets category (2012). Available at http://www.amazon.com/gp/top-rated/electronics/3063224011/ref=zg_bs_tab_t_tr?pf_rd_p=1374969722&pf_rd_s=right-8&pf_rd_t=2101&pf_rd_i=list&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=14YWR6HBVR6XAS7WD2GG.
13. Godin, S. (2008), Viral marketing. Available at http://sethgodin.typepad.com/seths_blog/2008/12/what-is-viral-m.html.
14. Wang, R. (2012), Monday’s Musings: Beyond The Three V’s of Big data – Viscosity and Virality. Available at http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/.
Harness Big Data Value and Empower Customer Experience Transformation
By Zhong Li PhD

Communication Service Providers need to leverage the 3M Framework with a holistic 5C process to extract Big Data value (BDV)

In today’s hyper-competitive experience economy, communication service providers (CSPs) recognize that product and price alone will not differentiate their business and brand. Since brand loyalty, retention and long-term profitability are now so closely aligned with customer experience, the ability to understand customers, spot changes in their behavior and adapt quickly to new consumer needs is fundamental to the success of the consumer-driven communication service industry.

The increasingly sophisticated digital consumers demand more personalized services through the channel of their choice. In fact, the internet, mobile and, particularly, the rise of social media in the past five years have empowered consumers more than ever before. There is a growing challenge for CSPs contending with an increasingly scattered relationship with customers, who can now choose from multiple channels to conduct business interactions. Recent industry research indicates that some 90% of today’s consumers in the US and Western Europe interact across multiple channels, representing a moving target that makes achieving a full view of the customer that much more challenging.

To compound this trend, always-on digital customers continuously create more data of various types, from many more touch points and with more interaction options. CSPs encounter the “Big data phenomenon” by accumulating significant amounts of customer-related information, such as purchase patterns, activities on the website, and interactions via mobile, social media, the network and the call centre.

This Big data phenomenon presents CSPs with challenges along the 3V dimensions (Fig. 1), viz.,
■■ Large Volume: Recent industry research shows that the amount of consumer transaction and interaction data a CSP has to manage has doubled in the past three years, and its growth is accelerating so that it will double again in the next two years, much of it coming from new sources including blogs, social media, internet search and networks [7].

■■ Broad Variety: Data is created in a broad variety of types, forms and formats, from multiple channels such as online, the call centre, stores and social media, including Facebook, Twitter and other social media platforms. It presents itself as structured data from transactions, semi-structured data from call records, and unstructured data in multi-media forms from social interactions.

■■ Rapidly Changing Velocity: Always-on digital consumers change the dynamics of data at the speed of light. They equally demand a fast response from CSPs to satisfy their personalized needs in real time.

Figure 1: Big Data in 3Vs is Accumulated from Multiple Channels: mobile, web, call centre, store and social channels feed data of growing volume, variety and velocity, from which value must be extracted. Source: Infosys Research

CSPs of all sizes have learned the hard way that it is very difficult to take full advantage of all of the customer interactions in Big data if they do not know what their customers are demanding or what their relative value to the business is. Even some CSPs that do segment their customers with the assistance of a customer relationship management (CRM) system struggle to take complete advantage of that segmentation in developing a real-time value strategy. In the hyper-sophisticated interaction patterns throughout the customer journey, spanning marketing, research, order, service and retention, Big data sheds a shining light that exposes treasured customer intelligence along the aspects of the 4Is, viz., interest, insight, interaction and intelligence.

■■ Interest and Insight: Customers offer their attention out of interest and share their insights. They visit a web site, make a call, access a retail store or share views on social media because they want something from the CSP at that moment: information about a product, or help with a problem. These interactions present an opportunity for the CSP to communicate with a customer who is engaged by choice and ready to share information regarding her personalized wants and needs.

■■ Interaction and Intelligence: It is typically crucial for CSPs to target offerings to particular customer segments based on the intelligence of customer data. The success of these real-time interactions, whether through online, mobile, social media or other channels, depends to a great extent on the CSP’s understanding of the customer’s wants and needs at the time of the interaction.
Therefore, alongside managing and securing Big data in the 3V dimensions, CSPs face a fundamental challenge in how to explore and harness Big data Value (BDV).

A HOLISTIC 5C PROCESS TO HARNESS BDV

Rising to the challenges and leveraging the opportunity in Big data, CSPs need to harness BDV with predictive models that provide deeper insight into customer intelligence from the profiles, behaviours and preferences hidden in Big data of vast volume and broad variety, and to deliver a superior personalized experience with fast velocity, in real time, throughout the entire customer journey.

In the past decade, most CSPs have invested significant effort in the implementation of complex CRM systems to manage customer experience. While those CRM systems bring efficiency in helping CSPs deliver on “what” to do in managing historical transactions, they lack the crucial capability of defining “how” to act in time with the most relevant interaction to maximize the value for the customer.

CSPs now need to look beyond what CRM has to offer and dive deeper to cover “how” to do things right for the customer: by capturing the customer’s subjective sentiment in a particular interaction, deriving insight and prediction on what customers demand from CSPs, and triggering proactive action to satisfy their needs, which is more likely to lead to customer delight and, ultimately, revenues. To do so, CSPs need to execute a holistic 5C process, i.e., collect, converge, correlate, collaborate and control, in extracting BDV (Fig. 2).

Figure 2: Harness BDV with a Holistic 5C Process: collect, converge, correlate, collaborate and control, applied across customer, product, promotion, order and service data. Source: Infosys Research

The holistic 5C process will help CSPs aggregate the whole interaction with a customer across time and channels, support it with a large volume and broad variety of data covering promotion, product, order and services, and define interactions in line with the customer’s preferences. The context of the customer’s relationship with the CSP, and the actual and potential value that she derives, in particular determine the likelihood that the consumer will take particular actions based on real-time intelligence. Big data can help the CSP correlate the customer’s needs with product, promotion, order and service, and deliver the right offer at the right time in the appropriate context that she is most likely to respond to.

AN OVERARCHING 3M FRAMEWORK TO EXTRACT BDV

To execute a holistic 5C process for Big data, CSPs need to implement an overarching framework that integrates the various pools of customer-related data residing in the CSP’s enterprise systems, creates an actionable customer profile, delivers insight based on that profile in real time at the customer interaction event, and effectively matches sales and service resources to take proactive actions, so as to monetize ultimate value on the fly.
The overarching framework needs to incorporate the 3M modules, i.e., Model, Monitor and Mobilize.

■■ Model Profile: It models the customer profile based on all the transactions, which helps CSPs gain insight at the individual-customer level. Such a profile requires not only integration of all customer-facing systems and enterprise systems, but also integration of all the customer interactions (such as email, mobile, online and social) within enterprise systems such as OMS, CMS, IMS and ERP, in parallel with the CRM paradigm, to model an actionable customer profile and be able to effectively deploy resources for a distinct customer experience.

■■ Monitor Pattern: It monitors customer interaction events from multiple touch points in real time, dynamically senses and triggers matching patterns of events against the defined policies and set models, and makes suitable recommendations and offers at the right time through an appropriate channel. It enables CSPs to quickly respond to changes in the marketplace (a seasonal change in demand, for example) and bundle offerings that will appeal to a particular customer, across a particular channel, at a particular time.

■■ Mobilize Process: It mobilizes a set of automations that allows customers to enjoy a personalized, engaging journey in real time that spans outbound and inbound communications, sales, orders, service and help intervention, and fulfils the customer’s next immediate demand.

The 3M framework needs to be based on an event-driven architecture (EDA) incorporating an Enterprise Service Bus (ESB) and Business Process Management (BPM), and should be application and technology agnostic. It needs to interact with multiple channels using events; match patterns in sets of events against pre-defined policies, rules and analytical models; and deliver a set of automations to fulfil a personalized experience that spans the complete customer lifecycle.
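As a rough illustration of this event-matching idea (not the framework’s actual implementation), the Python sketch below matches incoming interaction events against pre-defined rules and emits an automation to trigger. The channels, events and actions are invented for the example.

```python
# A minimal, hypothetical sketch of event-pattern matching in an
# event-driven architecture: rules pair an event pattern with an action.

RULES = [
    {"pattern": {"channel": "web", "event": "dropped_cart"},
     "action": "send_offer_email"},
    {"pattern": {"channel": "call_centre", "event": "complaint"},
     "action": "route_to_retention_csr"},
]

def match(event, pattern):
    """True when every key/value in the pattern appears in the event."""
    return all(event.get(k) == v for k, v in pattern.items())

def dispatch(event):
    """Return the first matching rule's automation, or a default."""
    for rule in RULES:
        if match(event, rule["pattern"]):
            return rule["action"]
    return "log_only"

stream = [{"channel": "web", "event": "dropped_cart", "customer": "C42"},
          {"channel": "store", "event": "purchase", "customer": "C42"}]
for e in stream:
    print(e["customer"], "->", dispatch(e))
```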
Furthermore, the 3M framework needs to be supported with key high-level functional components, which include:

■■ Customer Intelligence from Big data: A typical implementation of customer intelligence from Big data is the combination of a Data Warehouse and real-time customer intelligence analytics. It requires aggregation of customer and product data from the CSP’s various data sources in BSS/OSS, leveraging the CSP’s existing investments in data models, workflows, decision tables, user interfaces, etc. It also integrates with the key modules in the CSP’s enterprise landscape, covering:

■■ Customer Management: A complete customer relationship management solution combines a 360-degree view of the customer with intelligent guidance and seamless back-office integration to increase first contact resolution and operational efficiency.

■■ Offer Management: CSP-specific specialization and re-use capabilities that define new services, products, bundles, fulfilment processes and dependencies, and rapidly capitalize on new market opportunities and improve customer experience.

■■ Order Management: The configurable best practices for creating and maintaining a holistic order journey that is critical to the success of such product-intensive functions as account opening, quote generation, ordering, contract generation, product fulfilment and service delivery.

■■ Service Management: Case-based work automation and a complete view of each case enable effective management of every case throughout its lifecycle.

■■ Event Driven Process Automation: A dynamic process automation engine empowered with EDA leverages the context of the interaction to orchestrate the flow of activities, guiding customer service representatives (CSRs) and self-service customers through every step in their inbound and outbound interactions, in particular for Campaign Management and Retention Management.

■■ Campaign Management: Outbound interactions are typically used to target products and services to particular customer segments, based on analysis of customer data, through appropriate channels. It uncovers relevant, timely and actionable consumer and network insights to enable intelligently driven marketing campaigns to develop, define and refine marketing messages, target customers with a more effective plan, and meet customers at the touch points of their choosing through optimized display and search results, while generating demand via automated email creation, delivery and results tracking.

■■ Retention Management: Customers offer their attention, either intrusively or non-intrusively, to look for the products and services that meet their needs through the channel of their choice. This component dynamically captures consumer data from highly active and relevant outlets such as social media, websites and other social sources, and enables CSPs to quickly respond to customer needs and proactively deliver relevant offers for upgrades and product bundles that take into account each customer’s personal preference.

■■ Experience Personalization: It provides the customer with a personalized, relevant experience, enabled by business process automation that connects people, processes and systems in real time and eliminates product, process and channel silos. It helps CSPs extend predictive targeting beyond basic cross-sells to automate more of their cross-channel strategies and gain valuable insights from hidden consumption and interaction patterns.
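To suggest how such next-best-action logic might look in miniature, here is a deliberately simple scoring sketch. The fields, weights and decision rule are assumptions made for the example, not part of the 3M framework.

```python
# Hedged illustration of a next-best-action choice weighing churn risk
# against upsell potential; the scoring model is invented for the example.

def next_best_action(p):
    """Compare expected value of retaining vs. upselling this customer."""
    retention = p["churn_risk"] * p["lifetime_value"]
    upsell = (1 - p["churn_risk"]) * p["engagement"] * p["lifetime_value"]
    return "retention_offer" if retention > upsell else "cross_sell_offer"

customers = [
    {"id": "A", "churn_risk": 0.8, "lifetime_value": 900, "engagement": 0.4},
    {"id": "B", "churn_risk": 0.1, "lifetime_value": 500, "engagement": 0.9},
]
for c in customers:
    print(c["id"], "->", next_best_action(c))
# A -> retention_offer (720 vs. 72); B -> cross_sell_offer (50 vs. 405)
```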
Overall, the 3M framework will empower a BDV solution for the CSP to execute real-time decisions that align individual needs with business objectives, and dynamically fulfil the next best action or offer that will increase the value of each personalized interaction.

BDV IN ACTION: CUSTOMER EXPERIENCE OPTIMIZATION

By implementing the proposed BDV solution, CSPs can optimize the customer experience to deliver the right interaction with each customer at the right time, so as to build strong relationships, reduce churn, and increase customer value to the business.

■■ From the Customer Experience Perspective: It provides the CSP with real-time, end-to-end visibility into all the customer interaction events taking place across multiple channels; by correlating and analyzing these events using a set of business rules, it automatically takes proactive actions which ultimately lead to customer experience optimization. It helps CSPs turn their multi-channel contacts with customers into cohesive, integrated interaction patterns, allowing them to better segment their customers and ultimately take full advantage of that segmentation, delivering personalized experiences that are dynamically tailored to each customer while dramatically improving interaction effectiveness and efficiency.

■■ From the CSP’s Perspective: It helps CSPs quickly weed out underperforming campaigns and learn more about their customers and their needs. From the retail store to the contact centre to the Web to social media, it helps CSPs deliver a new standard of branded, consistent customer experiences that build deeper, more profitable and lasting relationships. It enables CSPs to maximize productivity by handling customer interactions as fast as possible in the most profitable channel.

At every point in the customer lifecycle, from marketing campaigns, offers and orders to servicing and retention efforts, BDV helps to inform interactions with the customer’s preferences, the context of her relationship with the business, and her actual and potential value, enabling CSPs to focus on creating personalized experiences that balance the customer’s needs with business values.

■■ Campaign Management: BDV delivers focused campaigns to the customer with predictive modelling and cost-effective campaign automation that consistently distinguishes the brand and supports personalized communications with prospects and customers.

■■ Offer Management: BDV dynamically generates offers that account for such factors as the current interaction with the customer, the individual’s total value across product lines, past interactions, and likelihood of defecting. It helps deliver optimal value and increases the effectiveness of propositions with next-best-action recommendations tailored to the individual customer.
■■ Order Management: BDV enables unified process automation applicable to multiple product lines, with agile and flexible workflows, rules and process orchestration that account for individual needs in product pricing, configuration, processing, payment scheduling and delivery.

■■ Service Management: BDV empowers customer service representatives to act based on the unique needs and behaviours of each customer, using real-time intelligence combined with holistic customer content and context.

■■ Retention Management: BDV helps CSPs retain more high-value customers with targeted next-best-action dialogues. It consistently turns customer interactions into sales opportunities by automatically prompting customer service representatives to proactively deliver relevant offers that satisfy each customer’s unique need.

CONCLUSION

Today’s increasingly sophisticated digital consumers expect CSPs to deliver product, service and interaction experiences designed “just for me at this moment.” To take on the challenge, CSPs need to deliver customer experience optimization powered by BDV in real time.

By implementing an overarching 3M BDV framework to execute a holistic 5C process, new products can be brought to market with faster velocity and with the ability to easily adapt common services to accommodate unique customer and channel needs.

Suffice it to say that BDV will enable CSPs to deliver a customer-focused experience that matches responses to specific individual demands; provide real-time intelligent guidance that streamlines complex interactions; and automate interactions from end to end. The result is an optimized customer experience that helps CSPs substantially increase customer satisfaction, retention and profitability, and consequently empowers CSPs to evolve into the experience-centric Tomorrow’s Enterprise.

REFERENCES
1. IBM Big data solutions deliver insight and relevance for digital media, Solution Brief, June 2012. Available at www-05.ibm.com/fr/events/netezzaDM.../Solutions_Big_Data.pdf.
2. Oracle Big data Premier, Presentation (May 2012). Available at http://premiere.digitalmedianet.com/articles/viewarticle.jsp?id=1962030.
3. SAP HANA™ for Next-Generation Business Applications and Real-Time Analytics (July 2012). Available at http://www.saphana.com/docs/DOC-1507.
4. SAS® High-Performance Analytics (June 2012). Available at http://www.sas.com/reg/gen/uk/hpa?gclid=CJKpvvCJiLQCFbMbtAodpj4Aaw.
5. Transform the Customer Experience with Pega-CRM (2012). Available at http://www.pega.com/sites/default/files/private/Transform-Customer-Experience-with-Pega-CRM-WP-Apr2012.pdf.
6. The Forrester Wave™: Enterprise Hadoop Solutions for Big data, Feb 2012. Available at http://center.uoregon.edu/AIM/uploads/INFOTEC2012/HANDOUTS/KEY_2413506/Infotec2012BigDataPresentationFinal.pdf.
7. Shah, S. (2012), Top 5 Reasons Communications Service Providers Need Operational Intelligence. Available at http://blog.vitria.com/bid/88402/Top-5-Reasons-Communications-Service-Providers-Need-Operational-Intelligence.
8. Connolly, S. and Wooledge, S. (2012), Harnessing the Value of Big data Analytics. Available at http://www.asterdata.com/wc-0217-harnessing-value-bigdata/.
Liquidity Risk Management and Big Data: A New Challenge for Banks
By Abhishek Kumar Sinha

Implement a Big Data framework and manage your liquidity risk better

During the 2008 financial crisis, banks faced an enormous challenge in managing liquidity and remaining solvent. As many financial institutions failed, those who survived the crisis came to fully understand the importance of liquidity risk management. Managing liquidity risk on simple spreadsheets can lead to non-real-time and inaccurate information that may not be enough for efficient liquidity risk management (LRM). Banks must have reliable data on daily positions and other liquidity measures, monitored continuously. At signs of stress, like changes in the liquidity of various asset classes and unfavorable market conditions, banks need to react to these changes in order to remain credible in the market. In banking, liquidity risk and reputation are so heavily linked that even a single liquidity event can lead to catastrophic funding problems for a bank.

MISMANAGEMENT OF LIQUIDITY RISK: SOME EXAMPLES OF FAILURES

Northern Rock was a star performer among UK banks until the 2007 crisis struck. Its funding was mostly wholesale funding and capital market funding; hence, in the crisis, when these funding avenues dried up across the globe, it was unable to fund its operations. During the crisis, the bank’s stock fell 32%, alongside a depositors’ run on the bank. The central bank had to intervene and support the bank in the form of deposit protection and money market operations. Later, the Government took the ultimate step of nationalizing the bank.

Lehman Brothers had $600 billion in assets before its eventual collapse. The bank’s stress testing omitted its riskiest asset, the commercial real estate portfolio, which in turn led to misleading stress test results. The bank’s liquidity was very low compared to its balance sheet size and the risks it had taken. The bank had counted deposits with clearing banks as assets in its liquidity buffer, which was not in compliance with the regulatory guidelines. The bank lost 73% of its share price during the first half of 2008, and filed for bankruptcy in September 2008.
The 2008 financial crisis has shown that the current liquidity risk management (LRM) approach is highly unreliable in a changing and difficult macroeconomic atmosphere. The need of the hour is to improve operational liquidity management on a priority basis.

THE CURRENT LRM APPROACH AND ITS PAIN POINTS

Compliance/Regulation
Across global regulators, LRM principles have become stricter and more complex in nature. The regulatory focus is mainly on areas like risk governance, measurement, monitoring and disclosure. Hence, the biggest challenge for financial institutions worldwide is to react to these regulatory measures in an appropriate and timely manner. Current systems are not equipped to handle these changes. For example, LRM protocols for stress testing and contingency funding planning (CFP) focus more on the inputs to scenario analysis and on new stress testing scenarios. These complex inputs need to be selected very carefully, and hence pose a great challenge for the financial institution.

Siloed Approach to Data Management
Many banks use a spreadsheet-based LRM approach that gets data from different sources which are neither uniform nor comparable. This leads to a great amount of risk in manual processes, and to data quality issues. In such a scenario, it becomes impossible to collate an enterprise-wide liquidity position, and the risk remains undetected.

Lack of Robust LRM Infrastructure
There is a clear lack of a robust system which can incorporate real-time data and generate the necessary actions in time. The various liquidity parameters can be changing funding costs, counterparty risks, balance sheet obligations, and the quality of liquidity in capital markets.

THE NEED FOR A READY-MADE SOLUTION

In a recent SWIFT survey, 91% of respondents indicated that there is a lack of ready-made liquidity risk analytics and business intelligence applications to complement risk integration processes. Since regulation around the globe, in the form of Basel III, Solvency II, CRD IV, etc., is still taking shape, there is an opportunity to standardize the liquidity reporting process. A solution that can do this can be of great help to banks, as it would save them both effort and time and increase the efficiency of reporting. Banks could then focus on the more complex aspects, like inputs to the stress testing process, and on the business and strategy needed to control liquidity risk. Even though banks can differ in their approach to managing liquidity, these differences can be incorporated in the solution as per requirements.

CHALLENGES/SCOPE OF REQUIREMENTS FOR LRM

The scope of requirements for LRM ranges from concentration analysis of liquidity exposures, calculation of the average daily peak of liquidity usage, historical and future views of liquidity flows (both contractual and behavioral in nature), collateral management, stress testing and scenario analysis, generation of regulatory reports, liquidity gaps across buckets, contingency fund planning, and net interest income analysis, to fund transfer pricing and capital allocation. All these liquidity measures are monitored, and alerts are generated in case thresholds are breached.
Concentration analysis of liquidity exposures shows whether the assets or liabilities of the institution depend on a certain customer, or on a product like asset- or mortgage-backed securities. It also looks at whether there is a concentration region-wise, country-wise, or by any other parameter that can be used to detect a concentration in the overall funding and liquidity situation.

Calculation of the average daily peak of liquidity usage gives a fair idea of the maximum intraday liquidity demand, so that the firm can take the necessary steps to manage liquidity in an ideal way. The idea is to detect patterns and, in times of high, low or medium liquidity, utilize the available liquidity buffer in the most optimized way.

Collateral management is very important, as the need for collateral, and its value after applying the required haircuts, have to be monitored on a daily basis. In case of unfavorable margin calls, the amount of collateral needs to be adjusted to avoid default on various outstanding positions.

Stress testing and scenario analysis is like a self-evaluation for banks, in which they need to see how bad things can get in case of high-stress events. Internal stress testing is very important to gauge the amount of loss in case of unfavorable events. For systemically important institutions, regulators have devised stress scenarios based on past crisis events. These scenarios need to be given as inputs to the stress tests, and the results have to be given to the regulators. Proper stress testing ensures that the institution is aware of what risk it is taking and of what the consequences can be.

Regulatory liquidity reports cover Basel III liquidity ratios like the liquidity coverage ratio (LCR) and net stable funding ratio (NSFR), FSA and Fed 4G guidelines, early warning indicators, funding concentration, liquid assets/collateral, and stress testing analysis. Timely completion of these reports in the prescribed format is important for financial institutions to remain compliant with the norms.

Net interest income analysis (NIIA), FTP and capital allocation are performance indicators for an institution that raises money from deposits or other avenues and lends it to customers, or makes investments to achieve a rate of return. The NII is the difference between the cost of funds and the interest earned by lending or investing the same. The implementation of FTP links the liquidity risk/market risk to the performance management of the business units. NII analysis helps in predicting the future state of the P&L statement and balance sheet of the bank.

Contingency fund planning consists of wholesale, retail and other funding reports covering both secured and unsecured funds, so that in case these funding avenues dry up, banks can look for alternatives. It states the reserve funding avenues, like the use of credit lines, repo transactions, unsecured loans, etc., that can be accessed in a timely manner and at a reasonable cost in a liquidity crisis.

Intra-group borrowing and lending reports show the liquidity position across group companies. Derivatives reports related to market value, collateral and cash flows are very important for efficient derivatives portfolio management.
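For concreteness, the toy calculation below applies the standard Basel III LCR definition (high-quality liquid assets over net 30-day stressed outflows, with inflows capped at 75% of outflows) and the NII definition given above. All the balance-sheet figures are invented for the example.

```python
# Worked toy example of two measures named above; amounts are fictitious.

def lcr(hqla, outflows_30d, inflows_30d):
    """Liquidity coverage ratio: HQLA over net cash outflows across a
    30-day stress window (inflows capped at 75% of outflows)."""
    net_outflows = outflows_30d - min(inflows_30d, 0.75 * outflows_30d)
    return hqla / net_outflows

def nii(interest_earned, cost_of_funds):
    """Net interest income: interest earned on lending/investing minus
    the cost of funds."""
    return interest_earned - cost_of_funds

print(f"LCR: {lcr(hqla=120, outflows_30d=150, inflows_30d=60):.0%}")  # 133%
print(f"NII: {nii(interest_earned=48, cost_of_funds=31)}")            # 17
```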
Bucket-wise and cumulative liquidity gaps, under business-as-usual and stress scenario situations, give a fair idea of varying liquidity across time buckets. Both contractual and behavioral cash flows are tracked to get the final inflow and outflow picture. This is done over different time horizons, from 30 days to 3 years, to get long-term as well as short-term views of liquidity. Historic cash flows are tracked, as they help in modeling the future behavioral cash flows; historical assumptions plus current market scenarios are very important in the dynamic analysis of behavioral cash flows. Other important reports relate to the available pool of unencumbered assets and to non-marketable assets.

All the scoped requirements can only be satisfied when the firm has a framework in place to take the necessary decisions related to liquidity risk. Hence, we next look into an LRM framework, as well as a data governance framework for managing liquidity risk data.

LRM FRAMEWORK

A separate group for LRM, constituted of members from the asset liability committee, the risk committee and top management, needs to be formed. This group must function independently of the other groups in the firm and must have the autonomy to take liquidity decisions. Strategic level planning helps in defining the liquidity risk policy in a clear manner, related to the overall business strategy of the firm. The risk appetite of the firm needs to be stated in measurable terms, and the same has to be communicated to all the stakeholders in the firm. Liquidity risks across the business need to be identified, and the key risk indicators and metrics decided. Risk indicators are to be monitored on a regular basis, so that in the case of an upcoming stress scenario preemptive steps can be taken. Monitoring and reporting are to be done for internal control as well as for regulatory compliance.

Figure 1: Iterative framework for effective liquidity risk management: under corporate governance, strategic level planning leads into a cycle of identifying and assessing liquidity risk, monitoring and reporting, taking corrective measures, and periodic analysis for possible gaps. Source: Infosys Research

Finally, there has to be a periodic analysis of the whole system in order to identify possible gaps in it; the frequency of review has to be at least once a year, and more frequent in extreme market scenarios.

To satisfy the scoped-out requirements, data from various sources is used to form a liquidity data warehouse and datamart, which act as inputs to the analytical engines. The engines contain the business rules and logic based on which the key liquidity parameters are calculated. All the analysis is presented in report and dashboard form, for regulatory compliance and internal risk management as well as for decision-making purposes.

Some Uses of a Big data Application in LRM
1. Staging Area Creation for the Data Warehouse: A Big data application can store huge volumes of data and perform some analysis on it, along with aggregating data for further analysis. Due to its fast processing of large amounts of data, it can be used as a loader to load data into the data warehouse, along with facilitating the extract-transform-load (ETL) processes.
Figure 2: LRM data governance framework for analytics and BI with Big data capabilities. Data sources (market data, reference data, systems of record for collateral, deposits, loans, securities and product/LOB, the general ledger and external data) feed a Big data application layer (data quality and data checks, operational data store, staging) that loads a data warehouse and datamarts via ETL, with general ledger reconciliation. Analytical engines (asset liability management, fund transfer pricing, liquidity risk and capital calculation) then drive reporting/BI: regulatory reports (Basel-related ratios NSFR and LCR, FED 4G, FSA reports, stress testing reports, regulatory capital allocation) and internal liquidity reports (net interest income analysis, ALM reports, FTP and liquidity costs, funding concentration, liquid assets, capital allocation and planning, internal stress tests, key risk indicators and other reports). Source: Infosys Research

2. Preliminary Data Analysis: Data can be moved in from various sources and a visual analytics tool then used to create a picture of what data is available and how it can be used.
3. Making Full Enterprise Data Available for High-performance Analytics: Analytics at large firms was often limited to a sample set of records on which the analytical engines would run and produce results; but since a Big data application provides distributed parallel processing capacity, the limitation on the number of records is now non-existent. Billions of records can now be processed at increasingly amazing speeds.
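A minimal sketch of the staging idea in item 1, with invented record fields and two sample quality rules, is shown below; a production implementation would typically sit on a distributed platform rather than plain Python.

```python
# Hypothetical sketch of a staging layer: run data-quality checks on raw
# position records before they are loaded into the warehouse.

RAW_FEED = [
    {"trade_id": "T1", "desk": "repo", "notional": 5_000_000, "ccy": "USD"},
    {"trade_id": "T2", "desk": "repo", "notional": None, "ccy": "USD"},
    {"trade_id": "T3", "desk": "derivatives", "notional": 750_000, "ccy": "eur"},
]

def quality_checks(rec):
    """Return a list of data-quality issues found in one record."""
    issues = []
    if rec["notional"] is None:
        issues.append("missing notional")
    if rec["ccy"] != rec["ccy"].upper():
        issues.append("non-normalized currency code")
    return issues

staged, rejected = [], []
for rec in RAW_FEED:
    issues = quality_checks(rec)
    (rejected if issues else staged).append((rec["trade_id"], issues or "ok"))

print("load to warehouse:", staged)
print("send back to source:", rejected)
```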
HOW BIG DATA CAN HELP IN LRM ANALYTICS AND BI

■■ Operational efficiency and swiftness: high-performance analytics can help achieve faster decision making, because all the required analysis is obtained much faster.

■■ Liquidity risk is a killer in today’s financial world and is most difficult to track, as large banks have diverse instruments and a large number of scenarios that need to be analyzed, like changes in interest rates, exchange rates, and liquidity and depth in markets worldwide; for such dynamic analysis, Big data analytics is a must.
■■ Stress testing and scenario analysis both require intensive computing, as a lot of data is involved; faster scenario analysis means quicker action in case of stressed market conditions. With Big data capabilities, scenarios that would otherwise take hours to run can now be run in minutes, aiding quick decision making and action.

■■ Efficient product pricing can be achieved by implementing a real-time fund transfer pricing system and profitability calculations. This ensures the best possible pricing of market risks, along with adjustments like the liquidity premium, across the business units.
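The sketch below illustrates, in miniature, why parallel execution shortens scenario runs: the same revaluation function is mapped over many shocks in a process pool. The book and the shock model are deliberately trivial stand-ins, not a real stress-testing engine.

```python
# Illustration of parallel scenario analysis with a toy book and a
# deliberately trivial shock model.

from multiprocessing import Pool

POSITIONS = [1_000_000, -250_000, 400_000]   # toy exposures

def run_scenario(shock):
    """Revalue the (toy) book under one liquidity/rate shock."""
    return shock, sum(p * shock for p in POSITIONS)

if __name__ == "__main__":
    shocks = [round(-0.05 + 0.01 * i, 2) for i in range(11)]  # -5% .. +5%
    with Pool() as pool:
        for shock, pnl in pool.map(run_scenario, shocks):
            print(f"shock {shock:+.2f} -> P&L {pnl:,.0f}")
```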
CONCLUSION

An LRM system is the key for a financial institution to survive in competitive and highly unpredictable financial markets. The whole idea of managing liquidity risk is to know the truth and be ready for the worst market scenarios. This predictability is what is needed, and it can save a bank in times like the 2008 crisis. Even at the business level, a proper LRM system can help in better product pricing using FTP, so that pricing can be logical and transparent.

Traditionally, data has been a headache for banks, seen more as a compliance and regulation requirement; but going forward there are going to be even more stringent regulations and reporting standards across the globe. After the crisis of 2008, new Basel III liquidity reporting standards and newer stress testing scenarios have been issued that require extensive data analysis, the timely completion of which is only possible with Big data applications. Everyone in the banking industry knows that the future is uncertain and high margins will always be a challenge, so efficient data management along with Big data capabilities needs to be in place. This will add value to the bank’s profile through a clear focus on new opportunities, and bring predictability to its overall business.

Successful banks in the future will be the ones who take LRM initiatives seriously and implement the system successfully. Banks with an efficient LRM system will build a strong brand and reputation in the eyes of investors, customers and regulators around the world.

REFERENCES
1. Banking on Analytics: How High-Performance Analytics Tackle Big data Challenges in Banking (2012), SAS white paper. Available at http://www.sas.com/resources/whitepaper/wp_42594.pdf.
2. New regime, rules and requirements — welcome to the new liquidity, Basel III: implementing liquidity requirements, Ernst & Young (2011).
3. Leveraging Technology to Shape the Future of Liquidity Risk Management, Sybase/Aite Group study, July 2010.
4. Managing liquidity risk: Collaborative solutions to improve position management and analytics (2011), SWIFT white paper.
5. Principles for Sound Liquidity Risk Management and Supervision, BIS Document (2008).
6. Technology Economics: The Cost of Data, Howard Rubin, Wall Street and Technology. Available at http://www.wallstreetandtech.com/data-management/231500503.
Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor
By Anil Radhakrishnan and Kiran Kalmadi

Diagnose, customize and administer health care in real time using BDMEiC

Imagine a world where day-to-day data about an individual’s health is tracked, transmitted, stored and analyzed on a real-time basis; where, worldwide, diseases are diagnosed at an early stage without the need to visit a doctor; and, lastly, where every individual has a ‘life certificate’ that contains all their health information, updated on a real-time basis. This is the world to which Big data can lead us.

Given the amount of data generated in the human body every day, e.g., body vitals, blood samples, etc., it is a haven for generating Big data. Analyzing this Big data in healthcare is of prime importance. Big data analytics can play a significant role in the early detection and advanced diagnosis of fatal diseases, which can reduce healthcare costs and improve quality.

Hospitals, medical universities, researchers and insurers will all be positively impacted by applying analytics to this Big data. However, the principal beneficiaries of analyzing it will be governments, patients and therapeutic companies.

RAMPANT HEALTHCARE COSTS

A look at the healthcare expenditure of countries like the US and the UK automatically explains the burden that healthcare places on the economy. As per data released by the Centers for Medicare and Medicaid Services, health expenditure in the US is estimated to have reached $2.7 trillion, or over $8,000 per person [1]. By 2020, this is expected to balloon to $4.5 trillion [2]. These costs will have a huge bearing on an economy that is struggling to get back on its feet, having just come out of a recession.

According to the Office for National Statistics in the UK, healthcare expenditure in the UK amounted to £140.8 billion in 2010, up from £136.6 billion in 2009 [3]. With rising healthcare costs, countries like Spain have already pledged to save €7 billion by slashing health spending while also charging more for drugs [5]; middle-income earners will now have to pay more for drugs.

This increase in healthcare costs is not isolated to a few countries alone. According to World Health Organization statistics released
in 2011, per capita total expenditure on health jumped from US$566 to US$899 between 2000 and 2008, an alarming increase of 58% [4]. This huge increase is testimony to the fact that, far from increasing steadily, healthcare costs have been increasing exponentially.

While healthcare costs have been increasing, the data generated through body vitals, lab reports, prescriptions, etc. has also been increasing significantly. Analysis of this data will lead to better and more advanced diagnosis, early detection and more effective drugs, which in turn will result in a significant reduction in healthcare costs.

HOW CAN BIG DATA ANALYTICS HELP REDUCE HEALTHCARE COSTS?

Analysis of the ‘Big data’ generated from various real-time patient records possesses a lot of potential for creating quality healthcare at reduced costs. Real time refers to data like body temperature, blood pressure, pulse/heart rate and respiratory rate that can be generated every 2-3 minutes. This data, collected across individuals, provides the volume of data at high velocity, while also providing the required variety, since it is obtained across geographies. The analysis of this data can help in reducing costs by enabling real-time diagnosis, analysis and medication, which offers:

■■ Improved insights into drug effectiveness
■■ Insights for early detection of diseases
■■ Improved insights into the origins of various diseases
■■ Insights to create personalized drugs.

These insights that Big data analytics provides are unparalleled and go a long way in reducing the cost of healthcare.

USING BIG DATA ANALYTICS FOR PERSONALIZING DRUGS

The patents of many high-profile drugs are ending by 2014. Hence, therapeutic companies need to examine the response of patients to these drugs to help create personalized drugs: drugs that are tailored to an individual patient. Real-time data collected from various patients will help generate Big data, the analysis of which will help identify how individual patients reacted to the drugs administered to them. Through this analysis, therapeutic companies will be able to create personalized drugs custom-made for an individual. The personalized drug is one of the important solutions that Big data analytics has the power to offer. Imagine a situation where analytics helps determine the exact amount and type of medicine an individual requires, without them even having to visit a doctor. That is the direction in which Big data analytics in healthcare has to move. In addition, such analytics can also significantly reduce healthcare costs that run into billions of dollars every year.

BIG DATA ANALYTICS FOR REAL-TIME DIAGNOSIS USING THE BIG DATA MEDICAL ENGINE IN THE CLOUD (BDMEiC)

Big data analytics for real-time diagnosis is characterized by real-time Big data analytics systems. These systems contain a closed-loop feedback system, where insights from the application of the solution serve as feedback for further analysis (refer Figure 1). Access to real-time data provides a quick way to accumulate and create Big data. The closed-loop feedback system is important because it helps the system build its intelligence.
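The closed loop can be pictured with the following minimal sketch: vitals come in, the engine checks them against limits, and each reading is fed back into the next cycle. The vital-sign limits and the feedback step are illustrative assumptions, not clinical logic.

```python
# A minimal sketch of a closed-loop vitals monitor; thresholds and the
# adaptation step are invented for illustration only.

PERMISSIBLE = {"systolic_bp": (90, 140), "heart_rate": (50, 100)}

def analyze(reading):
    """Flag any vital that falls outside its permissible range."""
    alerts = []
    for vital, value in reading.items():
        low, high = PERMISSIBLE[vital]
        if not (low <= value <= high):
            alerts.append(f"{vital}={value} outside [{low}, {high}]")
    return alerts

def feedback(reading, history):
    """Closed loop: every analyzed reading becomes input for the next
    analysis cycle (here just accumulated; a real engine would retrain)."""
    history.append(reading)

history = []
stream = [{"systolic_bp": 128, "heart_rate": 72},
          {"systolic_bp": 151, "heart_rate": 88}]
for reading in stream:
    alerts = analyze(reading)
    print(alerts or "within limits")
    feedback(reading, history)
```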
Figure 1: Real-Time Big Data Analytics System: real-time medical data feeds a real-time Big data analytics system; analysis of the data yields new solutions, and newer insights from those solutions are fed back for further analysis. Source: Infosys Research

These systems can not only help monitor patients in real time but can also be used to provide diagnosis, detect diseases early and deliver medication in real time. This can be achieved through a Big data Medical Engine in the Cloud (BDMEiC) [Fig. 2]. This solution would consist of:

■■ Two medical patches (arm and thigh)
■■ Analytics engine
■■ Smartphone
■■ Data Center.

As depicted above, the BDMEiC solution consists of the following:

1. Arm and thigh based electronic medical patches
An arm-based electronic medical patch (these patches are thin, lightweight and elastic, and have embedded sensors) that can monitor the patient is strapped to the arm of an individual. It reads vitals like body temperature, blood pressure, pulse/heart rate and respiratory rate to monitor brain, heart and muscle activity, etc. The patch then transmits this real-time data to the individual’s smartphone, which is synced with the patch. The extraction of the data happens at regular intervals (every 2-3 minutes). The smartphone transmits the real-time data to the data center in the medical engine. The thigh-based electronic medical patch is used for providing medication. The patch comes with a drug cartridge (pre-loaded drugs) that can be inserted into a slot in the patch. When it receives data from the smartphone, the device can provide the required medication to the patient through auto-injectors that are a part of the drug cartridge.

2. Data Center
The data center is the Big data cloud storage that receives real-time data from the medical patch and stores it. This data center will be a repository of real-time data received from different individuals across geographies. This data is then transmitted to the Big data analytics engine.

3. Big Data Analytics Engine
The Big data analytics engine performs three major functions: analyzing data, sharing analyzed data with organizations, and transmitting medication instructions back to the smartphone.

• Analyzing Data: It analyzes the data (body temperature, blood pressure, pulse/heart rate, respiratory rate, etc.) received from the data center, across individuals, using its inbuilt medical intelligence. As the system keeps analyzing this data, it also keeps building on its intelligence.
Real time Medication
Organizations Medical Engine 2
Data Center
With the analytics engine, monitoring patient data in real time, the diagnosis and treatment of
Medical Labs
Analytics 3 Engine
patients in real time is possible. With the data
Medical Universities
being shared with top research facilities and
Medical Research Centers
1 4
medical institutions in the world, the diagnosis and treatment would be more effective and
Therapeutic Companies
accurate.
Figure 2: Big Data Medical Engine in the Cloud (BDMEiC) Source: Infosys Research
Specific Instances: Blood pressure data can be monitored real time and stored in the data center. The analysis of this data by the analytics engine can keep the patients as well as doctor updated real time, if the blood pressure moves
• Sharing Analyzed Data: The analytics
beyond permissible limits.
engine also transmits its analysis to various universities, medical centers, therapeutic companies and other
Beneficiaries: Patients, medical institutions and
related organizations for further
research facilities.
research. Convenience • T r a n s m i t t i n g
Medication
The BDMEiC solution offers convenience to
Instructions: The analytics engine
patients, who would not always be in a position
also can transmit medication
to visit a doctor.
instructions to an individual’s smartphone, which in turn
Specific Instances: Body vitals can be measured
transmits data to the thigh patch,
and analyzed with the patient being at home.
whenever medication has to be
This especially helps in the case of senior citizens
provided.
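To make the data flow concrete, the following is a minimal sketch, in Java, of the kind of vitals record the arm patch might emit and the sort of threshold check the analytics engine could apply. The class, field names and limits are illustrative assumptions, not part of the proposed solution.

```java
// Hypothetical vitals reading emitted by the arm patch every 2-3 minutes.
public class VitalsReading {
    public final String patientId;
    public final long timestampMillis;
    public final double bodyTempCelsius;
    public final int systolicBp;
    public final int diastolicBp;
    public final int heartRate;
    public final int respiratoryRate;

    public VitalsReading(String patientId, long timestampMillis,
                         double bodyTempCelsius, int systolicBp, int diastolicBp,
                         int heartRate, int respiratoryRate) {
        this.patientId = patientId;
        this.timestampMillis = timestampMillis;
        this.bodyTempCelsius = bodyTempCelsius;
        this.systolicBp = systolicBp;
        this.diastolicBp = diastolicBp;
        this.heartRate = heartRate;
        this.respiratoryRate = respiratoryRate;
    }

    // Flags a reading whose blood pressure moves beyond permissible limits
    // (assumed limits; a real engine would refine these per patient).
    public boolean bloodPressureOutOfRange() {
        return systolicBp > 140 || systolicBp < 90
            || diastolicBp > 90 || diastolicBp < 60;
    }
}
```

In the closed loop described above, a record like this would be stored in the data center, scored by the analytics engine and, if flagged, trigger a medication instruction back to the thigh patch.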
[Figure 2: Big Data Medical Engine in the Cloud (BDMEiC). The medical patches (1) feed the data center (2); the analytics engine (3) analyzes the data and shares it with organizations such as medical labs, medical universities, medical research centers and therapeutic companies, while real time medication instructions (4) flow back to the patient. Source: Infosys Research]

The BDMEiC solution can act as a real time doctor that diagnoses, analyzes, and provides personalized medication to individuals. Such a solution that harnesses the potential of Big data provides manifold benefits to various beneficiaries.

BENEFITS AND BENEFICIARIES OF BDMEIC
The BDMEiC solution, if adopted in a large scale manner, can offer a multitude of benefits, a few of which are listed below.

Real Time Diagnosis and Treatment
With the analytics engine monitoring patient data in real time, the diagnosis and treatment of patients in real time is possible. With the data being shared with top research facilities and medical institutions in the world, the diagnosis and treatment would be more effective and accurate.

Specific Instances: Blood pressure data can be monitored in real time and stored in the data center. The analysis of this data by the analytics engine can keep the patient as well as the doctor updated in real time if the blood pressure moves beyond permissible limits.

Beneficiaries: Patients, medical institutions and research facilities.

Convenience
The BDMEiC solution offers convenience to patients, who would not always be in a position to visit a doctor.

Specific Instances: Body vitals can be measured and analyzed with the patient at home. This especially helps senior citizens and busy executives, who can now be diagnosed and treated right at home or while on the move.

Beneficiaries: Patients.

Insights into Drug Effectiveness
The system allows doctors, researchers and therapeutic companies to understand the impact of their drugs in real time. This helps them to create better drugs in the future.

Specific Instances: The patents of many high profile drugs are expiring by 2014. Therapeutic companies can use BDMEiC to perform real time Big data analysis to understand their existing drugs better, so that they can create better drugs in the future.

Beneficiaries: Doctors, researchers and therapeutic companies.

Early Detection of Diseases
As BDMEiC monitors, stores, and analyzes data in real time, it allows medical researchers, doctors and medical labs to detect diseases at an early stage. This allows them to provide an early cure.

Specific Instances: Early detection of diseases like cancer, childhood pneumonia, etc., using BDMEiC can help provide medication at an early stage, thereby increasing the survival rate.

Beneficiaries: Researchers, medical labs and patients.

Improved Insights into Origins of Various Diseases
With BDMEiC storing and analyzing real time data, researchers get to know the causes and symptoms of a disease much better and at an early stage.

Specific Instances: Newer strains of viruses can be monitored and researched in real time.

Beneficiaries: Researchers and medical labs.

Insights to Create Personalized Drugs
Real time data collected from BDMEiC will help doctors administer the right dose of drugs to patients.

Specific Instances: Instead of a standard pill, patients can be given the right amount of drugs, customized according to their needs.

Beneficiaries: Patients and doctors.

Reduced Costs
Real time data collected from BDMEiC assists in the early detection of diseases, thereby reducing the cost of treatment.

Specific Instances: Early detection of cancer and other life threatening diseases can lead to lesser spending on healthcare.

Beneficiaries: Government and patients.

CONCLUSION
The present state of the healthcare system leaves a lot to be desired. Healthcare costs are spiraling, and forecasts suggest that they are not poised to come down any time soon. In such a situation, organizations the world over, including governments, should look to harness the potential of real time Big data analytics to provide high quality and cost effective healthcare. The solution proposed in this paper tries to utilize this potential to bridge the gap between medical research and the final delivery of medicine.

REFERENCES
1. US Food and Drug Administration, 2012.
2. National Health Expenditure Projections 2011-2021 (January 2012), Centers for Medicare & Medicaid Services, Office of the Actuary. Available at http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/Proj2011PDF.pdf.
3. Jurd, A. (2012), Expenditure on Healthcare in the UK 1997-2010, Office for National Statistics. Available at http://www.ons.gov.uk/ons/dcp171766_264293.pdf.
4. World Health Statistics 2011, World Health Organization. Available at http://www.who.int/whosis/whostat/EN_WHS2011_Full.pdf.
5. The Ministry of Health, Social Policy and Equality, Spain. Available at http://www.msssi.gob.es/ssi/violenciaGenero/publicaciones/comic/docs/PilladaIngles.pdf.
Big Data Powered Extreme Content Hub
By Sudheeshchandran Narayanan and Ajay Sadhu

Taming the Big content explosion and providing contextual and relevant information is the need of the day

Content is getting bigger by the minute and smarter by the second [5]. As content grows in size and becomes more varied in structure, the discovery of valuable and relevant content becomes a challenge. Existing Enterprise Content Management (ECM) products are limited by scalability, variety, rigid schemas, and limited indexing and processing capability. Content enrichment is often an external activity and frequently not deployed at all. The content manager is more like a content repository and is used primarily for search and retrieval of the published content. Existing content management solutions can handle only a few data formats and provide very limited capability with respect to content discovery and enrichment.

With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. As the next generation of users will rely heavily on new modes of interacting with content, e.g., mobile devices and tablets, there is a need to re-look at traditional content management strategies. Artificial intelligence will now play a key role in information retrieval, information classification and usage for these sophisticated users. To facilitate the usage of Artificial Intelligence on this Big Content, knowledge about entities, domains, etc., needs to be captured, processed, reused and interpreted by the computer. This has resulted in the formal specification and capture of the structure of a domain, called an ontology, the classification of the entities within the domain into predefined categories, called a taxonomy, and the inter-relation of these entities to create the semantic web (web of data).

The new breed of content management solutions needs to bring in elastic indexing, distributed content storage and low latency to address these changes. But the story does not end there. The ease of deploying
technologies like natural language text analytics and machine learning now takes this new breed of content management to the next level of maturity. Time is of the essence for everyone today, and contextual filtering of content based on relevance is an immediate need. There is a need to organize content, create new taxonomies, and create new links and relationships beyond what is specified. The next generation of content management solutions should leverage ontologies, the semantic web and linked data to derive the context of the content and enrich the content metadata with this context. Then, leveraging this context, the system should provide real-time alerts as the content arrives.

In this paper, we discuss the details of the extreme content hub: its implementation semantics, technology viewpoint and use cases.

THE BIG CONTENT PROBLEM IN TODAY'S ENTERPRISES
Legacy Content Management Systems (CMS) focused on addressing the fundamental problems in content management, i.e., content organization, indexing and searching. With the evolution of the internet, these CMS' added Content Publishing Lifecycle Management (CPLM) and workflow capabilities to the overall offering. The focus of these ECM products was on providing a solution for enterprise customers to easily store and retrieve various documents and on providing a simplified search interface. Some of these solutions evolved to address the web publishing problem. These existing content management solutions have constantly shown performance and scalability concerns. Enterprises have invested in high end servers and hired performance engineering experts to address this. But will this last long?

[Figure 1: Augmented Capabilities of Extreme Content Hub Manager. Core features (indexing, search, workflow, metadata repository, content versioning) are augmented with heterogeneous content ingestion, automated content discovery, content enrichment, unified intelligent content access and insights, and a highly available, elastic, scalable system. Source: Infosys Research]
With the arrival of Big data (volume, variety and velocity), these problems have amplified further and the need for next generation content management capabilities has evolved further. Requirements and demand have gone beyond just the storing, searching and indexing of traditional documents. Enterprises need to store a wide variety of content ranging from documents, videos, social media feeds, blog posts, podcasts, images, etc. The extraction, enrichment, organization and management of semi-structured, unstructured and multi-structured content and media are a big challenge today. Enterprises are under tremendous competitive pressure to derive meaningful insights from these piles of information assets and to derive business value from this Big data. Enterprises are looking for contextual and relevant information at lightning speed. The ECM solution must address all of the above technical and business requirements.

EXTREME CONTENT HUB: KEY CAPABILITIES
The key capabilities required for the Extreme Content Hub (ECH), apart from the traditional indexing, storage and search capabilities, can be classified along the following five dimensions (Fig. 2).

Heterogeneous Content Ingestion that provides input adapters to bring a wide variety of content (documents, videos, images, blogs, feeds, etc.) into the content hub seamlessly. The next generation content management system also needs to support real-time content ingestion for RSS feeds, news feeds, etc., and the ingestion of streams of events, as key capabilities for content ingestion.

Automated Content Discovery that extracts the metadata and classifies the incoming content seamlessly against pre-defined ontologies and taxonomies.

Scalable, Fault-tolerant Elastic System that can seamlessly expand to meet the volume, velocity and variety growth of the content.

Content Enrichment services that leverage machine learning and text analytics technologies to enrich the context of the incoming content.

Unified Intelligent Content Access that provides a set of content access services that are context aware and based on information relevance through user modeling and personalization.

[Figure 2: Extreme Content Hub. Heterogeneous content (social feeds, enterprise log feeds, news/RSS feeds, existing enterprise content) is ingested through a metadata driven processing framework into a Hadoop distributed file system with HBase index and link storage; machine learning algorithms, an auto classifier, a recommendation engine and a rule engine classify and enrich the content; unified enterprise content access is provided through content, search, classification and alert/content API services, dashboards and knowledge feeds to existing systems.]

To realize the ECH, there is a need to augment the existing search and indexing technologies with the next generation of machine learning and text analytics to bring in a cohesive platform. The existing content management solution still provides quite a good list of features that cannot be ignored.

BIG DATA TECHNOLOGIES: RELEVANCE FOR THE CONTENT HUB
With the advent of Big data, the technology landscape has made a significant shift. Distributed computing has now become a key enabler for large scale data processing, and with open source contributions it has received a significant boost in recent years. The year 2012 was the year of large scale Big data technology adoption.

The other significant advancement has been in NoSQL (Not Only SQL) technology, which complements existing RDBMS systems for scalability and flexibility. The scalable, near real-time access provided by these systems has boosted the adoption of distributed
computing for real-time data storage and indexing needs.

Scalable and elastic deployments, enabled by advancements in private and public cloud deployments, have accelerated the adoption of distributed computing in enterprises. Overall, there is a significant change from our earlier approach of solving the ever increasing data and performance problem by throwing more hardware at it. Today, the ability to deploy a scalable distributed computing infrastructure that not only addresses the velocity, variety and volume problem but does so as a cost effective alternative using open source technologies provides the business case for building the ECH. The solution is to augment the existing content management solution with the processing capabilities of Big data technologies to create a comprehensive platform that brings in the best of both worlds.

REALIZATION OF THE ECH
The ECH requires a scalable, fault tolerant, elastic system that provides scalability across storage, compute and network infrastructure. Distributed processing technologies like Hadoop provide the foundation platform for this. A private cloud based deployment model will provide the on-demand elasticity and scale required to set up such a platform.

A metadata model driven ingestion framework could ingest a wide variety of feeds into the hub seamlessly. Content ingestion could deploy content security tagging during the ingestion process to ensure that the content stored inside the hub is secured and authorized before access.

NoSQL technologies like HBase and MongoDB could provide for the scalable metadata repository needs of the system.
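As an illustration of that idea, the following is a minimal sketch, assuming a hypothetical content_metadata table and column names, of how extracted metadata and a security tag could be written to HBase using its standard Java client (classic HBase 1.x style API; details vary by version).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetadataRepository {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("content_metadata"))) {

            // Row key = content id; one column family holds the extracted metadata.
            Put put = new Put(Bytes.toBytes("doc-001"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("mimeType"),
                          Bytes.toBytes("application/pdf"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("category"),
                          Bytes.toBytes("marketing"));
            // Security tag applied during ingestion, checked before access.
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("securityTag"),
                          Bytes.toBytes("internal"));
            table.put(put);
        }
    }
}
```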
Search and indexing technologies have matured to the next level after the advent of Web 2.0, and deploying a scalable indexing service like Solr, ElasticSearch, etc., provides the much needed scalable indexing and search capability required for the platform.
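As a concrete illustration, here is a minimal sketch of pushing an enriched document into a Solr index using the SolrJ client. The core name, URL and field names are assumptions for illustration, and exact client class names vary across Solr versions.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ContentIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/contenthub").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-001");
        doc.addField("title", "Q3 product launch plan");
        doc.addField("content_type", "document");
        doc.addField("category", "marketing");   // from the auto classifier
        doc.addField("entities", "productX");    // from metadata extraction

        solr.add(doc);    // index the enriched document
        solr.commit();    // make it searchable
        solr.close();
    }
}
```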
Deploying machine learning algorithms leveraging Mahout and R on this platform can bring in auto-discovery of content metadata and auto-classification for content enrichment. De-duplication and other value added services can be seamlessly deployed as batch frameworks on the Hadoop infrastructure to bring value added context to the content. Machine learning and text analytics technologies can be further leveraged to provide recommendation and contextualization of user interactions, delivering unified context aware services.

BENEFITS OF ECH
The ECH is at the center of enterprise knowledge management and innovation. Serving contextual and relevant information to users will be one of the fundamental usages of the ECH.

Auto-indexing will help discover multiple facets of the content and help in discovering new patterns and relationships between the various entities that would have gone particularly unnoticed in the legacy world. The integrated metadata view of the content will help in building a 360 degree view of a particular domain or entity from the various sources.

The ECH could enable the discovery of users' tastes and likings based on the content they search and view. This could serve real-time recommendations to users through content hub services and could help the enterprise in specific user behavior modeling. Emerging trends in various domains can be discovered as content gets ingested into the hub.

The ECH could also extend into an analytics platform for video and text analytics. Real-time information discovery can be facilitated using pre-defined alerts/rules that get triggered as new content arrives in the hub.

The derived metadata and context could be pushed to the existing content management solution, both to preserve the benefits of and investments in the existing products and platforms, and to augment their processing and analytics capabilities with new technologies.

The ECH will thus be able to handle large volumes and a wide variety of content formats, and bring in deep insights by leveraging the power of machine learning. These solutions will be very cost effective and will also leverage the existing investment in the current CMS.

CONCLUSION
There is a need to take a platform centric approach to this Big content problem rather than a standalone content management solution, to look at it strategically, and to adopt a scalable architecture platform to address it. However, such an initiative does not need to replace the existing content management solutions; rather, it should augment their capabilities to fill in the required white spaces. The approach discussed in this paper provides one such implementation of the augmented content hub, leveraging the current advancements in Big data technologies. Such an approach will provide the enterprise with a competitive edge in the years to come.

REFERENCES
1. Agichtein, E., Brill, E. and Dumais, S. (2006), Improving web search ranking by incorporating user behavior. Available at http://research.microsoft.com/en-us/um/people/sdumais/.
2. Dumais, S. (2011), Temporal Dynamics and Information Retrieval. Available at http://research.microsoft.com/en-us/um/people/sdumais/.
3. Reamy, T. (2012), Taxonomy and Enterprise Content Management. Available at http://www.kapsgroup.com/presentations.shtml.
4. Reamy, T. (2012), Enterprise Content Categorization – How to Successfully Choose, Develop and Implement a Semantic Strategy. Available at http://www.kapsgroup.com/presentations/ContentCategorization-Development.pdf.
5. Barroca, E. (2012), Big data's Big Challenges for Content Management, TechNewsWorld. Available at http://www.technewsworld.com/story/74243.html.
Complex Events Processing: Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur

Analyze, crunch and detect unforeseen conditions in real time through CEP of Big Data

A study by The Economist revealed that 1.27 zettabytes was the amount of information in existence as household data in 2010 [1]. The Wall Street Journal reported Big data as the new boss in all key sectors, such as education, retail and finance. On the other side, an average Fortune 500 enterprise is estimated to have around 10 years' worth of customer data, with more than two-thirds of it unusable. How can enterprises make such an explosion of data usable and relevant? Not trillions but quadrillions of data points await analysis overall, a volume that is expected to increase exponentially and that evidently impacts businesses worldwide. An additional problem is that of providing speedier results, which will only get slower with more data to analyze unless technologies innovate at the same pace.

Any function or business, whether it is road traffic control, high frequency trading, auto adjudication of insurance claims or controlling the supply chain logistics of electronics manufacturing, requires huge data sets to be analyzed, as well as timely processing and decision making. Any delay, even in seconds or milliseconds, affects the outcome. Significantly, technology should be capable of interpreting historical patterns, applying them to current situations and taking accurate decisions with minimal human interference.

Big data is about the strategy to deal with vast chunks of incomprehensible data sets. There is now awareness across industries that traditional methods of data storage and processing, like databases, files, mainframes or even mundane caching, cannot be used as a solution for Big data. Still, the existing models do not address the capabilities of processing and analysis of data, integration with events, and real time analytics, all in split second intervals.

On the other hand, Complex Event Processing (CEP) has evolved to provide solutions that utilize in-memory data grids for analyzing trends, patterns and events in real time, with assessments in a matter of milliseconds. However, Event Clouds, a byproduct of using CEP techniques, can be further leveraged to monitor for unforeseen conditions birthing, or even the emergence of an unknown-unknown, creating early awareness and a potential first mover advantage for the savvy organization.

To set the context of the paper, we first highlight how CEP with in-memory data grid technologies helps in pattern detection, matching, analysis, processing and decision making in split seconds with the usage of Big data. This model should serve any industry function where time is of the essence, Big data is at the core, and CEP acts as the mantle. Later, we propose treating an Event Cloud as more than just an event collection bucket used for event pattern matching, or as simply the immediate memory store of an exo-cortex for machine learning; an Event Cloud is also a robust corpus with its own intrinsic characteristics that can be measured, quantified, and leveraged for advantage. For example, by automating the detection of a shift away from an Event Cloud's steady state, the emergence of a previously unconsidered situation may be observed. It is this application, programmatically discerning the shift away from an Event Cloud's normative state, that is explored in this paper.

CEP AS A REAL TIME MODEL FOR BIG DATA: SOME RELEVANT CASES
In current times, traffic updates are integrated with city traffic control systems as well as with the global positioning service (GPS) electronic receivers used quite commonly by drivers. These receivers automatically adjust and reroute in case the normal route is traffic ridden. This helps, but the solution is reactionary. Many technology companies are investing in pursuit of the holy grail: a solution that detects and predicts traffic blockages and takes proactive action to control the traffic itself and even avoid mishaps. For this, there is a need to analyze traffic data over different parameters, such as rush hour, accidents, and the seasonal impacts of snow, thunderstorms, etc., and come up with predictable patterns over years and decades. The second step is the application of these patterns to input conditions. All this requires huge data crunching and analysis, topped by a real time application such as CEP.

Big data has already gained importance in the financial markets, particularly in high frequency trading. Since the 2008 economic downturn and its rippling effects on the stock market, the volume of trade has come down at all the top exchanges, such as New York, London, Singapore, Hong Kong and Mumbai. The contrasting factor, however, is the rise in High Frequency Trading (HFT). It is claimed that around 70% of all equity trades were accounted for by HFT in 2010, versus 10% in 2000. HFT is 100% dependent on technology, and the trading strategies are developed out of complex algorithms. Only those traders who have developed a better strategy and can crunch more data in less time will have a better win ratio. This is where CEP could be useful.

The healthcare industry in the USA is set to undergo a rapid change with the Affordable Care Act. Healthcare insurers are expected to see an increase in their costs due to the increased risk of covering more individuals, and legally they cannot deny insurance on the basis of pre-existing conditions. Hospitals are expected to see more patient data, which means increased analyses, and pharmaceutical companies need better integration with the insurers and consumers for speedier and more accurate settlements. Even though most of these transactions can be performed on a non-real time basis, the technology still needs both Big data and complex processing for a scalable solution.

In India, the outstanding cases in the various judicial courts touch 32 million. In the USA, family based cases and immigration related ones
are piling up waiting for a hearing. Judicial pendency has left no country untouched. Scanning through various federal, state and local law points, past rulings, class suits, individual profiles, evidence details, etc., is required to put forward the cases for the parties involved, and the winner is the one who is able to present a better analysis of the available facts. Can technology help in addressing such problems across nations?

All of these cases, across such diverse industries, showcase the importance of processing gigantic amounts of data and the need to have the relevant information churned out at the right time.

WHY AND WHERE BIG DATA
Big data has evolved due to the existing limitations of current technologies. A two-tier or multi-tier architecture, even with a high performing database at one end, is not enough to analyze and crunch such colossal information in the desired time frames. The fastest databases today are benchmarked at terabytes of information, as noted by the Transaction Processing Council [2, 3]. Volumes of exabytes and zettabytes of data need a different technology. The analysis of unstructured data is another criterion for the evolution of Big data. Information available as part of health records, geo maps and multimedia (audio, video and pictures) is essential for many businesses, and mining such unstructured sets requires storage power as well as transaction processing power. Add to this the variety of sources, such as social media, legacy systems, vendor systems, localized data, and mechanical and sensor data. Finally, there is the critical component of speed, to get the data through the steps of Unstructured → Structured → Storage → Mine → Analyze → Process → Crunch → Customize → Present.

BIG DATA METHODOLOGIES: SOME EXAMPLES
The Apache™ Hadoop™ project [4] and its relatives such as Avro™, ZooKeeper™, Cassandra™ and Pig™ provided the non-database form of technology as the way to solve problems with massive data. It used distributed architecture as the foundation to remove the constraints of traditional constructs.

Both data (storage, transportation) and processing (analysis, conversion, formatting) are distributed in this architecture. Figures 1 and 2 compare the traditional and distributed architectures.

[Figure 1: Conventional Multi-Tier Architecture, with client, middle and server tiers. Source: Infosys Research]

[Figure 2: Distributed Multi-Nodal Architecture, with a client tier served by distributed data and processing nodes that validate, enrich, transform, standardize, route and operate on the data. Source: Infosys Research]
A key advantage of distributed architecture is scalability. Nodes can be added without affecting the design of the underlying data structures and processing units.

IBM has even gone a step further with Watson [5], the famous artificial intelligence computer, which can learn as it gets more information and patterns for decision making. Similarly, IBM [6], Oracle [7], Teradata [8] and many other leading software providers have created Big data methodologies as an impetus to help enterprise information management.

VELOCITY PROBLEM IN BIG DATA
Even though the benefits of Big data are clear and its architecture can easily be applied to any industry, there are some limitations that are not easily perceivable. A few pointers:

■■ Can Big data help a trader get the best win scenarios based on millions or even billions of computations over multiple trading parameters in real time?

■■ Can Big data forecast traffic scenarios based on sensor data, vehicle data, seasonal changes and major public events, and provide alternate paths to drivers through their GPS devices in real time, helping both city officials and drivers save time?

■■ Can Big data detect fraud scenarios by running through multiple shopping patterns of a user in historical data and matching them against the current transaction in real time?

■■ Can Big data provide real time analytical solutions out of the box and support predictive analytics?

There are multiple business scenarios in which data has to be analyzed in real time. These data are created, updated and transferred because of real time business or system level events. Since the data is in the form of real time events, a paradigm shift is required in the way data is viewed and analyzed. Real time data analysis in such cases means that the data has to be analyzed before it hits the disk. The difference between 'event' and 'data' just vanishes.

In such cases across the industry, where Big data is unequivocally needed to manage the data, but the data must also be used effectively, integrated with real time events and turned into express results for the business, a complementary technology is required, and that is where CEP fits in.

VELOCITY PROBLEM: CEP AS A SOLUTION
The need here is the analysis of data arriving in the form of real time event streams and the identification of patterns or trends based on vast historical data. Adding to the complexity are other real time events. The vastness is solved by Big data, while the real time analysis of multiple events, pattern detection, and appropriate matching and crunching are solved by CEP.

Real time event analysis avoids duplicates and synchronization issues, as the data is still in flight and storage is still a step away. Similarly, it facilitates predictive analysis of data by means of pattern matching and trending. This enables the enterprise to provide early warning signals and take corrective measures in real time itself.

The reference architecture of traditional CEP is shown in Figure 3.

[Figure 3: Complex Events Processing Reference Architecture. Event generation and capture (event originators, event catalog, domain object model) feed an event processing engine (event pre-filtering, preprocessing, refinement, aggregation and correlation, pattern matching, event handlers, actions, visualization), supported by event modeling and management (CEP languages, domain specific algorithms, patterns, metadata repository, persistence models, storage options, event attributes and relationships), developer and business user tools, and cross-cutting concerns such as security and authentication, access management, scalability, memory management, failure and recovery, and monitoring and administration. Source: Infosys Research]

CEP's original objective was to provide processing capability similar to Big data, with
distributed architecture and in-memory grid computing. The difference was that CEP was to handle multiple, seemingly unrelated events and correlate them to provide a desired and meaningful output. The backbone of CEP, though, can be traditional architectures such as multi-tier technologies, with CEP usually in the middle tier.

Figure 4 shows how CEP on Big data solves the velocity problem and complements the overall information management strategy of any enterprise that aims to use Big data. CEP can utilize Big data particularly through highly scalable in-memory data grids that store the raw feeds, events of interest and detected events, and can analyze this data in real time by correlating it with other in-flight events. Fraud detection is a very apt example: the historic data of the customer's transactions, usage profile, location, etc., is stored in the in-memory data grid, and every new event (transaction) from the customer is analyzed by the CEP engine by correlating and applying patterns on the event data together with the historic data stored in the memory grid.

There are multiple scenarios, some of them outlined in this paper, where CEP complements Big data and other offline analytical approaches to accomplish an active and dynamic event analytics solution.

EVENT CLOUDS AND DETECTION TECHNIQUES
CEP and Event Clouds
A linearly ordered sequence of events is called an event stream [9]. An event stream may contain many different types of events, but there must be some aspect of the events in the event stream that allows for a specific ordering. This is typically an ordering via timestamp.
[Figure 4: CEP on Big Data. The CEP reference architecture is augmented with an in-memory DB or data grid, a query agent and a write connector to Big data storage, feeding dashboards and event consumers. Source: Infosys Research]
By watching an event stream for event patterns of interest, such as multiple usages of the same credit card at a gas station within a 10 minute window, systems can respond with predefined, business driven behaviors, such as placing a fraud alert on the suspect credit card.
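To make this concrete, here is a minimal sketch of such a pattern expressed in EPL, the query language of the open source Esper CEP engine mentioned in the references [16]. The event type, field names and threshold are illustrative assumptions, and API details vary across Esper versions.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class FraudAlertExample {
    // Hypothetical event emitted for every card swipe.
    public static class CardSwipe {
        private final String cardNumber;
        private final String merchantType;
        public CardSwipe(String cardNumber, String merchantType) {
            this.cardNumber = cardNumber;
            this.merchantType = merchantType;
        }
        public String getCardNumber() { return cardNumber; }
        public String getMerchantType() { return merchantType; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("CardSwipe", CardSwipe.class);
        EPServiceProvider engine =
            EPServiceProviderManager.getDefaultProvider(config);

        // More than two swipes of the same card at gas stations within a
        // sliding 10 minute window triggers the listener.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select cardNumber, count(*) as swipes "
            + "from CardSwipe(merchantType='GAS_STATION').win:time(10 min) "
            + "group by cardNumber having count(*) > 2");

        stmt.addListener((newEvents, oldEvents) -> {
            if (newEvents != null) {
                System.out.println("Fraud alert for card "
                    + newEvents[0].get("cardNumber"));
            }
        });

        engine.getEPRuntime().sendEvent(
            new CardSwipe("4111-0000-0000-0000", "GAS_STATION"));
    }
}
```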
An Event Cloud is "a partially ordered set of events (POSET), either bounded or unbounded, where the partial orders are imposed by the causal, timing and other relationships between events" [10]. As such, it is a collection of events within which the ordering of events may not be possible. Further, there may or may not be an affinity between the events within a given Event Cloud. If there is an affinity, it may be as broad as "all events of interest to our company" or as specific as "all events from the emitters located at the back of the building."

Event Clouds and event streams may contain events from sources outside of an organization, such as stock market trades or tweets from a particular Twitter user. Event Clouds and event streams may hold business events, operational events, or both. Strictly speaking, an event stream is an Event Cloud, but an Event Cloud may or may not be an event stream, as dictated by the ordering requirement.

Typically, a landscape with CEP capabilities will include three logical units: (i) emitters that serve as sources of events, (ii) a CEP engine, and (iii) targets to be notified under certain event conditions. Sources can be anything from an application to a sensor to even the CEP engine itself. CEP engines, the heart of the system, are implemented in one of two fundamental ways. Some follow the rules based paradigm, matching on explicitly stated event patterns using algorithms like Rete, while other CEP engines use the more sophisticated event analytics approach, looking
for probabilities of event patterns emerging using techniques like Bayesian Classifiers [11]. In either case, rules or analytics, some consideration of what is of interest must be identified up front. Targets can be anything from dashboards to applications to the CEP engine itself.

Users of the system, using the tools provided by the CEP provider, articulate the events and patterns of events that they are interested in exploring, observing and/or responding to. For example, a business user may indicate to the system that for every sequence wherein a customer asks about a product three times but does not invoke an action that results in a buy, the system is to provide some promotional material to the customer in real time. As another example, a technical operations department may issue event queries to the CEP engine, in real time, asking about the number of server instances being brought online and the probability that there may be a deficit in persistence storage to support the servers.

Focusing on events, while extraordinarily powerful, biases what can be cognized. That is, what you can think of, you can explore. What you can think of, you can respond to. However, by adding the Event Cloud, or event stream, to the pool of elements being observed, emergent patterns not previously considered can be brought to light. This is the crux of this paper: using the Event Cloud as a porthole into unconsidered situations as they emerge.

EVENT CLOUDS HAVE FORM
As represented in Figure 5, there is a point wherein events flowing through a CEP engine are unprocessed. This point is an Event Cloud, which may or may not be physically located within a CEP engine's memory space. This Event Cloud has events entering its logical space and leaving it. The only bias in the events travelling through the CEP engine's Event Cloud comes from which event sources are serving as inputs to that particular CEP engine. For environments wherein all events, regardless of source, are sent to a common CEP engine, there is no bias of events within the Event Cloud.

[Figure 5: CEP Engine Components. Events arrive through input adapters onto an event ingress bus, pass through filters into the Event Cloud, are processed through union, rule application, correlation and matching, and exit through an output bus and output adapters. Source: Infosys Research]

There are a number of attributes of an Event Cloud that can be captured, depending upon a particular CEP implementation. For example, if an Event Cloud is managed
in memory and is based on a time window (e.g., events of interest only stay within consideration by the engine for a period of time), then the number of events contained within an Event Cloud can be counted. If the structure holding an Event Cloud expands and contracts with the events it is funneling, then the memory footprint of the Event Cloud can be measured. In addition to the number of events and the memory size of the containing unit, the counts of the event types themselves that happen to be present at a particular time within the Event Cloud become a measurable characteristic. These properties, viz., memory size, event counts and event types, can serve as measurable characteristics describing an Event Cloud, giving it a size and shape (Figure 6).

[Figure 6: Event Cloud. The events traversing an Event Cloud at any particular moment give it shape and size. Source: Infosys Research]
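As a minimal illustration (an assumption, not the paper's code), the measurable shape of one Event Cloud instance could be captured in a structure like the following, which the scoring sketch later in this section can consume.

```java
import java.util.Map;

// The three measurable properties that give an Event Cloud instance
// its size and shape: memory footprint, total events and per-type counts.
public class EventCloudShape {
    public final long memoryFootprintBytes;
    public final long totalEvents;
    public final Map<String, Long> countsByEventType;

    public EventCloudShape(long memoryFootprintBytes, long totalEvents,
                           Map<String, Long> countsByEventType) {
        this.memoryFootprintBytes = memoryFootprintBytes;
        this.totalEvents = totalEvents;
        this.countsByEventType = countsByEventType;
    }
}
```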
EVENT CLOUD STEADY STATE
The properties of an Event Cloud that give it form can be used to measure its state. By collecting its state over time, a normative operating behavior can be identified and its steady state determined. This steady state is critical when watching for unpredicted patterns. When a new flow pattern of events causes an Event Cloud's shape to shift away from its steady state, a situation change has occurred (Figure 7). When these steady state deviations happen, and no new matching patterns or rules are being invoked, then an unknown-unknown may have emerged. That is, something significant enough to warrant adjusting your system's operating characteristics has occurred, yet it is not being acknowledged in any way. Either it had been predicted but determined to not be important, or it was simply not considered.

[Figure 7: Event Cloud Shift. The Event Cloud's shape shifts as new event patterns occur. Source: Infosys Research]

ANOMALY DETECTION APPLIED TO EVENT CLOUD STEADY STATE SHIFTS
Finding patterns in data that do not match a baseline pattern is the realm of anomaly detection. As such, by using the steady state of an Event Cloud as the baseline, we can apply anomaly detection techniques to discern a shift.

Table 1 presents a catalog of various anomaly detection techniques that are applicable to Event Cloud shift discernment. This list is not meant to be an exhaustive compilation, but rather to showcase the variety of possibilities.
Table 1: Applicability of Anomaly Detection Techniques to Event Cloud Steady State Shifts

Classification Based. Example constituent techniques: Neural Networks, Bayesian Networks, Support Vector Machines, Rule Based. Challenge: accurately labeled training data for the classifiers is difficult to obtain.

Nearest Neighbour Based / Clustering Based. Example constituent techniques: Distance to kth Nearest Neighbour, Relative Density. Challenge: defining meaningful distance measures is difficult.

Statistical. Example constituent techniques: Parametric, Non-Parametric. Challenge: histogram approaches miss unique combinations.

Spectral. Example constituent techniques: Low Variance PCA, Eigenspace Based. Challenge: high computational complexity.

Source: Derived from Anomaly Detection: A Survey [12]
Each algorithm has its own set of strengths, such as simplicity, speed of computation, and certainty scores. Each algorithm likewise has weaknesses, including computational demands, blind spots in data deviations, and difficulty in establishing a baseline for comparison. All of these factors must be considered when selecting an appropriate algorithm.

Using the three properties defined for an Event Cloud's shape (event counts, event types and Event Cloud size) combined with time properties, we have a multivariate data instance, with three of the dimensions being continuous types, viz., counts, sizes and time, and one being categorical, viz., types. These four dimensions, and their characteristics, become a constraint on which anomaly detection algorithms can be applied [13].

The anomaly type being detected is also a constraint. In this case, the Event Cloud deviations are being classified as a collective anomaly, as opposed to a point anomaly or context anomaly, because we are comparing a collection of data instances that form the Event Cloud shape with a broader set of all the data instances that formed the Event Cloud steady state shape.

Statistical algorithms lend themselves well to anomaly detection when analyzing continuous and categorical data instances. Further, knowing an Event Cloud's steady state shape a priori is not assumed, so the use of a non-parametric statistical model is appropriate [13]. Therefore, the technique of statistical profiling using histograms is explored as an example implementation approach for catching a steady state shift.

One basic approach to trap the moment of an Event Cloud's steady state shift is to leverage a histogram for each event type, with the number of times a particular count of an event type shows up in a given Event Cloud instance becoming the basis for comparison. The histogram generated over time would then serve as the baseline steady state picture of normative behavior. Individual instances of an Event Cloud's shape could then be compared to the Event Cloud's steady state histogram to discern whether a deviation has occurred. That is, does the particular Event Cloud instance contain counts of events that have rarely, or never, appeared in the Event Cloud's history?

Figure 8 represents the case with a steady state histogram on the left and the Event Cloud comparison instance on the right. In this depiction the histogram shows, as an example, that three Ask Events were contained within an Event Cloud instance exactly once in the history of this Event Cloud. The Event Cloud instance on the right, which will be compared against it, has six Ask Events in its snapshot state.

[Figure 8: Event Cloud Histogram and Instance Comparison. A steady state histogram of Look, Ask and Buy event counts is compared against an Event Cloud comparison instance. Source: Infosys Research]

An anomaly score for each event type is calculated by comparing each Event Cloud instance's event type count to the event type quantity occurrence bins within the Event Cloud steady state histogram; these individual scores are then combined into an aggregate score [13]. This aggregate score becomes the basis upon which a judgment is made regarding whether a deviation has occurred or not.

While simple to implement, the primary weakness of the histogram based approach is that a rare combination of events in an Event Cloud would not be detected if the quantities of the individual events present were in their normal or frequent quantities.
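The following is a minimal sketch of this histogram profiling and scoring scheme, under assumed names and an assumed rarity formula; the paper itself does not prescribe an implementation. It builds the per-event-type occurrence bins from observed instances and scores a new instance by how rarely its counts have appeared before.

```java
import java.util.HashMap;
import java.util.Map;

public class EventCloudProfile {
    // event type -> (observed count value -> how often that value occurred)
    private final Map<String, Map<Long, Long>> histogram = new HashMap<>();
    private long instancesObserved = 0;

    // Folds one Event Cloud instance (event type -> count) into the baseline.
    public void recordInstance(Map<String, Long> instanceCounts) {
        instancesObserved++;
        for (Map.Entry<String, Long> e : instanceCounts.entrySet()) {
            histogram.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                     .merge(e.getValue(), 1L, Long::sum);
        }
    }

    // Aggregate anomaly score: 0 = fully normal; higher = more deviant.
    public double anomalyScore(Map<String, Long> instanceCounts) {
        double score = 0.0;
        for (Map.Entry<String, Long> e : instanceCounts.entrySet()) {
            Map<Long, Long> bins =
                histogram.getOrDefault(e.getKey(), new HashMap<>());
            long timesSeen = bins.getOrDefault(e.getValue(), 0L);
            // Rarity of this count value in the baseline history.
            score += 1.0 - (double) timesSeen / Math.max(1, instancesObserved);
        }
        return score;
    }
}
```

Scoring six Ask Events against a history in which that count never occurred contributes a per-type score close to 1; a threshold on the aggregate score then flags a steady state shift. As noted above, this simple scheme cannot catch a rare combination of individually normal counts.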
LIMITATIONS OF EVENT CLOUD SHIFTS
Anomaly detection algorithms have blind spots, or situations where they cannot discern an Event Cloud shift. This implies that it is possible for an Event Cloud to shift undetected under just the right circumstances. However, following the lead suggested by Okamoto and Ishida with immunity-based anomaly detection systems [13], rather than having a single observer detecting when an Event Cloud deviates from steady state, a system could have multiple observers, each with their own techniques and approaches applied. Their individual results could then be aggregated, with varying weights applied to each technique, to render a composite Event Cloud steady state shift score. This will help remove the chances of missing a state change shift.

With the approach outlined in this paper, the scope of indicators is such that you get an early indication that something new is emerging, and nothing more. Noticing an Event Cloud shift only indicates that a situational change has occurred; it does not identify or highlight the root cause of the change, nor does it fully explain what is happening. Analysis is still required to determine what initiated the shift, along with what opportunities for exploitation may be present.

FURTHER RESEARCH
Many enterprise CEP implementations are architected in layers, wherein event abstraction hierarchies, event pattern maps and event processing networks are used in concert to increase the visibility aspects of the system [14] as well as to help with overall performance by allowing for the segmenting of event flows. In general, each layer going up the hierarchy is an aggregation of multiple events from its immediate child layer. With the lowest layer containing the finest grained events and the highest layer containing the coarsest grained events, the Event Clouds that manifest at each layer are likewise of varying granularity (Figure 9). Therefore a noted Event Cloud steady state shift at the lowest layer represents the finest granularity shift that can be observed, while an Event Cloud steady state shift at the highest layer represents the coarsest steady state shift that can be observed. Techniques for interleaving individual layer Event Cloud steady state shifts, along with the opportunities and consequences of their mixed granularity, can be explored.

[Figure 9: Event Hierarchies. CEP in layers, with Event Clouds of varying granularity manifesting at each layer. Source: Infosys Research]

The technique presented in this paper is designed to capture the beginnings of a situational change not explicitly coded for. With the recognition of a new situation emerging, the immediate task is to discern what is happening and why, while it is unfolding. Further research can be done to discern which elements available from the steady state shift automated analysis would be of value in helping an analyst, business or technical, unravel the genesis of the situation change. By discovering what change information is of value, not only can an automated alert be sent to interested parties, but it can also contain helpful clues on where to start the analysis.

CONCLUSION
It would be an understatement to say that without the right set of systems, methodologies, controls, and checks and balances on data, no enterprise can survive. Big data solves the problem of the vastness and multiplicity of the ever rising information in this information age. What Big data does not address is the complexity associated with real time data analysis. CEP, though designed purely for events, complements the Big data strategy of any enterprise.

The Event Cloud, a constituent component of CEP, can be used for more than its typical application. By treating it as a first class citizen of indicators, and not just a collection point computing construct, a company can gain insight into the early emergence of something new, something previously not considered, and potentially the birthing of an unknown-unknown.

With organizations growing in their usage of Big data, and with the desire to move closer to real time response, companies will inevitably leverage the CEP paradigm. The question will be: do they use it as everyone else does, triggering off of conceived patterns, or will they exploit it for unforeseen situation emergence? When the situation changes, the capability is present and the data is present, but are you?

REFERENCES
1. WSJ article on Big data. Available at http://online.wsj.com/article/SB10000872396390443890304578006252019616768.html.
2. Transaction Processing Council benchmark comparison of leading databases. Available at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp.
3. Transaction Processing Council benchmark comparison of leading databases. Available at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp.
4. Apache Hadoop project site. Available at http://hadoop.apache.org/.
5. IBM Watson, the artificial intelligence supercomputer. Available at http://www-03.ibm.com/innovation/us/watson/.
6. IBM’s Big data initiative. Available at
13. Okamoto, T. and Ishida, Y. (2009), An
http://www-01.ibm.com/software/
Immunity-Based Anomaly Detection
data/bigdata/.
System with Sensor Agents, sensor ISSN
7. Oracle’s Big data initiative. Available
1424-8220.
at http://www.oracle.com/us/
14. Luckham, D. (2002), The Power of
technologies/big-data/index.html.
Events, An Introduction to Complex
8. Teradata Big data Analytics offerings.
Event Processing in Distributed
Available at http://www.teradata.com/
Enterprise Systems, Addison Wesley,
business-needs/Big-Data-Analytics/.
Boston.
9. Luckham, D. and Schulte, R. (2011),
15. Vincent, P. (2011), ACM Overview
Event Processing Glossary – Version 2.0,
of BI Technology misleads on CEP.
Compiled. Available at http://www.
Available at http://www.thetibcoblog.
complexevents.com/2011/08/23/event-
com/2011/07/28/acm-overview-of-bi-
processing-glossary-version-2-0/.
technology-misleads-on-cep/.
10. Bass, T. (2007), What is Complex Event
16. About Esper and NEsper FAQ, http://
Processing? TIBCO Software Inc.
esper.codehaus.org/tutorials/faq_
11. B a s s , T . ( 2 0 1 0 ) , O r w e l l i a n E v e n t
esper/faq.html#what-algorithms.
Processing. Available at http://www.
17. I d e , T . a n d K a s h i m a , H . ( 2 0 0 4 ) ,
thecepblog.com/2010/02/28/orwellian-
Eigenspace-based Anomaly Detection
event-processing/.
in Computer Systems, Tenth ACM
12. Chandola, V., Banerjee, A., and Vipin
SIGKDD International Conference on
Kumar, V. (2009), Anomaly Detection :
Knowledge Discovery and Data Mining,
A Survey, ACM Computing Surveys.
August pp. 22-25.
64
Big Data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja

Validate data quality by employing a structured testing technique

Testing Big data is one of the biggest challenges faced by organizations because of a lack of knowledge about what to test and how much data to test. Organizations have been facing challenges in defining test strategies for structured and unstructured data validation, setting up an optimal test environment, working with non-relational databases and performing non-functional testing. These challenges result in poor quality data in production, delayed implementations and increased cost. A robust testing approach needs to be defined for validating structured and unstructured data, and testing needs to start early, to identify possible defects early in the implementation life cycle and to reduce the overall cost and time to market.

Different testing types like functional and non-functional testing are required, along with strong test data and test environment management, to ensure that the data from varied sources is processed error free and is of good quality to perform analysis. Functional testing activities like validation of the map reduce process, structured and unstructured data validation, and data storage validation are important to ensure that the data is correct and of good quality. Apart from the functional validations, non-functional testing like performance and failover testing plays a key role in ensuring that the whole process is scalable and happens within the specified SLA.

Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop map reduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses Map/Reduce, wherein the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes.
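To ground this, the following is a minimal example of the kind of map reduce job such testing targets: it counts records per key using the standard Hadoop Java API. The class name, delimiter and paths are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecordCountByKey {
    public static class KeyMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumes comma-delimited records whose first field is the key.
            ctx.write(new Text(line.toString().split(",")[0]), ONE);
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "record-count-by-key");
        job.setJarByClass(RecordCountByKey.class);
        job.setMapperClass(KeyMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The validations discussed later, such as checking aggregations when the same job runs on one node versus many, apply directly to jobs of this shape.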
Figure 1 shows the step by step process of how Big data is processed using the Hadoop ecosystem.

[Figure 1: Big Data Testing Focus Areas. (1) Loading source data files into HDFS, (2) performing Map Reduce operations, and (3) extracting the output results from HDFS. Source: Infosys Research]

The first step, loading source data into HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data and tools like Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing the map reduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves extracting the data output generated in the second step and loading it into downstream systems, which can be an enterprise data warehouse for generating analytical reports or any of the transactional systems for further processing.

Testing should be performed at each of the three phases of Big data processing to ensure that the data is getting processed without any errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of the Hadoop Map Reduce process data output; and (iii) validation of the data extract and load into the EDW. Apart from these functional validations, non-functional testing including performance testing and failover testing needs to be performed.

Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.

[Figure 2: Big Data architecture. Source data (web logs, streaming data, social data, transactional RDBMS data) is loaded into HDFS using tools like Sqoop, processed via Map Reduce, Pig, Hive and HBase, and carried through an ETL process into the enterprise data warehouse for reporting with BI tools. Testing focus areas: (1) pre-Hadoop process validation, (2) Map-Reduce process validation, (3) ETL process validation, (4) reports testing, plus non-functional testing (performance, failover). Source: Infosys Research]

BIG DATA TESTING APPROACH
As we are dealing with huge volumes of data and executing on multiple nodes, the chances of having bad data and data quality issues are high at each stage of the process. Data functional testing is performed to identify data issues arising from coding errors or node configuration errors.

Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data, etc., is extracted based on the requirements and loaded into HDFS before processing it further.

Issues: Some of the issues faced during this phase of moving data from source
66
Figure 2: Big Data architecture
The diagram traces data from source systems (web logs, streaming data, social data, transactional RDBMS data) through data load using Sqoop into HDFS, Map Reduce job execution with Pig, Hive and HBase (NoSQL DB), ETL processing into the enterprise data warehouse, and reporting using BI tools. Four testing focus areas are marked on this flow: (1) pre-Hadoop process validation, (2) Map-Reduce process validation, (3) ETL process validation, and (4) reports testing, with non-functional testing (performance, failover testing) spanning the whole pipeline.
Source: Infosys Research
Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data, etc., is extracted based on the requirements and loaded into HDFS before processing it further.

Issues: Some of the issues we face during this phase, as data moves from source systems to Hadoop, are incorrect data captured from source systems, incorrect storage of data, and incomplete or incorrect replication.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Comparing the input data file against source systems data to ensure the data is extracted correctly,
2. Validating the data requirements and ensuring the right data is extracted,
3. Validating that the files are loaded into HDFS correctly, and
4. Validating that the input files are split, moved and replicated in different data nodes.
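A sketch of what scenarios 1 and 3 can look like in practice: the script below compares record counts and per-record checksums between a source extract and the files landed in HDFS. It is a minimal illustration under our own assumptions; the file paths are placeholders, and the HDFS part files are assumed to have been pulled to a local staging area (e.g., via `hdfs dfs -get`) before comparison.

```python
import csv
import glob
import hashlib

def record_digests(paths, key_index=0):
    """Map each record's key to an MD5 digest of the full row."""
    digests = {}
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                digests[row[key_index]] = hashlib.md5(
                    "|".join(row).encode("utf-8")).hexdigest()
    return digests

source = record_digests(["source_extract.csv"])         # placeholder path
landed = record_digests(glob.glob("staging/part-*"))    # staged HDFS files

missing    = source.keys() - landed.keys()   # extracted but never loaded
unexpected = landed.keys() - source.keys()   # loaded but absent at source
corrupted  = [k for k in source.keys() & landed.keys()
              if source[k] != landed[k]]     # loaded with altered content

print("counts: source=%d landed=%d" % (len(source), len(landed)))
print("missing=%d unexpected=%d corrupted=%d"
      % (len(missing), len(unexpected), len(corrupted)))
```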
Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.

Issues: Some issues that we face during this phase of data processing are coding issues in map-reduce jobs; jobs that work correctly when run on a standalone node but incorrectly when run on multiple nodes; incorrect aggregations; node configuration issues; and incorrect output format.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Validating that data processing is completed and the output file is generated,
2. Validating the business logic on a standalone node and then validating it after running against multiple nodes,
3. Validating the map reduce process to verify that key value pairs are generated correctly,
4. Validating the aggregation and consolidation of data after the reduce process,
5. Validating the output data against the source files and ensuring the data processing is completed correctly, and
6. Validating the output data file format and ensuring that the format is per the requirement.
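Scenarios 2 to 4 can be exercised cheaply before any cluster run by driving the map and reduce functions directly in a unit test. The sketch below assumes the word-count job from the earlier streaming example; the expected values are illustrative, not taken from the article.

```python
import io
import sys
import unittest

import job  # the mapper()/reducer() pair from the streaming sketch above

def run_phase(phase, text):
    """Feed text to mapper() or reducer() and capture the emitted lines."""
    sys.stdin, sys.stdout = io.StringIO(text), io.StringIO()
    try:
        phase()
        return sys.stdout.getvalue().splitlines()
    finally:
        sys.stdin, sys.stdout = sys.__stdin__, sys.__stdout__

class MapReduceLogicTest(unittest.TestCase):
    def test_key_value_pairs(self):
        # every emitted pair must have the form "word<TAB>1"
        pairs = run_phase(job.mapper, "big data big tests\n")
        self.assertEqual(pairs, ["big\t1", "data\t1", "big\t1", "tests\t1"])

    def test_aggregation(self):
        # reducer input must be key-sorted, as Hadoop guarantees
        reduced = run_phase(job.reducer, "big\t1\nbig\t1\ndata\t1\n")
        self.assertEqual(dict(line.split("\t") for line in reduced),
                         {"big": "2", "data": "1"})

if __name__ == "__main__":
    unittest.main()
```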
Validation of Data Extract, and Load into EDW
Once the map-reduce process is completed and the data output files are generated, this processed data is moved to the enterprise data warehouse or any other transactional system, depending on the requirement.

Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect load of HDFS files into the EDW, and incomplete data extract from Hadoop HDFS.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Validating that transformation rules are applied correctly,
2. Validating that there is no data corruption by comparing target table data against HDFS files data,
3. Validating the data load in the target system,
4. Validating the aggregation of data, and
5. Validating the data integrity in the target system.
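Scenario 2, comparing target table data against the HDFS file data, can be sketched as a full outer join on the business key. Everything concrete below (the staged file layout, the warehouse query, the column names, and SQLite as a stand-in for the real warehouse driver) is our own illustrative assumption.

```python
import sqlite3            # stand-in for the real EDW driver (ODBC/JDBC etc.)
import pandas as pd

# processed output moved out of HDFS, staged locally as CSV (assumed layout)
hdfs_df = pd.read_csv("staging/output.csv", dtype=str)

# the same rows as loaded into the EDW target table (illustrative query)
conn = sqlite3.connect("warehouse.db")
edw_df = pd.read_sql("SELECT order_id, amount, status FROM fact_orders",
                     conn).astype(str)

# a full outer join on the business key exposes missing and extra rows alike
merged = hdfs_df.merge(edw_df, on="order_id", how="outer",
                       suffixes=("_hdfs", "_edw"), indicator=True)

missing_in_edw = merged[merged["_merge"] == "left_only"]
extra_in_edw   = merged[merged["_merge"] == "right_only"]
both = merged[merged["_merge"] == "both"]
mismatched = both[(both["amount_hdfs"] != both["amount_edw"]) |
                  (both["status_hdfs"] != both["status_edw"])]

print(len(missing_in_edw), "rows missing in EDW")
print(len(extra_in_edw), "unexpected rows in EDW")
print(len(mismatched), "rows with corrupted values")
```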
Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from the EDW or by running queries on Hive.

Issues: Some of the issues faced while generating reports are the report definition not being set as per the requirement, report data issues, and layout and format issues.

Validations: Some high level validations performed during this phase include:

Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.

Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring that all objects are rendered properly and that the resources on the webpage are current. The data fetched from the various web parts is validated against the databases.
VOLUME, VARIETY AND VELOCITY: HOW TO TEST?
In the earlier sections we have seen step by step details of what needs to be tested at each phase of Big data processing. During these phases the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.

Volume: The amount of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year [3]. Huge volumes of data flow from multiple systems and need to be processed and analyzed. When it comes to validation, it is a big challenge to ensure that the whole data set processed is correct. Manually validating the whole data is a tedious task, so compare scripts should be used to validate the data. As data stored in HDFS is in file format, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools, a 100% data comparison takes a lot of time. To reduce the execution time we can either run all the comparison scripts in parallel on multiple nodes, just as data is processed by the Hadoop map-reduce process, or sample the data while ensuring maximum scenarios are covered.

Figure 3 shows the approach for comparing voluminous amounts of data. Data is converted into the expected result format and then compared against the actual data using compare tools. This is a faster approach but involves initial scripting time; it also reduces the regression testing cycle time. When there is not enough time to validate the complete data, sampling can be done for validation.

Variety: The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data.

Structured data is data in a defined format coming from different RDBMS tables or from structured files. Data that is of a transactional nature can be handled in files or tables for validation purposes.
Figure 3: Approach for High Volume Data Validation
Map reduce jobs are run in the test environment to generate the output data files (the actual results). In parallel, custom scripts convert unstructured data to structured data, and further scripts convert the raw structured data to the expected results format. A compare tool then performs a file-by-file comparison of expected against actual results and produces a discrepancy report.
Source: Infosys Research
Semi-structured data does not have any defined format, but structure can be derived from the multiple patterns of the data. An example is data extracted by crawling through different websites for analysis purposes. For validation, the data needs to be first transformed into a structured format using custom built scripts. First the pattern needs to be identified, then copy books or pattern outlines need to be prepared, and later these copy books are used in scripts to convert the incoming data into a structured format, after which validations are performed using compare tools.

Unstructured data is data that does not have any format and is stored in documents or web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting like Pig scripting, as shown in Figure 3. But the overall coverage achieved through automation will be very low because of the unexpected behavior of the data; input data can be in any form and changes every time a new test is performed. We therefore need to deploy a business scenario validation strategy for unstructured data: identify the different scenarios that can occur in day to day unstructured data analysis, then set up and execute test data based on those scenarios. A pattern-based conversion of this kind is sketched below.
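The following is a minimal sketch of such a pattern-driven conversion, assuming web server log lines as the unstructured input. The regular expression plays the role of the "copy book" or pattern outline; the field names and file names are our own illustrative choices.

```python
import csv
import re

# the "copy book": one pattern describing the expected shape of a log line
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def to_structured(lines):
    """Split incoming lines into structured records and unmatched rejects."""
    records, rejects = [], []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            records.append(m.groupdict())
        else:
            rejects.append(line)   # candidates for a new pattern/scenario
    return records, rejects

with open("access.log") as f:      # placeholder input file
    records, rejects = to_structured(f)

# the structured output can now be validated with ordinary compare tools
with open("access_structured.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["host", "timestamp", "method",
                                             "path", "status", "bytes"])
    writer.writeheader()
    writer.writerows(records)

print("converted=%d unmatched=%d" % (len(records), len(rejects)))
```

The reject list is the important design point here: every line the pattern cannot explain is surfaced as a candidate for a new business scenario rather than silently dropped.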
Velocity: The speed at which new data is being created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance to overcome performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and in verifying that the system can handle high velocity streaming data.

NON-FUNCTIONAL TESTING
In the earlier sections we have seen how functional testing is performed at each phase of Big data processing; those tests are performed to identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.

Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance is degraded. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost. Hence performance testing plays a key role in any Big data project, given the huge volume of data and the complex architecture.

Some of the areas where performance issues can occur are imbalance in input splits, redundant shuffles and sorts, and aggregation computations left in the reduce process that could be done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and doing performance tests to identify the bottlenecks.

Performance testing is conducted by setting up a huge volume of data and an infrastructure similar to production. Utilities like the Hadoop performance monitoring tool can be used to capture the performance metrics and identify the issues. Performance metrics like job completion time and throughput, and system level metrics like memory utilization, are captured as part of performance testing.
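The job-level metrics just described can be captured with a small harness around the job submission command. This is a generic sketch of ours: the submission command, class name and record count are placeholders, and system-level metrics would come from the cluster's monitoring tooling rather than from this script.

```python
import subprocess
import time

def timed_run(cmd, input_records):
    """Measure job completion time and derive throughput for a batch job."""
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)   # fails loudly on error
    elapsed = time.time() - start
    return {"completion_time_s": round(elapsed, 1),
            "throughput_rec_per_s": round(input_records / elapsed, 1)}

# placeholder job submission command and input volume
metrics = timed_run("hadoop jar job.jar com.example.Job /data/in /data/out",
                    input_records=50_000_000)
print(metrics)
```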
Failover Testing: The Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, each of them connected. There are chances of node failure in which some of the HDFS components become non-functional; the failures can be name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover to proceed with the processing.

Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing happens seamlessly when switched to other data nodes.

Some validations that need to be performed during failover testing are: validating that checkpoints of the edit logs and FsImage of the name node happen at the defined intervals; recovery of the edit logs and FsImage files of the name node; no data corruption because of a name node failure; data recovery when a data node fails; and validating that replication is initiated when a data node fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing.
TEST ENVIRONMENT SETUP
As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud also helps in optimizing the infrastructure and achieving faster time to market. Key steps involved in setting up the environment on the cloud are [6]:

A. Big data test infrastructure requirement assessment

1. Assess the Big data processing requirements
2. Evaluate the number of data nodes required in the QA environment
3. Understand the data privacy requirements to evaluate private or public cloud
4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, file system to be used, NoSQL DBs, etc.).

B. Big data test infrastructure design

1. Document the high level cloud test infrastructure design (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the SLAs, communication plan, maintenance plan and environment refresh plan
4. Document the data security plan
5. Document the high level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third party tools required.
C. Big data test infrastructure implementation and maintenance

■■ Create a cloud instance of the Big data test environment
■■ Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design
■■ Perform a smoke test on the environment by processing sample map reduce and Pig/Hive jobs (see the sketch below)
■■ Deploy the code to perform testing.
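A smoke test of this kind can be as simple as scripting a few shell commands against the freshly built environment. The sketch below assumes a conventional Hadoop install where the bundled examples jar is available; the jar path and HDFS directories are placeholders of ours.

```python
import subprocess

def run(cmd):
    """Run a shell command, echo it, and fail loudly on a non-zero exit."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. HDFS is up and writable
run("hdfs dfs -mkdir -p /smoke/in")
run("echo 'hello big data' > /tmp/smoke.txt && "
    "hdfs dfs -put -f /tmp/smoke.txt /smoke/in/")

# 2. a sample map reduce job runs end to end (bundled example job)
run("hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount "
    "/smoke/in /smoke/out")

# 3. output was produced and is readable
run("hdfs dfs -cat /smoke/out/part-*")

# 4. clean up so the smoke test is repeatable
run("hdfs dfs -rm -r /smoke")
print("smoke test passed")
```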
BEST PRACTICES
Data Quality: It is very important to establish the data quality requirements for the different forms of data, like traditional data sources, data from social media, data from sensors, etc. If the data quality is ascertained, the transformation logic alone can be tested by executing tests against all possible data sets.

Data Sampling: Data sampling gains significance in Big data implementations, and it becomes the tester's job to identify suitable sampling techniques that include all critical business scenarios and the right test data set. One such technique is sketched below.
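Reservoir sampling is one technique that suits this setting, because it draws a uniform random sample in a single pass without knowing the data volume in advance. Using it here, and the file name and sample size, are our own illustrative choices rather than a prescription from the article.

```python
import random

def reservoir_sample(lines, k, seed=42):
    """Keep a uniform random sample of k lines from a stream of any length."""
    random.seed(seed)                  # fixed seed: reproducible test data
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = random.randint(0, i)   # replace with decreasing probability
            if j < k:
                reservoir[j] = line
    return reservoir

with open("staging/part-00000") as f:  # placeholder HDFS extract
    sample = reservoir_sample(f, k=10_000)
print("sampled", len(sample), "records for validation")
```

In practice this would be combined with scenario-based stratification, so that rare but critical business cases are always represented in the sampled test data set.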
Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated, hence an automated regression test suite should be built and used after each release. This will save a lot of time during Big data validations.

CONCLUSION
Data quality challenges can be countered by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve the testing quality, which will help in identifying defects early and reduce the overall cost of the implementation. Organizations need to invest in building skillsets in both development and testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skillset including coding, white-box testing skills and data analysis skills, for them to do a better job of identifying quality issues in data.

REFERENCES
1. Big data overview, Wikipedia.org. Available at http://en.wikipedia.org/wiki/Big_data.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big Data: Hadoop, Business Analytics and Beyond, A Big Data Manifesto from the Wikibon Community, Mar 2012. Available at http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at https://community.informatica.com/solutions/1998.
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 annual technical conference, June 2009. Available at http://static.usenix.org/event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at http://www.infosys.com/IT-services/independent-validation-testing-services/Pages/cloud-based-QA-environments.aspx.
Infosys Labs Briefings VOL 11 NO 1 2013
Nature Inspired Visualization of Unstructured Big Data By Aaditya Prakash
Reconstruct self-organizing maps as spider graphs for better visual interpretation of large unstructured datasets
Exponential growth of data capturing devices has led to an explosion of available data. Unfortunately, not all available data is in a database friendly format. Data which cannot be easily categorized, classified or imported into a database is termed unstructured data. Unstructured data is ubiquitous and is assumed to be around 80% of all data generated [1]. While tremendous advancements have taken place in analyzing, mining and visualizing structured data, the field of unstructured data, especially unstructured Big data, is still at a nascent stage.

Lack of recognizable structure and huge size make it very challenging to work with unstructured large datasets. Classical visualization methods limit the amount of information presented and are asymptotically slow with rising dimensions of the data. We present here a model to mitigate these problems and allow efficient and vast visualization of large unstructured datasets.

A novel approach in unsupervised machine learning is Self-Organizing Maps (SOM). Along with classification, SOMs have the added benefit of dimensionality reduction. SOMs are also used for visualizing multidimensional data as a 2D planar diffusion map. This achieves data reduction, thus enabling visualization of large datasets.

Present models used to visualize SOM maps lack any deductive ability, which may be defeating the power of SOM. We introduce a better restructuring of SOM trained data for more meaningful interpretation of very large data sets.

Taking inspiration from nature, we model the large unstructured dataset into spider cobweb type graphs. This has the benefit of allowing multivariate analysis, as different variables can be presented in one spider graph and their inter-variable relations can be projected, which cannot be done with classical SOM maps.
UNSTRUCTURED DATA
Unstructured data comes in different formats and sizes. Broadly, textual data, sound, video, images, webpages, logs, emails, etc., are categorized as unstructured data. In some cases even a bundle of numeric data could be collectively unstructured, e.g., the health records of a patient: while a table of the cholesterol levels of all patients is more structured, all the biostats of a single patient are largely unstructured.

Unstructured data could be of any form and could contain any number of independent variables. Labeling, as done in machine learning, is only possible with data where information about variables such as size, length, dependency, precision, etc., is known. Even extraction of the underlying information in a cluster of unstructured data is very challenging, because it is not known what is to be extracted [2]. Yet the hidden analytics potential within unstructured large datasets could be a valuable asset to any business or research entity. Consider the case of the Enron emails (collected and prepared by the CALO project). Emails are primarily unstructured, mostly because people often reply above the last email even when the new email's content and purpose might be different. Therefore most organizations do not analyze emails or logs, but several researchers analyzed the Enron emails, and their results show that a lot of predictive and analytical information could be obtained from them [3, 4, 5].

SELF ORGANIZING MAPS
The ability to harness increased computing power has been a great boon to business. From traditional business analytics to machine learning, the knowledge we get from data is invaluable. With computing forecast to get faster, maybe through quantum computing someday, an even greater role for data is promised. While there has been a lot of effort to bring some structure into unstructured data [6], the cost of doing so has been the hindrance. With larger datasets the problem is even greater, as they entail more randomness and unpredictability in the data.

Self-Organizing Maps (SOM) are a class of artificial neural networks proposed by Teuvo Kohonen [7] that transform the input dataset into a two dimensional lattice, also called a Kohonen Map.

Structure
All the points of the input layer are mapped onto a two dimensional lattice, called the Kohonen Network. Each point in the Kohonen Network is potentially a neuron.

Figure 1: Kohonen Network
Source: Infosys Research
Competition of Neurons
Once the Kohonen Network is complete, the neurons of the network compete according to the weights assigned from the input layer. The function used to declare the winning neuron is the simple Euclidean distance between the input point and its corresponding weight for each of the neurons. This function, called the discriminant function, is represented as

$$d_j(\mathbf{x}) = \sqrt{\sum_{i} \left( x_i - w_{ij} \right)^2}$$

where
x = point on the input layer
w = weight of the input point (x)
i = index over all the input points
j = index over all the neurons on the lattice
d = Euclidean distance

Simply put, the winning neuron is the one whose weight is closest (in lattice distance) to the input point. This process effectively discretizes the output layer.

Cooperation of Neighboring Neurons
Once the winning neuron is found, the topological structure can be determined. Similar to the behavior of human brain cells (neurons), the winning neuron also excites its neighbors. Thus the topological structure is determined by the cooperative weights of the winning neuron and its neighbors.

Self-Organization
The process of selecting winning neurons and forming the topological structure is adaptive. The process runs multiple times to converge on the best mapping of the given input layer. SOM is better than other clustering algorithms in that it requires very few repetitions to get to a stable structure.

Parallel SOM for Large Datasets
Among all classifying machine learning algorithms, the convergence speed of the SOM has been found to be the fastest [8]. This implies that for large data sets SOM is the best viable model. Since the formation of the topological structuring is independent of the input points, it can easily be parallelized. Carpenter et al. have demonstrated the ability of SOM to work under massively parallel processing [9]. Kohonen himself has shown that even where the input data may not be in vector form, as found in some unstructured data, large scale SOM can be run nonetheless [10].
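To make the compete/cooperate/adapt loop concrete, here is a minimal from-scratch SOM sketch in Python with NumPy. This is an illustration of the algorithm described above, not the code behind the paper's figures (those were produced with R packages); the grid size, learning rate and decay schedule are our own arbitrary choices.

```python
import numpy as np

def train_som(data, rows=10, cols=10, iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a self-organizing map on data of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # grid coordinates of every neuron, used by the neighborhood function
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(len(data))]           # random input point
        # competition: the winner minimizes the discriminant d_j(x)
        dist = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(np.argmin(dist), dist.shape)
        # cooperation: neighbors of the winner are excited too
        frac = t / iters
        sigma = sigma0 * (1.0 - frac) + 1e-3
        h = np.exp(-np.sum((grid - np.array(winner)) ** 2, axis=2)
                   / (2 * sigma ** 2))
        # adaptation: pull weights toward x, scaled by neighborhood strength
        lr = lr0 * (1.0 - frac)
        weights += lr * h[..., None] * (x - weights)
    return weights

# toy usage: map word-frequency-style columns onto a 10x10 lattice
data = np.random.rand(500, 4)   # stand-in for four word-frequency variables
som = train_som(data)
print(som.shape)                # (10, 10, 4): one weight vector per neuron
```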
SOM PLOTS
SOM plots are a two dimensional representation of the topological structure obtained after training the neural nets for a given number of repetitions and with a given radius. The SOM can be visualized as a complete 2-D topological structure [Fig. 2].

Figure 2: SOM Visualization using Rapidminer (AGPL Open Source)
Source: Infosys Research

Figure 2 shows the overall topological structure obtained after dimensionality reduction of a multivariate dataset. While such a graph may be useful for outlier detection or general categorization, it is not very useful for the analysis of individual variables.

Another option for visualizing SOM is to plot different variables in a grid format. One can use the R programming language (GNU Open Source) to plot the SOM results.

Note on the running example
All the plots presented henceforth have been obtained using the R programming language. The dataset used is the SPAM Email Database, which is in the public domain and freely available for research at the UCI Machine Learning Repository. It contains 266,858 word instances from 4,601 SPAM emails. Emails are a good example of unstructured data.

Using the public packages in R, we obtain the SOM plots. Figure 3 is the plot of the SOM trained result using the package 'kohonen' [11]. This plot gives inter-variable analysis, the variables in this case being four of the most used words in the SPAM database, viz. 'order', 'credit', 'free' and 'money'. While this plot is better than the topological plot given in Figure 2, it is still difficult to interpret the result in a canonical sense.

Figure 3: SOM Visualization in R using the Package 'kohonen'
Source: Infosys Research

Figure 4 is again the SOM plot of the four most common words in the SPAM database given above, but this one uses the package called 'som' [12]. While this plot is numerical and gives the strength of the inter-variable relationship, it does not help in giving us the analytical picture. The information obtained is not actionable.

Figure 4: SOM Visualization in R using the Package 'som'
Source: Infosys Research
SPIDER PLOTS OF SOM
As we have seen in Figures 2, 3 and 4, the current visualization of SOM output could be improved for more analytical ability. We introduce a new method to plot SOM output, especially designed for large datasets.

Algorithm
1. Filter the results of SOM.
2. Make a polygon with as many sides as the variables in the input.
3. Make the radius of the polygon the maximum of the values in the dataset.
4. Draw the grid for the polygon.
5. Make segments inside the polygon if the strength of the two variables inside the segment is greater than the specified threshold.
6. Loop step 5 for every variable against every other variable.
7. Color the segments based on the frequency of the variable.
8. Color the line segments based on the threshold of each variable pair plotted.

A sketch of this construction in code follows.
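The following is a minimal sketch of steps 2 to 6 in Python with matplotlib. The paper's plots were produced in R, so this reimplementation is our own; in particular, the threshold value and the pairwise "strength" defined as correlation between variables are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def spider_plot(data, names, threshold=0.3):
    """Polygon with one vertex per variable; chords join strongly related pairs."""
    n = len(names)
    radius = data.max()                       # step 3: radius = max value
    angles = 2 * np.pi * np.arange(n) / n
    verts = radius * np.column_stack([np.cos(angles), np.sin(angles)])

    fig, ax = plt.subplots(figsize=(6, 6))
    outline = np.vstack([verts, verts[:1]])   # step 2: close the polygon
    ax.plot(outline[:, 0], outline[:, 1], color="grey")   # step 4: grid
    for (x, y), name in zip(verts, names):
        ax.annotate(name, (x * 1.08, y * 1.08), ha="center")

    strength = np.corrcoef(data.T)            # assumed inter-variable strength
    for i in range(n):                        # steps 5-6: pairwise segments
        for j in range(i + 1, n):
            if strength[i, j] > threshold:
                ax.plot(*zip(verts[i], verts[j]),
                        alpha=min(1.0, abs(strength[i, j])))
    ax.set_aspect("equal")
    ax.axis("off")
    return fig

# toy usage with stand-in word-frequency columns
rng = np.random.default_rng(1)
freq = rng.random((500, 4))
spider_plot(freq, ["order", "credit", "free", "money"]).savefig("spider.png")
```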
Plots
Figure 5: SOM Visualization in R Using the Above Algorithm, Showing Segments, i.e., Inter-variable Dependency
Source: Infosys Research

As we can see, this plot is more meaningful than the SOM visualization plots obtained before. From the figure we can easily deduce that the words 'free' and 'order' do not have a relation similar to that of 'credit' and 'money'. Understandably so, because if a Spam email is selling something it will probably contain the word 'order', and conversely, if it is advertising some product or software for 'free' download then it wouldn't have the word 'order' in it. The high relationship between 'credit' and 'money' signifies Spam emails advertising better 'Credit Score' programs and other marketing traps.

Figure 6: SOM Visualization in R Using the Above Algorithm, Showing Threads, i.e., Inter-variable Strength
Source: Infosys Research

Figure 6 shows the relationship of each variable, in this case four popular recurring words in the Spam database. The number of threads between one variable and another shows the probability of the second variable given the first variable. The several threads between 'free' and 'credit' suggest that Spam emails offering 'free credit' (disguised in other forms through fees or deferred interest) are among the most popular.

Figure 7: Spider Plot Showing 25 Sampled Words from the Spam Database
Source: Infosys Research

Using these spider plots we can analyze several variables at once. This may cause the graph to be messy, but sometimes we need to see the complete picture in order to make canonical decisions about the dataset. From Figure 7 we can see that even though the figure shows 25 variables, it is not as cluttered as a scatter plot or bar chart would be if plotted with 25 variables.
Figure 8: Uncolored Representation of Threads in Six Variables
Source: Infosys Research

Figure 8 shows the different levels of strength between different variables. While the 'contact' variable is strong with 'need' but not strong enough with 'help', it is no surprise that 'you' and 'need' are strong. Here the idea is only to present the visualization technique and not an analysis of the Spam dataset; for more analysis on Spam filtering one may refer to several independent works on the same [13, 14].

ADVANTAGES
There are several visual and non-visual advantages of using this new plot over the existing plots. This plot has been designed to handle Big data. Most of the existing plots mentioned above are limited in their capacity to scale: principally, if the range of the data is large, most of the existing plots tend to get skewed and important information is lost. By normalizing the data, this new plot prevents this issue. Allowing multiple dimensions to be incorporated also enables the recognition of indirect relationships.

CONCLUSION
While unstructured data is abundant, free and hidden with information, the tools for analyzing it are still nascent and the cost of converting it to structured form is very high. Machine learning is used to classify unstructured data but comes with issues of speed and space constraints. SOMs are the fastest machine learning algorithms, but their visualization powers are limited. We have presented a naturally intuitive method to visualize SOM outputs which facilitates multi-variable analysis and is also highly scalable.
REFERENCES
1. Grimes, S., Unstructured data and the 80 percent rule. Available at http://clarabridge.com/default.aspx?tabid=137.
2. Doan, A., Naughton, J. F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F. and Vuong, B. Q. (2009), Information extraction challenges in managing unstructured data, ACM SIGMOD Record, vol. 37, no. 4, pp. 14-20.
3. Diesner, J., Frantz, T. L. and Carley, K. M. (2005), Communication networks from the Enron email corpus "It's always about the people. Enron is no different", Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 201-228.
4. Chapanond, A., Krishnamoorthy, M. S. and Yener, B. (2005), Graph theoretic and spectral analysis of Enron email data, Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 265-281.
5. Peterson, K., Hohensee, M. and Xia, F. (2011), Email formality in the workplace: A case study on the Enron corpus, in Proceedings of the Workshop on Languages in Social Media, pp. 86-95, Association for Computational Linguistics.
6. Buneman, P., Davidson, S., Fernandez, M. and Suciu, D. (1997), Adding structure to unstructured data, Database Theory (ICDT '97), pp. 336-350.
7. Kohonen, T. (1990), The self-organizing map, Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480.
8. Waller, N. G., Kaiser, H. A., Illian, J. B. and Manry, M. (1998), A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms, Psychometrika, vol. 63, no. 1, pp. 5-22.
9. Carpenter, G. A. and Grossberg, S. (1987), A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp. 54-115.
10. Kohonen, T. and Somervuo, P. (2002), How to make large self-organizing maps for non-vectorial data, Neural Networks, vol. 15, no. 8, pp. 945-952.
11. Wehrens, R. and Buydens, L. M. C. (2007), Self- and Super-organizing Maps in R: The kohonen Package, Journal of Statistical Software, vol. 21, no. 5, pp. 1-19.
12. Yan, J. (2012), Self-Organizing Map (with application in gene clustering) in R. Available at http://cran.r-project.org/web/packages/som/som.pdf.
13. Dasgupta, A., Gurevich, M. and Punera, K. (2011), Enhanced email spam filtering through combining similarity graphs, in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 785-794.
14. Cormack, G. V. (2007), Email spam filtering: A systematic review, Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335-455.
Index

Automated Content Discovery 48, 49
Big Data
    Analytics 4-8, 19, 24, 40-43, 45, 67
    Lifecycle 21
    Medical Engine 42-44
    Value, also BDV 27, 29
Campaign Management 31, 32
Common Warehouse Meta-Model, also CWM 7
Communication Service Providers, also CSPS 27
Complex Event Processing, also CEP 53-63
Content
    Processing Workflows 50
    Publishing Lifecycle Management, also CPLM 48
    Management System, also CMS 30, 48, 51
Contingency Funding Planning, also CFP 36
Customer
    Dynamics 19-21, 25
    Relationship 28, 30
Data Warehouse 4-5, 30, 38-39, 66, 68
Enterprise Service Bus, also ESB 30
Event Driven
    Process Automation
    Architecture, also EDA 30-31
Experience Personalization 31
Extreme Content Hub, also ECH 47-51
Global Positioning Service, also GPS 10, 13, 17, 54, 56
Management
    Business Process, also BPM 30
    Custom Relationship, also CRM 28-30
    Information 3, 56-57
    Liquidity Risk, also LRM 35-40
    Master Data 5-6
    Offer 32
    Order 30
    Retention 31, 32
Metadata
    Discovery 6-7
    Extractor 50
    Governance 6-7
    Management 3-8
Net Interest Income Analysis, also NIIA 37
Predictive
    Intelligence 19
    Modeling 32
    Analytics 54
Service Management 31, 33
Supply Chain Planning 9-12, 53
Un-Structured Content Extractor 50
Web Analytics 21
Infosys Labs Briefings BUSINESS INNOVATION through TECHNOLOGY
Editor Praveen B Malla PhD
Editorial Office: Infosys Labs Briefings, B-19, Infosys Ltd. Electronics City, Hosur Road, Bangalore 560100, India Email: [emailprotected] http://www.infosys.com/infosyslabsbriefings
Deputy Editor: Yogesh Dandawate
Graphics & Web Editors: Rakesh Subramanian, Chethana M G, Vivek Karkera
IP Manager: K V R S Sarma
Marketing Manager: Gayatri Hazarika
Online Marketing: Sanjay Sahay
Production Manager: Sudarshan Kumar V S
Database Manager: Ramesh Ramachandran
Distribution Managers: Santhosh Shenoy, Suresh Kumar V H
Infosys Labs Briefings is a journal published by Infosys Labs with the objective of offering fresh perspectives on boardroom business technology. The publication aims at becoming the most sought after source for thought leading, strategic and experiential insights on business technology management. Infosys Labs is an important part of Infosys' commitment to leadership in innovation using technology. Infosys Labs anticipates and assesses the evolution of technology and its impact on businesses, and enables Infosys to constantly synthesize what it learns and catalyze technology enabled business transformation, and thus assume leadership in providing best of breed solutions to clients across the globe. This is achieved through research supported by state-of-the-art labs and collaboration with industry leaders.

About Infosys: Many of the world's most successful organizations rely on Infosys to deliver measurable business value. Infosys provides business consulting, technology, engineering and outsourcing services to help clients in over 32 countries build tomorrow's enterprise.
How to Reach Us: Email: [emailprotected]
For more information about Infosys (NASDAQ:INFY), visit www.infosys.com
Phone: +91 40 44290563 Post: Infosys Labs Briefings, B-19, Infosys Ltd. Electronics City, Hosur Road, Bangalore 560100, India Subscription: [emailprotected] Rights, Permission, Licensing and Reprints: [emailprotected]
© Infosys Limited, 2013 Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained herein or to any derived results obtained by the recipient from the use of the information in this document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.
Big data systems are expected to help in the analysis of structured and unstructured data and hence are drawing huge investments. Analysts have estimated that enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, and large storage and search technologies. Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In the first issue of 2013, we bring to you papers that discuss how Big data analytics can make a significant impact on several industry verticals like medical, retail and IT, and how enterprises can harness the value of Big data. Like always, do let us know your feedback about the issue.

Happy Reading,
Yogesh Dandawate
Deputy Editor
[emailprotected]

BILL PEER is a Principal Technology Architect with Infosys Labs. He can be reached at [emailprotected].

AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys. He can be contacted at [emailprotected].

ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [emailprotected].

GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at [emailprotected].

KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted at [emailprotected].

MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at [emailprotected].

NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at [emailprotected].

NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting and Systems Integration wing of the FSI business unit of Infosys. He can be reached at [emailprotected].

NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at [emailprotected].

PERUMAL BABU is a Senior Technology Architect with the RCL business unit of Infosys. He can be reached at [emailprotected].

PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be contacted at [emailprotected].

PRASANNA RAJARAMAN is a Senior Project Manager with the RCL business unit of Infosys. He can be reached at [emailprotected].

SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys' Retail & Logistics Consulting Group. He can be contacted at [emailprotected].

SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at [emailprotected].

SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice under the Cloud Unit of Infosys. He can be reached at [emailprotected].

ZHONG LI PhD is a Principal Architect with the Consulting and System Integration Unit of Infosys. He can be contacted at [emailprotected].