
Infosys Labs Briefings
VOL 11 NO 1 2013

BIG DATA: CHALLENGES AND OPPORTUNITIES

“At Infosys Labs, we constantly look for opportunities to leverage technology while creating and implementing innovative business solutions for our clients. As part of this quest, we develop engineering methodologies that help Infosys implement these solutions right, first time and every time.”

Subu Goparaju
Senior Vice President and Head of Infosys Labs

For information on obtaining additional copies, reprinting or translating articles, and all other correspondence, please contact: Email: [emailprotected]

© Infosys Limited, 2013

Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue of Infosys Labs Briefings. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained in this document or to any derived results obtained by the recipient from the use of the information in the document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.


Big Data: Countering Tomorrow’s Challenges

Big data was the watchword of 2012. Even before one could understand what it really meant, it began getting tossed about in huge doses in almost every other analyst report. Today, the World Wide Web hosts upwards of 800 million webpages, each trying either to educate or to build a perspective on the concept of Big data. Technology enthusiasts believe that Big data is ‘the’ next big thing after cloud. Big data is of late being adopted across industries with great fervor. In this issue we explore what the Big data revolution is and how it will likely help enterprises reinvent themselves.

As citizens of this digital world, we generate more than 200 exabytes of information each year, the equivalent of 20 million Libraries of Congress. According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook logins, 204 million email exchanges, and more than 2 million search queries fired. At the scale at which data is being churned out, processing it is beyond human capability, and hence the need for machine processing of information. There is no dearth of data for today’s enterprises. On the contrary, they are mired in data, and quite deeply at that. The focus today is therefore on discovery, integration, exploitation and analysis of this overwhelming information. Big data may be construed as the technological intervention to undertake this challenge.

Because Big data systems are expected to help analyze structured as well as unstructured data, they are drawing huge investments. Analysts estimate that enterprises will spend more than US$120 billion on analysis systems by 2015. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, large-scale storage and search technologies. Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In this first issue of 2013, we bring you papers that discuss how Big data analytics can make a significant impact on several industry verticals such as healthcare, retail and IT, and how enterprises can harness the value of Big data. As always, do let us know your feedback about the issue.

Happy Reading,

Yogesh Dandawate
Deputy Editor
[emailprotected]

Infosys Labs Briefings Advisory Board

Anindya Sircar PhD, Associate Vice President & Head - IP Cell
Gaurav Rastogi, Vice President, Head - Learning Services
Kochikar V P PhD, Associate Vice President, Education & Research Unit
Raj Joshi, Managing Director, Infosys Consulting Inc.
Ranganath M, Vice President & Chief Risk Officer
Simon Towers PhD, Associate Vice President and Head - Center of Innovation for Tomorrow’s Enterprise, Infosys Labs
Subu Goparaju, Senior Vice President & Head - Infosys Labs

Authors featured in this issue

AADITYA PRAKASH is a Senior Systems Engineer with the FNSP unit of Infosys. He can be reached at [emailprotected].
ABHISHEK KUMAR SINHA is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [emailprotected].
AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys. He can be contacted at [emailprotected].
ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [emailprotected].
BILL PEER is a Principal Technology Architect with Infosys Labs. He can be reached at [emailprotected].
GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at [emailprotected].
KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted at [emailprotected].
MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at [emailprotected].
NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at [emailprotected].
NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting and Systems Integration wing of the FSI business unit of Infosys. He can be reached at [emailprotected].
NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at [emailprotected].
PERUMAL BABU is a Senior Technology Architect with the RCL business unit of Infosys. He can be reached at [emailprotected].
PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be contacted at [emailprotected].
PRASANNA RAJARAMAN is a Senior Project Manager with the RCL business unit of Infosys. He can be reached at [emailprotected].
SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys’ Retail & Logistics Consulting Group. He can be contacted at [emailprotected].
SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at [emailprotected].
SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice under the Cloud Unit of Infosys. He can be reached at [emailprotected].
ZHONG LI PhD is a Principal Architect with the Consulting and System Integration Unit of Infosys. He can be contacted at [emailprotected].

Infosys Labs Briefings VOL 11 NO 1 2013

Opinion: Metadata Management in Big Data (page 3)
By Gautham Vemuganti
Any enterprise that is in the process of or considering Big data applications deployment has to address the metadata management problem. The author proposes a metadata management framework to realize Big data analytics.

Trend: Optimization Model for Improving Supply Chain Visibility (page 9)
By Saravanan Balaraj
The paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

Discussion: Retail Industry – Moving to Feedback Economy (page 19)
By Prasanna Rajaraman and Perumal Babu
Big data analysis of customers’ preferences can help retailers gain a significant competitive advantage, suggest the authors.

Perspective: Harness Big Data Value and Empower Customer Experience Transformation (page 27)
By Zhong Li PhD
Always-on digital customers continuously create more data of various types. Enterprises are analyzing this heterogeneous data to understand customer behavior, spend and social media patterns.

Framework: Liquidity Risk Management and Big Data: A New Challenge for Banks (page 35)
By Abhishek Kumar Sinha
Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). The author proposes an iterative framework for effective liquidity risk management.

Model: Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor (page 41)
By Anil Radhakrishnan and Kiran Kalmadi
In this paper the authors describe how Big data analytics can play a significant role in the early detection and diagnosis of fatal diseases, reduction in healthcare costs and improved quality of healthcare administration.

Approach: Big Data Powered Extreme Content Hub (page 47)
By Sudheeshchandran Narayanan and Ajay Sadhu
With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. This paper talks about the need for an Extreme Content Hub to tame the Big data explosion.

Insight: Complex Events Processing: Unburdening Big Data Complexities (page 53)
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
Complex Event Processing along with in-memory data grid technologies can help in pattern detection, matching, analysis, processing and split-second decision making in Big data scenarios, opine the authors.

Practitioners’ Perspective: Big Data: Testing Approach to Overcome Quality Challenges (page 65)
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
This paper suggests the need for a robust testing approach to validate Big data systems and identify possible defects early in the implementation life cycle.

Research: Nature Inspired Visualization of Unstructured Big Data (page 73)
By Aaditya Prakash
Classical visualization methods are falling short in accurately representing multidimensional and ever-growing Big data. Taking inspiration from nature, the author proposes a spider-cobweb visualization technique for Big data.

Index

“A robust testing approach needs to be defined for validating structured and unstructured data to identify possible defects early in the implementation life cycle.”

Naju D. Mohan Delivery Manager, RCL Business Unit Infosys Ltd.

“Big Data augmented with Complex Event Processing capabilities can provide solutions utilizing in-memory data grids for analyzing trends, patterns and events in real time.”

Bill Peer Principal Technology Architect Infosys Labs, Infosys Ltd.


Metadata Management in Big Data By Gautham Vemuganti

Big data analytics must reckon with the importance and criticality of metadata

Big data, true to its name, deals with large volumes of data characterized by volume, variety and velocity. Any enterprise that is in the process of or considering a Big data applications deployment has to address the metadata management problem. Traditionally, much of the data that business users use is structured. This, however, is changing with the exponential growth of data, or Big data.

Metadata defining this data, however, is spread across the enterprise in spreadsheets, databases, applications and even in people’s minds (the so-called “tribal knowledge”). Most enterprises do not have a formal metadata management process in place because of the misconception that it is an Information Technology (IT) imperative and does not have an impact on the business.

However, the converse is true. It has been proven that a robust metadata management process is not only necessary but required for successful information management. Big data introduces large volumes of unstructured data for analysis. This data could be in the form of a text file or any multimedia file (e.g., audio, video). To bring this data into the fold of an information management solution, its metadata should be correctly defined.

Metadata management solutions provided by various vendors usually have a narrow focus. An ETL vendor will capture metadata for the ETL process. A BI vendor will provide metadata management capabilities for their BI solution. The silo-ed nature of metadata does not provide business users an opportunity to have a say and actively engage in metadata management. A good metadata management solution must provide visibility across multiple solutions and bring business users into the fold for a collaborative, active metadata management process.

METADATA MANAGEMENT CHALLENGES

Metadata, simply defined, is data about data. In the context of analytics some common examples of metadata are report definitions, table definitions, the meaning of a particular master data entity (sold-to customer, for example), ETL mappings, and formulas and computations. The importance of metadata cannot be overstated. Metadata drives the accuracy of reports, validates data transformations, ensures accuracy of calculations and enforces consistent definition of business terms across multiple business users.
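To make the notion of “data about data” concrete, here is a minimal, illustrative sketch of how such metadata entries (a report definition and an ETL mapping) might be captured as structured records. The field names and sample values are assumptions made for this sketch, not definitions from the framework.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MetadataEntry:
    """A single piece of 'data about data' kept in a metadata repository."""
    name: str                    # business name of the artifact
    kind: str                    # e.g., "report", "table", "etl_mapping", "master_entity"
    owner: str                   # accountable business or IT owner
    definition: str              # agreed business definition
    attributes: Dict[str, str] = field(default_factory=dict)

# Illustrative entries only; names and fields are assumptions for this sketch.
report_definition = MetadataEntry(
    name="Quarterly Sales by Region",
    kind="report",
    owner="Sales Operations",
    definition="Sum of invoiced sales grouped by sales region per quarter",
    attributes={"measure": "net_sales", "grain": "region x quarter"},
)

etl_mapping = MetadataEntry(
    name="load_sold_to_customer",
    kind="etl_mapping",
    owner="Data Integration Team",
    definition="Maps CRM account records to the sold-to customer master entity",
    attributes={"source": "crm.accounts", "target": "mdm.sold_to_customer"},
)

print(report_definition.name, "->", report_definition.definition)
```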

In a typical large enterprise which has grown by mergers, acquisitions and divestitures, metadata is scattered across the enterprise in various forms, as noted in the introduction.

In large enterprises, there is wide acknowledgement that metadata management is critical, but most of the time there is no enterprise-level sponsorship of a metadata management initiative. Even if there is, it is focused only on one specific project sponsored by one specific business.

The impact of good metadata management practices is not consistently understood across the various levels of the enterprise. Conversely, the impact of poorly managed metadata comes to light only after the fact, i.e., a certain transformation happens, a report or a calculation is run, or two divisional data sources are merged.

Metadata is typically viewed as the exclusive responsibility of the IT organization, with business having little or no input or say in its management. The primary reason is that there are multiple layers of organization between IT and business. This introduces communication barriers between IT and business.

Finally, metadata is not viewed as a very exciting area of opportunity. It is only addressed as an afterthought.

DIFFERENCES BETWEEN TRADITIONAL AND BIG DATA ANALYTICS

In traditional analytics implementations, data is typically stored in a data warehouse. The data warehouse is modeled using one of several techniques developed over time and is a constantly evolving entity. Analytics applications developed using the data in a data warehouse are also long-lived. Data governance in traditional analytics is a centralized process. Metadata is managed as part of the data governance process.

In traditional analytics, data is discovered, collected, governed, stored and distributed. Big data introduces large volumes of unstructured data. This data is highly dynamic and therefore needs to be ingested quickly for analysis.

Big data analytics applications, however, are characterized by short-lived, quick implementations focused on solving a specific business problem. The emphasis of Big data analytics applications is more on experimentation and speed as opposed to a long drawn-out modeling exercise.

The need to experiment and derive insights quickly using data changes the way data is governed. In traditional analytics there is usually one central governance team focused on governing the way data is used and distributed in the enterprise. In Big data analytics, there are multiple governance processes in play simultaneously, each geared towards answering a specific business question. Figure 1 illustrates this.

[Figure 1: Data Governance Shift with Big Data Analytics. A single monolithic governance process (people, rules, metrics, process) gives way to multiple concurrent governance processes, each with its own people, rules, metrics and process. Source: Infosys Research]

Most of the metadata management challenges we referred to in the previous section alluded to typical enterprise data that is highly structured. To analyze unstructured data, additional metadata definitions are necessary.

To illustrate the need to enhance metadata to support Big data analytics, consider sentiment analysis using social media conversations as an example. Say someone posts a message on Facebook: “I do not like my cell-phone reception. My wireless carrier promised wide cell coverage but it is spotty at best. I think I will switch carriers”. To infer the intent of this customer, the inference engine has to rely on metadata as well as the supporting domain ontology. The metadata will define “Wireless Carrier”, “Customer”, “Sentiment” and “Intent”. The inference engine will leverage the ontology dependent on this metadata to infer that this customer wants to switch cell phone carriers.

Big data is not just restricted to text. It could also contain images, videos, and voice files. Understanding, categorizing and creating metadata to analyze this kind of non-traditional content is critical.

It is evident that Big data introduces additional challenges in metadata management. It is clear that there is a need for a robust metadata management process which will govern metadata with the same rigor as data for enterprises to be successful with Big data analytics.

To summarize, a metadata management process specific to Big data should incorporate the context and intent of data, support non-traditional sources of data and be robust enough to handle the velocity of Big data.

ILLUSTRATIVE EXAMPLE

Consider an existing master data management system in a large enterprise. This master data system has been developed over time. It has specific master data entities like product, customer, vendor, employee, etc. The master data system is tightly governed and data is processed (cleansed, enriched and augmented) before it is loaded into the master data repository.

This specific enterprise is considering bringing in social media data for enhanced customer analytics. This social media data is to be sourced from multiple sources and incorporated into the master data management system.

As noted earlier, social media conversations have context, intent and sentiment. The context refers to the situation in which a customer was mentioned, the intent refers to the action that an individual is likely to take and the sentiment refers to the “state of being” of the individual.

For example, if an individual sent a tweet or started a Facebook conversation about a retailer from a football game, the context would then be a sports venue. If the tweet or conversation consisted of positive comments about the retailer then the sentiment would be determined as positive. If the update consisted of highlighting a promotion by the retailer then the intent would be to collaborate or share with the individual’s network.

If such social media updates have to be incorporated into any solution within the enterprise then the master data management solution has to be enhanced with metadata about “Context”, “Sentiment” and “Intent”. Static lookup information will need to be generated and stored so that an inference engine can leverage this information to provide inputs for analysis. This will also necessitate a change in the back-end. The ETL processes that are responsible for this master data will now have to incorporate the social media data as well. Furthermore, the customer information extracted from these feeds needs to be standardized before being loaded into any transaction system.
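As a rough illustration of the kind of static lookup information described above, the sketch below derives “Context”, “Sentiment” and “Intent” metadata for a post using simple keyword tables. The tables and rules stand in for the ontology-backed inference engine the paper refers to and are assumptions made purely for illustration.

```python
# Simplified, illustrative sketch: keyword lookup tables stand in for the
# domain ontology and inference engine described in the text.

STATIC_LOOKUPS = {
    "context":   {"game": "sports venue", "stadium": "sports venue", "store": "retail outlet"},
    "sentiment": {"do not like": "negative", "spotty": "negative", "love": "positive"},
    "intent":    {"switch carriers": "churn", "share": "collaborate"},
}

def enrich_with_metadata(post: str) -> dict:
    """Attach Context / Sentiment / Intent metadata to a raw social media post."""
    text = post.lower()
    derived = {}
    for dimension, lookup in STATIC_LOOKUPS.items():
        derived[dimension] = next(
            (label for phrase, label in lookup.items() if phrase in text),
            "unknown",
        )
    return {"raw_post": post, "metadata": derived}

post = ("I do not like my cell-phone reception. My wireless carrier promised "
        "wide cell coverage but it is spotty at best. I think I will switch carriers.")
print(enrich_with_metadata(post)["metadata"])
# {'context': 'unknown', 'sentiment': 'negative', 'intent': 'churn'}
```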

FRAMEWORK FOR METADATA MANAGEMENT IN BIG DATA ANALYTICS

We propose that metadata be managed using the five components shown in Figure 2.

[Figure 2: Metadata Management Framework for Big Data Analytics. Metadata flows through five components: Metadata Discovery, Metadata Collection, Metadata Governance, Metadata Storage and Metadata Distribution. Source: Infosys Research]

Metadata Discovery – Discovering metadata is critical in Big data for the reasons of context and intent noted in the prior section. Social data is typically sourced from multiple sources. All these sources will have different formats. Once metadata for a certain entity is discovered for one source it needs to be harmonized across all sources of interest. This process for Big data will need to be formalized using metadata governance.

Metadata Collection – A metadata collection mechanism should be implemented. A robust collection mechanism should aim to minimize or eliminate metadata silos. Once again, a technology or a process for metadata collection should be implemented.

Metadata Governance – Metadata creation and maintenance needs to be governed. Governance should include resources from both the business and IT teams. A collaborative framework between business and IT should be established to provide this governance. Appropriate processes (manual or technical) should be utilized for this purpose. For example, on-boarding a new Big data source should be a collaborative effort between business users and IT; IT will provide the technology to enable business users to discover metadata.

Metadata Storage – Multiple models for enterprise metadata storage exist. The Common Warehouse Meta-model (CWM) is one example. A similar model or an extension thereof can be utilized for this purpose. If one such model does not fit the requirements of an enterprise then suitable custom models can be developed.

Metadata Distribution – This is the final component. Metadata, once stored, will need to be distributed to consuming applications. A formal distribution model should be put into place to enable this distribution. For example, some applications can directly integrate with the metadata storage layer while others will need some specialized interfaces to be able to leverage this metadata.
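A skeletal sketch of the five components as a single pipeline is shown below. The class and method names are illustrative assumptions; in practice each stage would be backed by real discovery, repository and distribution tooling.

```python
class MetadataPipeline:
    """Toy sketch of the discover / collect / govern / store / distribute flow."""

    def __init__(self):
        self.repository = {}     # stands in for a CWM-style metadata store

    def discover(self, sample_record):
        """Derive candidate metadata (field names and types) from a new source."""
        return {name: type(value).__name__ for name, value in sample_record.items()}

    def collect(self, source_name, discovered):
        """Harmonize and register discovered metadata, avoiding silos."""
        self.repository.setdefault(source_name, {}).update(discovered)

    def govern(self, source_name, approver):
        """Record a joint business/IT sign-off for the registered metadata."""
        self.repository[source_name]["_approved_by"] = approver

    def distribute(self, source_name):
        """Hand metadata to consuming applications (here, simply return a copy)."""
        return dict(self.repository.get(source_name, {}))

pipeline = MetadataPipeline()
sample = {"customer_id": 123, "post_text": "spotty coverage", "sentiment": "negative"}
pipeline.collect("twitter_feed", pipeline.discover(sample))
pipeline.govern("twitter_feed", approver="MDM data steward")
print(pipeline.distribute("twitter_feed"))
```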

We note that in traditional analytics implementations, a framework similar to the one we propose exists, but with data. Figure 3 illustrates this parallel.

[Figure 3: Equal Importance of Metadata & Data Processing for Big Data Analytics. Metadata discovery, collection, governance, storage and distribution run alongside data discovery, collection, governance, storage and distribution, culminating in Big data distribution. Source: Infosys Research]

The metadata management framework should be implemented alongside a data management framework to realize Big data analytics.

THE PARADIGM SHIFT

The discussion in this paper brings to light the importance of metadata and the impact it has not only on Big data analytics but on traditional analytics as well. We are of the opinion that if enterprises want to get value out of their data assets and leverage the Big data tidal wave then the time is right to shift the paradigm from data governance to metadata governance and make data management part of the metadata governance process.

A framework is as good as how it is viewed and implemented within the enterprise. The metadata management framework is successful if there is sponsorship for this effort from the highest levels of management. This includes both business and IT leadership within the enterprise. The framework can be viewed as being very generic. Change is a constant in any enterprise. The framework can be made flexible to adapt to changing needs and requirements of the business.

All the participants and personas engaged in the data management function within an enterprise should participate in the process. This will promote and foster collaboration between business and IT. This should be made sustainable and followed diligently by all the participants until this framework is used to onboard not only new data sources but also new participants in the process.

Metadata and its management is an oft-ignored area in enterprises, with multiple consequences. The absence of robust metadata management processes leads to erroneous results, project delays and multiple interpretations of business data entities. These are all avoidable with a good metadata management framework.

The consequences affect the entire enterprise either directly or indirectly. From the lowest-level employee to the senior-most executive, incorrect or poorly managed metadata will not only affect operations but also directly impact the top-line growth and bottom-line profitability of an enterprise. Big data is viewed as the most important innovation that brings tremendous value to enterprises. Without a proper metadata management framework, this value might not be realized.

CONCLUSION

Big data has created quite a bit of buzz in the marketplace. Pioneers like Yahoo and Google created the foundations of what is today called Hadoop. There are multiple players in the Big data market today developing everything from technology to manage Big data, to applications needed to analyze Big data, to companies engaged in Big data analysis and selling that content.

In the midst of all the innovation in the Big data space, metadata is often forgotten. It is important for us to recognize and realize the importance of metadata management and the critical impact it has on enterprises.

If enterprises wish to remain competitive, they have to embark on Big data analytics initiatives. In this journey, enterprises cannot afford to ignore the metadata management problem.

REFERENCES

1. Davenport, T., and Harris, J. (2007), Competing on Analytics: The New Science of Winning, Harvard Business School Press.
2. Jennings, M., What role does metadata management play in enterprise information management (EIM)? Available at http://searchbusinessanalytics.techtarget.com/answer/The-importance-of-metadata-management-in-EIM.
3. Metadata Management Foundation Capabilities Component (2011). Available at http://mike2.openmethodology.org/wiki/Metadata_Management_Foundation_Capabilities_Component.
4. Rogers, D. (2010), Database Management: Metadata is more important than you think. Available at http://www.databasejournal.com/sqletc/article.php/3870756/Database-Management-Metadata-is-more-important-than-you-think.htm.
5. Data Governance Institute (2012), The DGI Data Governance Framework. Available at http://datagovernance.com/fw_the_DGI_data_governance_framework.html.


Optimization Model for Improving Supply Chain Visibility By Saravanan Balaraj

Enterprises need to adopt different Big data analytic tools and technologies to improve their supply chains

In today’s competitive ‘lead or leave’ marketplace, Big data is seen as a paradox that offers both challenge and opportunity. Effective and efficient strategies to acquire, manage and analyze data lead to better decision making and competitive advantage. Unlocking potential business value out of this diverse and multi-structured dataset beyond the organizational boundary is a mammoth task.

We have stepped into an interconnected and intelligent digital world where the convergence of new technologies is happening fast all around us. In this process the underlying data set is growing not only in volume but also in velocity and variety. The resulting data explosion created by a combination of mobile devices, tweets, social media, blogs, sensors and emails demands a new kind of data intelligence.

Big data has started creating a lot of buzz across verticals and Big data in supply chain is no different. Supply chain is one of the key focus areas that have been undergoing transformational changes in the recent past. Traditional supply chain applications leverage only transactional data to solve operational problems and improve efficiency. Having stepped into the Big data world, the existing supply chain applications have become obsolete as they are unable to cope with tremendously increasing data volumes cutting across multiple sources, the speed with which they are generated and the unprecedented growth in new data forms.

Enterprises are under tremendous pressure to solve new problems emerging out of new forms of data. Handling large volumes of data across multiple sources and deriving value out of this massive chunk for strategy execution is the biggest challenge that enterprises are facing in today’s competitive landscape. Careful analysis and appropriate usage of these data would result in cost reduction and better operational performance. Competitive pressures and customers’ ‘more for less’ attitudes have left enterprises with no choice other than to re-think their supply chain strategies and create a differentiation.

Enterprises need to adopt appropriate Big data techniques and technologies and build suitable models to derive value out of these unstructured data and henceforth plan, schedule and route in a cost-effective manner. The paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

BIG DATA WAVE

International Data Corporation (IDC) has predicted that the Big data market will grow from $3.2 billion in 2010 to $16.9 billion by 2015, at a compound annual growth rate of 40% [1]. This shows tremendous traction towards Big data tools, technologies and platforms among enterprises. Lots of research and investment is being carried out on how to fully tap the potential benefits hidden in Big data and derive financial value out of it. Value derived out of Big data enables enterprises to achieve differentiation by reducing cost and planning efficiently, thereby improving process efficiency.

Big data is an important asset in supply chain which enterprises are looking forward to capitalize upon. They adopt different Big data analytic tools and technologies to improve their supply chain, production and customer engagement processes. The path towards operational excellence is facilitated through efficient planning and scheduling of production and logistic processes.

Though supply chain data is really huge, it brings about the biggest opportunity for enterprises to reduce cost and improve their operational performance. The areas in supply chain planning where Big data can create an impact are: demand forecasting, inventory management, production planning, vendor management and logistics optimization. Big data can improve the supply chain planning process if appropriate business models are identified, designed, built and then executed. Some of its key benefits are: short time-to-market, improved operational excellence, cost reduction and increased profit margins.

CHALLENGES WITH SUPPLY CHAIN PLANNING

Supply chain planning process success depends on how closely demands are forecasted, inventories are managed and logistics are planned. Supply chain is the heart of an industry vertical and, if managed efficiently, drives positive business and enables sustainable advantage. With the emergence of Big data, optimizing supply chain processes has become more complicated than ever before. Handling Big data challenges in supply chain and transforming them into opportunities is the key to corporate success. The key challenges are:

■■ Volume - According to a McKinsey report, the number of RFID tags sold globally is projected to increase from 12 million in 2011 to 209 billion in 2021 [2]. Along with this, with the phenomenal increase in the usage of temperature sensors, QR codes and GPS devices, the underlying supply chain data generated has multiplied manifold beyond our expectations. Data is flowing across multiple systems and sources and hence is likely to be error-prone and incomplete. Handling such huge data volumes is a challenge.

■■ Velocity - Business has become highly dynamic and volatile. The changes arising due to unexpected events must be handled in a timely manner in order to avoid losing out in business. Enterprises are finding it extremely difficult to cope with this data velocity. Optimal decisions must be made quickly, and shorter processing time is the key for successful operational execution, which is lacking in traditional data management systems.

■■ Variety - In supply chain, data has emerged in different forms which don’t fit in traditional applications and models. Structured (transactional) data, unstructured (social) data and sensor data (temperature and RFID), along with new data types (video, voice and digital images), have created nightmares for enterprises trying to handle such diverse and heterogeneous data sets.

In today’s data explosion in terms of volume, variety and velocity, handling them alone doesn’t suffice. Value creation by analyzing such massive data sets and extracting data intelligence for successful strategy execution is the key.

[Figure 1: Optimization Model for Improving Supply Chain Visibility - I (the Acquire stage). Data sourcing spans launch, promotion, customer, inventory, transportation and channel data in structured, unstructured, transactional, social, time-bound and sensor forms (RFID, QR, temperature) as well as new types such as video, voice and digital images; data extraction and cleansing draws on transactional/OLTP systems and Big data systems (Cascading, Hive, Pig, MapReduce, HDFS, NoSQL); data representation readies the data for analysis. Source: Infosys Research]

BIG DATA IN DEMAND FORECASTING & SUPPLY CHAIN PLANNING

Enterprises use forecasting to determine how much of each product type to produce, and when and where to ship it, thereby improving supply chain visibility. Inaccurate forecasts cause detrimental effects in the supply chain. Over-forecasting results in inventory pile-ups and locked working capital. Under-forecasting leads to failure in meeting demand, resulting in loss of customers and sales. Hence in today’s volatile market, comprised of unpredictable shifts in customer demands, improving the accuracy of forecasts is of paramount importance.

Data in supply chain planning has mushroomed in terms of volume, velocity and variety. Tesco, for instance, generates more than 1.5 billion new data items every month. Wal-Mart’s warehouse handles some 2.5 petabytes of information, which is roughly equivalent to half of all the letters delivered by the US Postal Service in 2010. According to a McKinsey Global Institute report [2], leveraging Big data in demand forecasting and supply chain planning could increase profit margins by 2-3% in the Fast Moving Consumer Goods (FMCG) manufacturing value chain. This unearths tremendous opportunity in forecasting and supply chain planning for enterprises to capitalize on this Big data deluge.

MISSING LINKS IN TRADITIONAL APPROACHES

Enterprises have started realizing the importance of Big data in forecasting and have begun investing in Big data forecasting tools and technologies to improve their supply chain, production and manufacturing planning processes. Traditional forecasting tools aren’t adequate for handling huge data volumes, variety and velocity. Moreover they are missing out on the following key aspects which improve the accuracy of forecasts:

■■ Social Media Data As An Input: Social media is a platform that enables enterprises to collect information about potential and prospective customers. Technological advancements have made tracking customer data easier. Companies can now track every visit a customer makes to their websites, e-mails exchanged and comments logged across social media websites. Social media data helps analyze the customer pulse and gain insights on forecasting, planning and scheduling of supply chain and inventories. Buzz in social networks can be used as an input for demand forecasting, with numerous benefits. In one such use case, an enterprise can launch a new product to online fans to sense customer acceptance. Based on the response, inventories and supply chain can be planned to direct stocks to high-buzz locations during the launch phase.

■■ Predict And Respond Approach: Traditional forecasting is done by analyzing historical patterns and considering sales inputs and promotional plans to forecast demand and plan the supply chain. It focuses on ‘what happened’ and works on a ‘sense and respond’ strategy. ‘History repeats itself’ is no longer apt in today’s competitive marketplace. Enterprises need to focus on ‘what will happen’ and require a ‘predict and respond’ strategy to stay alive in business. This calls for models and systems capable of capturing, handling and analyzing huge volumes of real-time data generated from unexpected competitive events, weather patterns, point-of-sale data and natural disasters (volcanoes, floods, etc.) and converting them into actionable information for forecasting plans on production, inventory holdings and supply chain distribution.

■■ Optimized Decisions with Simulations: Traditional decision support systems lack flexibility to meet changing data requirements. In real-world scenarios, the supply chain delivery plan changes unexpectedly due to various reasons like demand changes, revised sales forecasts, etc. The model and system should have the ability to factor this in and respond quickly to such unplanned events. Decisions should be taken only after careful analysis of the unplanned event’s impact on other elements of the supply chain. Traditional approaches lack this capability, and this necessitates a model for performing what-if analysis on all possible decisions and selecting the optimal one in the Big data context.
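As a toy illustration of using social media buzz as an additional forecasting input, the sketch below fits a simple least-squares model that relates historical demand to a baseline level plus a buzz signal. The data, the coefficients and the idea of a plain linear fit are assumptions made for illustration only, not the modeling approach prescribed by the paper.

```python
import numpy as np

# Illustrative history: units sold per week and a social media "buzz" index.
weekly_demand = np.array([120, 135, 128, 160, 155, 190, 210])
buzz_index    = np.array([ 10,  12,  11,  25,  22,  40,  48])

# Fit demand ~ intercept + coefficient * buzz with ordinary least squares.
design = np.column_stack([np.ones_like(buzz_index), buzz_index])
(intercept, buzz_coeff), *_ = np.linalg.lstsq(design, weekly_demand, rcond=None)

# Predict next week's demand for an expected buzz level (e.g., a planned launch).
expected_buzz = 60
forecast = intercept + buzz_coeff * expected_buzz
print(f"baseline={intercept:.1f}, buzz effect={buzz_coeff:.2f}, forecast={forecast:.0f} units")
```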

IMPROVING SUPPLY CHAIN VISIBILITY USING BIG DATA

Supply chain doesn’t lack data – what’s missing is a suitable model to convert this huge, diverse raw data into actionable information so that enterprises can make critical business decisions for efficient supply chain planning. A 3-stage optimized value model helps to overcome the challenges posed by Big data in supply chain planning and demand forecasting. It bridges the existing gaps in traditional Big data approaches and offers a perspective to unlock the value from the growing Big data torrent. Designing and building an optimized Big data model for supply chain planning is a complex task, but successful execution leads to significant financial benefits. Let’s take a deep dive into each stage of this model and analyze its value-add in the enterprise’s supply chain planning process.

Acquire Data: The biggest driver of supply chain planning is data. Acquiring all the relevant data for supply chain planning is the first step in this optimized model. It involves three steps, namely data sourcing, data extraction and cleansing, and data representation, which make data ready for further analysis.

■■ Data Sourcing - Data is available in different forms across multiple sources, systems and geographies. It contains extensive details of historical demand data and other relevant information. For further analysis it is therefore necessary to source the required data. Data to be sourced for improving the accuracy of forecasts, in addition to transactional data, are:

■■ Product promotion data - items, prices, sales
■■ Launch data - items to be ramped up or down
■■ Inventory data - stock in warehouse
■■ Customer data - purchase history, social media data
■■ Transportation data - GPS and logistics data.

Enterprises should adopt appropriate Big data systems that are capable of handling such huge data volumes, variety and velocity.

■■ Data Extraction and Cleansing - Data sources are available in different forms, from structured (transactional data) to unstructured (social media, images, sensor data, etc.), and they are not in analysis-friendly formats. Also, due to the large volume of heterogeneous data there is a high probability of inconsistencies and data errors while sourcing. The sourced data should be expressed in structured form for supply chain planning. Moreover, analyzing inaccurate and untimely data leads to erroneous, non-optimal results. High-quality and comprehensive data is a valuable asset, and appropriate data cleansing mechanisms should be in place for maintaining the quality of Big data. The choice of Big data tools for data cleansing and enrichment plays a crucial role in supply chain planning.

■■ Data Representation – Database design for such huge data volumes is a herculean task and poses some serious performance issues if not executed properly. Data representation plays a key role in Big data analysis. There are numerous ways to store data and each design has its own set of advantages and drawbacks. Selection of an appropriate database design, and executing the design favoring business objectives, reduces the effort in reaping benefits out of Big data analysis in supply chain planning.
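A minimal sketch of the extraction-and-cleansing step is shown below: heterogeneous source records are mapped to one structured form and obviously bad rows are dropped. The field names, source formats and cleansing rules are illustrative assumptions only.

```python
# Illustrative cleansing sketch: normalize heterogeneous records into one
# structured form suitable for supply chain planning. Field names are assumed.

RAW_RECORDS = [
    {"source": "pos",    "sku": "A-101", "qty": "12", "ts": "2013-01-05"},
    {"source": "rfid",   "tag": "A-101", "count": 11, "read_at": "2013-01-05"},
    {"source": "social", "text": "love product A-101!", "ts": "2013-01-06"},
    {"source": "pos",    "sku": "",      "qty": "??", "ts": "2013-01-07"},   # bad row
]

def clean(record):
    """Map a raw record to (sku, quantity, date); return None if unusable."""
    if record["source"] == "pos":
        sku, qty, date = record["sku"], record["qty"], record["ts"]
    elif record["source"] == "rfid":
        sku, qty, date = record["tag"], record["count"], record["read_at"]
    else:
        return None                     # unstructured text is handled elsewhere
    if not sku or not str(qty).isdigit():
        return None                     # drop inconsistent or incomplete rows
    return {"sku": sku, "quantity": int(qty), "date": date}

cleaned = [row for row in map(clean, RAW_RECORDS) if row is not None]
print(cleaned)   # two usable, structured rows remain
```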

Analyze Data: The next stage is analyzing the cleansed data and capturing value for forecasting and supply chain planning. There is a plethora of Big data techniques available in the market for forecasting and supply chain planning. The selection of a Big data technique depends on the business scenario and enterprise objectives. Incompatible data formats make value creation from Big data a complex task, and this calls for innovation in techniques to unlock business value out of the growing Big data torrent. The proposed model adopts an optimization technique to generate insights out of this voluminous and diverse Big dataset.

■■ Optimization in Big data analysis - Manufacturers have started synchronizing forecasting with production cycles, so the accuracy of forecasting plays a crucial role in their success. Adoption of an optimization technique in Big data analysis creates a new perspective and helps in improving the accuracy of demand forecasting and supply chain planning. Analyzing the impact of promotions on one specific product for demand forecasting appears to be an easy task. But real-life scenarios comprise a huge array of products, with the factors affecting their demand varying for every product and location, making it difficult for traditional techniques to analyze the data.

Optimization techniques have several capabilities which make them an ideal choice for data analysis in such scenarios. Firstly, the technique is designed for analyzing and drawing insights from highly complex systems with huge data volumes and multiple constraints and factors to be accounted for. Secondly, supply chain planning has a number of enterprise objectives associated with it, like cost reduction, demand fulfillment, etc. The impact of each of these objective measures on enterprise profitability can be easily analyzed using the optimization technique. Flexibility is another benefit that makes optimization suitable for Big data analysis to uncover new data connections and turn them into insights.

The optimization model comprises four components, viz., (i) input – consistent, real-time, quality data which is sourced, cleansed and integrated becomes the input of the optimization model; (ii) goals – the model should take into consideration all the goals pertaining to forecasting and supply chain planning, like minimizing cost, maximizing demand coverage, maximizing profits, etc.; (iii) constraints – the model should incorporate all the constraints specific to supply chain planning; some of the constraints are minimum inventory in warehouse, capacity constraint, route constraint, demand coverage constraint, etc.; and (iv) output – results based on the input, goals and constraints defined in the model that can be used for strategy execution. The result can be a demand plan, inventory plan, production plan, logistics plan, etc.

[Figure 2: Optimization Model for Improving Supply Chain Visibility – II. The Acquire stage (data sourcing, data extraction & cleansing, data representation) feeds an optimization technique in the Analyze stage, with inputs, goals (Min Cost, Max Profit, Max Demand Coverage), constraints (capacity, demand coverage, route) and outputs (inventory plan, demand plan, logistics plan); the Achieve stage covers scenario management (build, compare, simulate), performance trackers, KPI dashboards (actual vs. planned) and multi-user collaboration. Source: Infosys Research]
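To ground the input / goals / constraints / output structure, here is a deliberately small linear-programming sketch using scipy: it minimizes shipping cost from two warehouses to two regions subject to warehouse capacity and demand coverage constraints, and returns a shipment (logistics) plan. All numbers are made up for illustration, and a real supply chain model would have far more variables and constraints.

```python
from scipy.optimize import linprog

# Decision variables: units shipped from warehouse w to region r,
# ordered as [w1->r1, w1->r2, w2->r1, w2->r2].
cost = [4, 6, 5, 3]                     # goal: minimize total shipping cost

# Capacity constraints (<=): each warehouse can ship at most its stock.
A_ub = [[1, 1, 0, 0],                   # warehouse 1 capacity
        [0, 0, 1, 1]]                   # warehouse 2 capacity
b_ub = [80, 70]

# Demand coverage constraints (==): each region's demand must be met exactly.
A_eq = [[1, 0, 1, 0],                   # region 1 demand
        [0, 1, 0, 1]]                   # region 2 demand
b_eq = [60, 50]

result = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(0, None)] * 4, method="highs")

print("total cost:", result.fun)
print("shipment plan [w1->r1, w1->r2, w2->r1, w2->r2]:", result.x)
```

Changing a capacity or demand figure and re-solving is, in miniature, the kind of what-if analysis the model is meant to support.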

■■ Choice of Algorithm: One of the key differentiators in supply chain planning is the algorithm used in modeling. Optimization problems have numerous possible solutions and the algorithm should have the capability to fine-tune itself to achieve optimal solutions.

Achieve Business Objective: The final stage in this model is achieving business objectives through demand forecasting and supply chain planning. It involves three steps which facilitate the enterprise’s supply chain decisions.

■■ Scenario Management – Business events are difficult to predict and most of the time deviate from their standard paths, resulting in unexpected behaviors and events. This makes it difficult to plan and optimize during uncertain times. Scenario management is the approach to overcome such uncertain situations. Scenario management facilitates creating business scenarios, comparing multiple different scenarios, and analyzing and assessing their impact before making decisions. This capability helps to balance conflicting KPIs and arrive at an optimal solution matching business needs.

■■ Multi User Collaboration – An optimization model in a real business case comprises highly complex data sets and models, which require support from an army of analysts to determine its effects on enterprise goals. Combinations of technical and domain experts are required to obtain optimal results. To achieve near-accurate forecasts and supply chain optimization the model should support multi-user collaboration so that multiple users can collaboratively produce optimal plans and schedules and re-optimize as and when business changes. This model builds a collaborative system with the capability of supporting inputs from multiple users and incorporating them in its decision-making process.

■■ Performance Tracker – Demand forecasting and supply chain planning do not follow a build-model-execute approach; they require significant continuous effort. Frequent changes in the inputs and business rules necessitate monitoring of data, model and algorithm performance. Actual and planned results are to be compared regularly and steps taken to minimize the deviations in accuracy. KPIs are to be defined and dashboards should be constantly monitored for model performance.
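A small sketch of the actual-vs-planned comparison behind such a performance tracker is shown below: it computes the mean absolute percentage error (MAPE) of a forecast against actuals and flags the deviation when it crosses a threshold. The threshold and the sample data are illustrative assumptions.

```python
def mape(actuals, planned):
    """Mean absolute percentage error between actual and planned values."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actuals, planned)) / len(actuals)

# Illustrative weekly figures (units); in practice these come from the KPI dashboard.
actual_demand  = [120, 135, 150, 190]
planned_demand = [118, 142, 160, 170]

error = mape(actual_demand, planned_demand)
THRESHOLD = 8.0                       # assumed KPI threshold, in percent
status = "re-plan required" if error > THRESHOLD else "within tolerance"
print(f"MAPE = {error:.1f}% -> {status}")
```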

KEY BENEFITS

Enterprises can accrue a lot of benefits by adopting this 3-stage model for Big data analysis. Some of them are detailed below.

Improves Accuracy of Forecast: One of the key objectives of forecasting is profit maximization. This model adopts effective data sourcing, cleansing and integration systems and makes data ready for forecasting. Inclusion of social media data, promotional data, weather predictions and seasonality, in addition to historical demand and sales histories, adds value and improves forecasting accuracy. Moreover, the optimization technique for Big data analysis reduces forecasting errors to a great extent.

Continuous Improvement: The Acquire-Analyze-Achieve model is not a hard-wired model. It allows flexibility to fine-tune and supports what-if analysis. Multiple scenarios can be created, compared and simulated to identify the impact of a change on the supply chain and demand forecasting prior to making any decisions. It also enables the enterprise to define, track and monitor KPIs from time to time, resulting in continuous process improvements.

Better Inventory Management: Inventory data along with weather predictions, history of sales and seasonality is considered as an input to the model for forecasting and planning the supply chain. This approach minimizes incidents of out-of-stocks or over-stocks across different warehouses. An optimal plan for inventory movement is forecasted and appropriate stocks are maintained at each warehouse to meet the upcoming demand. To a great extent this will reduce loss of sales and business due to stock-outs and leads to better inventory management.

Logistic Optimization: Constant sourcing and continuous analysis of transportation data (GPS and other logistics data), and using them for demand forecasting and supply chain planning through optimization techniques, helps in improving distribution management. Moreover, optimization of logistics improves fuel efficiency and enables efficient routing of vehicles, resulting in operational excellence and better supply chain visibility.

CONCLUSIONS

As rapid penetration of information technology in supply chain planning continues, the amount of data that can be captured, stored and analyzed has increased manifold. The challenge is to derive value out of these large volumes of data by unlocking financial benefits congruent with the enterprise’s business objectives. Competitive pressures and customers’ ‘more for less’ attitude have left enterprises with no option other than reducing cost in their operational executions. Adopting effective and efficient supply chain planning and optimization techniques to match customer expectations with its offerings is the key to corporate success. To attain operational excellence and sustainable advantage, it is necessary for the enterprise to build innovative models and frameworks leveraging the power of Big data.

The optimized value model on Big data offers a unique way of demand forecasting and supply chain optimization through collaboration, scenario management and performance management. This model of continuous improvement opens up doors to big opportunities for the next generation of demand forecasting and supply chain optimization.

REFERENCES

1. IDC Press Release (2012), IDC Releases First Worldwide Big Data Technology and Services Market Forecast, Shows Big Data as the Next Essential Capability and a Foundation for the Intelligent Economy. Available at http://www.idc.com/getdoc.jsp?containerId=prUS23355112.
2. McKinsey Global Institute (2011), Big data: The next frontier for innovation, competition, and productivity. Available at http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx.
3. Furio, S., Andres, C., Lozano, S., Adenso-Diaz, B. (2009), Mathematical model to optimize land empty container movements. Available at http://www.fundacion.valenciaport.com/Articles/doc/presentations/HMS2009_Paperid_27_Furio.aspx.
4. Stojković, G., Soumis, F., Desrosiers, J., Solomon, M. (2001), An optimization model for a real-time flight scheduling problem. Available at http://www.sciencedirect.com/science/article/pii/S0965856401000398.
5. Beck, M., Moore, T., Plank, J., Swany, M. (2000), Logistical Networking. Available at http://loci.cs.utk.edu/ibp/files/pdf/LogisticalNetworking.pdf.
6. Lasschuit, W., Thijssen, N. (2004), Supporting supply chain planning and scheduling decisions in the oil and chemical industry, Computers and Chemical Engineering, issue 28, pp. 863–870. Available at http://www.aimms.com/aimms/download/case_studies/shell_elsevier_article.pdf.


Retail Industry – Moving to Feedback Economy By Prasanna Rajaraman and Perumal Babu

Gain better insight into customer dynamics through Big Data analytics

The retail industry is going through a major paradigm shift. The past decade has seen unprecedented churn in the retail industry, virtually changing the landscape. Erstwhile marquee brands from the traditional retailing side have ceded space to start-ups and new business models.

The key driver of this change is a confluence of technological, sociological and customer behavioral trends creating this strategic inflection point in the retailing ecology. Trends like the emergence of the internet as a major retailing channel, social platforms going mainstream, pervasive retailing and the emergence of the digital customer have presented a major challenge to traditional retailers and retailing models.

On the other hand, these trends have also enabled opportunities for retailers to better understand customer dynamics. For the first time, retailers have access to an unprecedented amount of publicly available information on customer behavior and trends, voluntarily shared by customers. The more effectively retailers can tap into these behavioral and social reservoirs of data, the better they can model the purchasing behaviors and trends of their current and prospective customers. Such data can also provide retailers with predictive intelligence, which if leveraged effectively can create enough mindshare that the sale is completed even before the conscious decision to purchase is taken.

This move to a feedback economy, where retailers can have a 360-degree view of the customer thought process across the selling cycle, is a paradigm shift for the retail industry – from the retailer driving sales to the retailer engaging the customer across the sales and support cycle. Every aspect of retailing, from assortment/allocation planning and marketing/promotions to customer interactions, has to take the evolving consumer trends into consideration.

The implication from a business perspective is that retailers have to better understand customer dynamics and align business processes effectively with these trends. In addition, this implies that cycle times will be shorter and businesses have to be more tactical in their promotions and offerings. Retailers who can ride this wave will be better able to address demand and command higher margins for their products and services. Failing this, retailers will be left with the low-margin pricing/commodity space.

From an information technology perspective, the key challenge is that the nature of this information with respect to lifecycle, velocity, heterogeneity of sources and volume is radically different from what traditional systems handle. Also, there are overarching concerns like data privacy, compliance and regulatory changes that need to be internalized with internal processes. The key is to manage the lifecycle of this Big data, effectively integrate it with the organizational systems and derive actionable information.

TOWARDS A FEEDBACK ECONOMY

Customer dynamics refers to the customer-business relationships that describe the ongoing interchange of information and transactions between customers and organizations, which goes beyond the transactional nature of the interaction to look at emotions, intent and desires. Retailers can create significant competitive differentiation by understanding the customer’s true intent in a way that also supports the business’ intents [1, 2, 3, 4].

John Boyd, a colonel and military strategist in the US Air Force, developed the OODA loop (Observe, Orient, Decide and Act), which he used for combat operations. Today’s business environment is nothing different: retailers are battling to get customers into their shops (physical or net-front) and convert their visits to sales, and understanding customer dynamics plays a key role in this effort. The OODA loop explains the crux of the feedback economy.

[Figure 1: OODA loop. Observation feeds orientation (shaped by genetic heritage, cultural traditions, previous experiences and new information), which guides decision (hypothesis) and action (test); implicit guidance and control, unfolding circumstances and the unfolding interaction with the environment feed back into observation. Source: Reference [5]]

In a feedback economy, there is constant feedback to the system from every phase of its execution. Along with this, the organization should observe the external environment, unfolding circumstances and customer interactions. These inputs are analyzed and action is taken based on them. This cycle of adaptation and optimization makes the organization more efficient and effective on an ongoing basis. Leveraging this feedback loop is pivotal to a proper understanding of customer needs and wants and of evolving trends. In today's environment, this means acquiring data from heterogeneous sources, viz., in-store transaction history, web analytics, etc. This creates a huge volume of data that has to be analyzed to get the required actionable insights.

BIG DATA LIFECYCLE: ACQUIRE-ANALYZE-ACTIONIZE

The lifecycle of Big data can be visualized as a three-phased approach resulting in continuous optimization. The first step in moving towards the feedback economy is to acquire data. Here, the retailer should look into macro and micro environment trends and consumer behavior - their likes, emotions, etc. Data from electronic channels like blogs, social networking sites and Twitter will give the retailer a humongous amount of data regarding the consumer. These feeds help the retailer understand consumer dynamics and give more insight into her buying patterns.

The key advantage of plugging into these disparate sources is the sheer information one can gather about the customer – both individually and in aggregate. On the other hand, Big data is materially different from the data retailers are used to handling. Most of the data is unstructured (from blogs, Twitter feeds, etc.) and cannot be directly integrated with traditional analytics tools, leading to challenges in how the data can be assimilated into backend decision-making systems and analyzed.

In the assimilate/analyze phase, the retailer must decide which data is of use and define rules for filtering out the unwanted data. Filtering should be done with utmost care, as there are cases where indirect inferences are possible. The data available to the retailer after the acquisition phase will be in multiple formats and has to be cleaned and harmonized with the backend platforms.

Cleaned-up data is then mined for actionable insights. Actionize is the phase where the insights gathered from the analyze phase are converted into actionable business decisions by the retailer. The response, i.e., the business outcome, is fed back to the system so that the system can self-tune on an ongoing basis, resulting in a self-adaptive system that leverages Big data and feedback loops to offer business insight more customized than what would traditionally be possible. It is imperative to understand that this feedback cycle is an ongoing process and is not to be considered a one-stop solution for the analytics needs of a retailer.
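The three phases and the feedback path described above can be pictured as a simple closed loop. The sketch below is a minimal illustration in Python, with hypothetical feed, analysis and decision functions; it is not a reference implementation of any particular retail platform.

# Minimal sketch of the acquire -> analyze -> actionize cycle with feedback.
# All data sources, rules and numbers here are hypothetical placeholders.

def acquire(feedback):
    # Pull raw signals from heterogeneous sources (transactions, web analytics,
    # social feeds). A tiny fabricated sample stands in for real feeds; the
    # feedback from the previous cycle could steer which feeds are pulled.
    return [
        {"source": "pos", "item": "jacket", "units": 120},
        {"source": "social", "item": "jacket", "mentions": 450},
    ]

def analyze(raw_records):
    # Filter, clean and harmonize the feeds, then derive a simple insight.
    demand_signal = sum(r.get("units", 0) + r.get("mentions", 0) for r in raw_records)
    return {"item": "jacket", "demand_signal": demand_signal}

def actionize(insight):
    # Convert the insight into a business decision and return the outcome,
    # which becomes feedback for the next cycle.
    decision = "increase_allocation" if insight["demand_signal"] > 300 else "hold"
    return {"decision": decision, "observed_lift": 0.05}  # hypothetical result

feedback = {}
for cycle in range(3):  # the loop is ongoing, not a one-off exercise
    insights = analyze(acquire(feedback))
    feedback = actionize(insights)
    print(f"cycle {cycle}: {feedback}")

The point of the loop is that the outcome of each action is fed back, so the next acquisition and analysis round starts from what was just learned.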

ACQUIRE: FOLLOWING CUSTOMER FOOTPRINTS

To understand the customer, retailers have to leverage every interaction with the customer and tap into every source of customer insight. Traditionally, retailers have relied primarily on in-store customer interactions and associated transaction data, along with specialized campaigns like opinion polls, to gain better insight into customer dynamics. While this interaction looks limited, a recent incident shows how powerfully customer sales history can be leveraged to gain predictive intelligence on customer needs.

“A father of a teenage girl called in to a major North American retailer to complain that the retailer had mailed coupons for child care products addressed to his underage daughter. A few days later, the same father called in and apologized: his daughter was indeed pregnant and he was not aware of it earlier” [6]. Surprisingly, by all indications, only in-store purchase data was mined by the retailer in this scenario to identify the customer need, which in this case was for childcare products.

To exploit the power of the next generation of analytics, retailers must plug into data from non-traditional sources like social sites, Twitter feeds, environment sensor networks, etc., to have better insight into customer needs. Most major retailers now have multiple channels – brick/mortar store, online store, mobile apps, etc. Each of these touch points not only acts as a sales channel but can also generate data on customer needs and wants. By coupling this information with other repositories like Facebook posts, Twitter feeds (i.e., sentiment analysis) and web analytics, retailers have the opportunity to track customer footprints both in and outside the store and to customize their offerings and interactions with the customer.

Traditionally, retailers have dealt with voluminous data. For example, Wal-Mart logs more than 2.5 petabytes of information about customer transactions every hour, equivalent to 167 times the books in the Library of Congress [7]. However, the nature of Big data is materially different from traditional transaction data and this must be considered while data planning is done. Further, while data is readily available, the legality and compliance aspects of gathering and using it are an additional consideration. Integrating information from multiple sources can result in generating data that is beyond what the user originally consented to, potentially resulting in liability for the retailer. Given that most of this information is accessible globally, retailers should ensure compliance with local regulations (EU data/privacy protection regulations, HIPAA for US medical data, etc.) wherever they operate.

ANALYZE - INSIGHTS (LEADS) TO INNOVATION

Analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources) [9]. The key to acquiring Big data is to handle these dimensions while assimilating the aforementioned external sources of data. To understand how Big data analytics can enrich and enhance a typical retail process – allocation planning – let us look at the allocation planning case study of a major North American apparel retailer.

The forecasting engine used for the planning process uses statistical algorithms to determine allocation quantities. Key inputs to the forecasting engine are sales history and current store performance. In addition, adjustments are made based on parameters like promotional events (including markdowns), current stock levels and back orders to determine the inventory that needs to be shipped to a particular store.

While this is fairly in line with the industry standard for allocation forecasting, Big data can enrich this process by including additional parameters that can impact demand. For example, a news piece on a town's go-green initiative or no-plastic day can be taken as an additional adjustment parameter for non-green items in that area. Similarly, a weather forecast of a warm front in an area can automatically trigger a reduction of stocks of warm clothing for stores there.
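To make the idea of enriching a standard allocation forecast concrete, the fragment below sketches how external signals such as a weather forecast or a local go-green news item might be applied as adjustment factors on top of a baseline forecast. The store identifiers, signals and factor values are invented for illustration and do not come from the case study.

# Hypothetical baseline allocation per store, e.g., from a statistical forecast.
baseline_allocation = {"store_001": 500, "store_002": 350}

# External Big data signals per store (invented examples).
external_signals = {
    "store_001": ["warm_front_forecast"],
    "store_002": ["go_green_initiative"],
}

# Illustrative adjustment factors for a warm-clothing / non-green product line.
adjustment_factors = {
    "warm_front_forecast": 0.8,   # reduce warm-clothing stock
    "go_green_initiative": 0.9,   # reduce non-green items
}

def adjusted_allocation(store, units):
    factor = 1.0
    for signal in external_signals.get(store, []):
        factor *= adjustment_factors.get(signal, 1.0)
    return round(units * factor)

for store, units in baseline_allocation.items():
    print(store, adjusted_allocation(store, units))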

A high-level logical view of a Big data implementation is explained below to further the understanding of how Big data can be assimilated with traditional data sources. The data feeds for the implementation come from various structured sources like forums, feedback forms and rating sites, unstructured sources like the social web, as well as semi-structured data from emails, word documents, etc. Compared to traditional systems this is a veritable data feast, but it is important that we diet on such data and use only those feeds that create optimum value. This is done through a synergy of business knowledge and processes specific to the retailer and the industry segment the retailer operates in, and a set of tools specialized in analyzing huge volumes of data at rapid speed. Once data is massaged for downstream systems, big analytics tools are used to analyze it. Based on business needs, real-time or offline data processing/analytics can be used. In real-life scenarios, both approaches are used depending on situation and need. Proper analysis needs data not just from consumer insight sources but also from transactional data history and consumer profiles.

ACTIONIZE – BIG DATA TO BIG IDEAS

This is the key part of the Big data cycle. Even the best data cannot substitute for timely action. The technology and functional stack will facilitate the retailer getting proper insight into key customer intent on purchase – what, where, why and at what price. By knowing this, the retailer can customize the 4Ps (product, pricing, promotions and place) to create enough mindshare from the customer's perspective that sales become inevitable [10].

Best Sellers in Tablet PCs: 1. Kindle Fire HD 7", Dolby Audio, Dual-Band Wi-Fi, 32GB; 2. Kindle Fire HD 8.9", Dolby Audio, Dual-Band Wi-Fi, 16GB; 3. Samsung Galaxy Tab 2 (7-Inch, Wi-Fi); 4. Samsung Galaxy Tab 2 (10.1-Inch, Wi-Fi); 5. Kindle Fire HD 8.9", Dolby Audio, Dual-Band Wi-Fi, 32 GB.

Most Wished For in Tablet PCs: 1. Kindle Fire HD 8.9", 4G LTE Wireless, Dolby Audio, Dual-Band Wi-Fi, 32GB; 2. Kindle Fire HD 8.9", Dolby Audio, Dual-Band Wi-Fi, 32GB; 3. Kindle Fire, Full Color 7" Multi-touch Display, Wi-Fi; 4. Kindle Fire HD 7", Dolby Audio, Dual-Band Wi-Fi, 32 GB; 5. Samsung Galaxy Tab 2 (7-Inch, Wi-Fi).

Figure 2: Correlation between Customer Ratings and Sales. Source: Reference [12]

For example, a cursory look at a random product category (tablets) on an online retailer's site shows a strong correlation between customer ratings and sales: 4 out of the 6 best user-rated products are in the top five in sales – a 60% correlation even when other parameters like brand, price and release date are not taken into consideration [Fig. 2] [12]. A retailer who knows the customer ratings can offer promotions that tip the balance between a sale and a lost opportunity. While this example may not be the rule, the key to analyzing and actionizing the data is to correlate user feedback data with concomitant sales.
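The kind of overlap quoted above is straightforward to compute once the two ranked lists are available. The snippet below shows the calculation on made-up product identifiers loosely modelled on Figure 2; it is not the exact method or data behind that figure.

# Hypothetical ranked lists gathered from a retail site.
best_sellers = ["fire_hd_7_32", "fire_hd_89_16", "tab2_7", "tab2_101", "fire_hd_89_32"]
top_rated    = ["fire_hd_89_4g", "fire_hd_89_32", "fire_7", "fire_hd_7_32", "tab2_7", "tab2_101"]

overlap = set(best_sellers) & set(top_rated)
share = len(overlap) / len(top_rated)

print(f"{len(overlap)} of {len(top_rated)} top-rated products are also best sellers ({share:.0%})")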

BIG DATA OPPORTUNITIES

The implication of Big data analytics for major retailing processes will be along the following areas.

■■ Identifying the Product Mix: Assortment and allocation will need to take into consideration the evolving user trends identified from Big data analytics to ensure the offering matches market needs. Allocation planning especially has to be tactical, with shorter lead times.

■■ Promotions and Pricing: Retailers have to move from generic pricing strategies to customized, user-specific ones.

■■ Communication with Customer: Advertising will move from mass media to personalized communication, and from one-way to two-way communication. Retailers will gain more from viral marketing [13] than from traditional advertising channels.

■■ Compliance: Governmental regulations and compliance requirements are mandatory to avoid liability, as co-mingling data from disparate sources can result in the generation of personal data beyond the scope of the original user's intent. While data is available globally, its use has to comply with the local law of the land and be done keeping in mind the customer's sensibilities.

■■ People, Process and Organizational Dynamics: The move to a feedback economy requires a different organizational mindset and processes. Decision making will need to be more bottom-up and collaborative. Retailers need to engage the customer to ensure the feedback loop is in place. Further, Big data being cross-functional, it needs the active participation of, and coordination between, various departments in the organization; hence managing organizational dynamics is a key consideration.

■■ Better Customer Experience: Organizations can improve the overall customer experience by providing updates on services and thereby eliminating surprises. For instance, Big data solutions can be used to proactively inform customers of expected shipment delays based on traffic data, climate and other external factors.

BIG DATA ADOPTION STRATEGY

Presented below is a perspective on how to adopt a Big data solution within the enterprise.

Define Requirements, Scope and Mandate: Define the mandate and objective in terms of what is required from the Big data solution. A guiding factor in identifying the requirements would be the prioritized list of business strategies. As part of initiation, it is important to also identify the goals and KPIs that justify the usage of Big data.
Key Player: Business

Choosing the Right Data Sources: Once the requirement and scope are defined, the IT department has to identify the various feeds that would fetch the relevant data. These feeds may be structured, semi-structured or unstructured, and the sources could be internal or external. For internal sources, policies and processes should be defined to enable a frictionless flow of data.
Key Players: IT and Business

Choosing the Required Tools and Technologies: After deciding upon the sources of data that would feed the system, the right tools and technology should be identified and aligned with business needs. Key areas are capturing the data, tools and rules to clean the data, identifying tools for real-time and offline analytics, and identifying storage and other infrastructure needs.
Key Player: IT

Creating Inferences from Insights: One of the key factors in a successful Big data implementation is to have a pool of talented data analysts who can create proper inferences from the insights and facilitate the build and definition of new analytic models. These models help in probing the data and understanding the insights.
Key Player: Data Analyst

Strategy to Actionize the Insights: Business should create processes that take these inferences as inputs to decision making. Stakeholders in decision making should be identified and actionable inferences have to be communicated at the right time. Speed is critical to the success of Big data.
Key Player: Business

Measuring the Business Benefits: The success of the Big data initiative depends on the value it creates for the organization and its decision-making body. It should also be noted that, unlike other initiatives, Big data initiatives are usually a continuous process in search of the best results. Organizations should be in tune with this understanding to derive the best results. However, it is important that a goal is set and measured to track the initiative and ensure its movement in the right direction.
Key Players: IT and Business

CONCLUSION

The move to a feedback economy presents an inevitable paradigm shift for the retail industry. Big data as the enabling technology will play a key role in this transformation. As ever, business needs will continue to drive technology processes and solutions. However, given the criticality of Big data, organizations will need to treat Big data as an existential strategy and make the right investments to ensure they can ride the wave.

REFERENCES
1. Customer dynamics. Available at http://en.wikipedia.org/wiki/Customer_dynamics.
2. Davenport, T. and Harris, G. (2007), Competing on Analytics, Harvard Business School Publishing.
3. DeBorde, M. (2006), Do Your Organizational Dynamics Determine Your Operational Success?, The O and P Edge.
4. Lemon, K., Barnett White, T. and Winer, R. S., Dynamic Customer Relationship Management: Incorporating Future Considerations into the Service Retention Decision, Journal of Marketing.
5. Boyd, J. (September 3, 1976), OODA loop, in Destruction and Creation. Available at http://en.wikipedia.org/wiki/OODA_loop.
6. Doyne, S. (2012), Should Companies Collect Information About You?, NY Times. Available at http://learning.blogs.nytimes.com/2012/02/21/should-companies-collect-information-about-you/.
7. Data, data everywhere (2010), The Economist. Available at http://www.economist.com/node/15557443.
8. IDC Digital Universe (2011). Available at http://chucksblog.emc.com/chucks_blog/2011/06/2011-idc-digital-universe-study-big-data-is-here-now-what.html.
9. Gartner Says Solving 'Big data' Challenge Involves More Than Just Managing Volumes of Data (2011). Available at http://www.gartner.com/it/page.jsp?id=1731916.
10. Gens, F. (2012), IDC Predictions 2012: Competing for 2020. Available at http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf.
11. Bhasin, H., 4Ps of Marketing. Available at http://www.marketing91.com/marketing-mix-4-ps-marketing/.
12. Amazon US site / tablets category (2012). Available at http://www.amazon.com/gp/top-rated/electronics/3063224011/ref=zg_bs_tab_t_tr?pf_rd_p=1374969722&pf_rd_s=right-8&pf_rd_t=2101&pf_rd_i=list&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=14YWR6HBVR6XAS7WD2GG.
13. Godwin, G. (2008), Viral Marketing. Available at http://sethgodin.typepad.com/seths_blog/2008/12/what-is-viral-m.html.
14. Wang, R. (2012), Monday's Musings: Beyond The Three V's of Big data – Viscosity and Virality. Available at http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/.


Harness Big Data Value and Empower Customer Experience Transformation By Zhong Li PhD

Communication Service Providers need to leverage the 3M Framework with a holistic 5C process to extract Big Data value (BDV)

In today's hyper-competitive experience economy, communication service providers (CSPs) recognize that product and price alone will not differentiate their business and brand. Since brand loyalty, retention and long-term profitability are now so closely aligned with customer experience, the ability to understand customers, spot changes in their behavior and adapt quickly to new consumer needs is fundamental to success in the consumer-driven Communication Service Industry.

Increasingly sophisticated digital consumers demand more personalized services through the channel of their choice. In fact, the internet, mobile and, particularly, the rise of social media in the past five years have empowered consumers more than ever before. There is a growing challenge for CSPs that are contending with an increasingly scattered relationship with customers, who can now choose from multiple channels to conduct business interactions. Recent industry research indicates that some 90% of today's consumers in the US and Western Europe interact across multiple channels, representing a moving target that makes achieving a full view of the customer that much more challenging.

To compound this trend, always-on digital customers continuously create more data of various types, from many more touch points with more interaction options. CSPs encounter the "Big data phenomenon" by accumulating significant amounts of customer-related information such as purchase patterns, activities on the website, and interactions via mobile, social media, the network and the call centre.

This Big data phenomenon presents CSPs with challenges along the 3V dimensions (Fig. 1), viz.:

■■ Large Volume: Recent industry research shows that the amount of data the CSP has to manage from consumer transactions and interactions has doubled in the past three years, and its growth is accelerating so as to double in size again in the next two years, much of it coming from new sources including blogs, social media, internet search, and networks [7].

■■ Broad Variety: The type, form and format of data are created in a broad variety. Data is created from multiple channels such as online, call centre, stores and social media including Facebook, Twitter and other social media platforms. It presents itself in a variety of types, comprising structured data from transactions, semi-structured data from call records and unstructured data in multi-media forms from social interactions.

■■ Rapidly Changing Velocity: The always-on digital consumers change the dynamics of data at the speed of light. They equally demand fast responses from CSPs to satisfy their personalized needs in real time.

Figure 1: Big Data in 3Vs is accumulated from Multiple Channels (mobile, web, call centre, store, social). Source: Infosys Research

CSPs of all sizes have learned the hard way that it is very difficult to take full advantage of all of the customer interactions in Big data if they do not know what their customers are demanding or what their relative value to the business is. Even some CSPs that do segment their customers with the assistance of a customer relationship management (CRM) system struggle to take complete advantage of that segmentation in developing a real-time value strategy. In the hyper-sophisticated interaction patterns throughout the customer journey, spanning marketing, research, order, service and retention, Big data sheds a shining light that exposes treasured customer intelligence along the aspects of the 4Is, viz., interest, insight, interaction and intelligence.

■■ Interest and Insight: Customers offer their attention out of interest and share their insights. They visit a web site, make a call, access a retail store, or share a view on social media because they want something from the CSP at that moment – information about a product or help with a problem. These interactions present an opportunity for the CSP to communicate with a customer who is engaged by choice and ready to share information regarding her personalized wants and needs.

■■ Interaction and Intelligence: It is typically crucial for CSPs to target offerings at particular customer segments based on the intelligence of customer data. The success of these real-time interactions – whether through online, mobile, social media, or other channels – depends to a great extent on the CSP's understanding of the customer's wants and needs at the time of the interaction.

Therefore, alongside managing and securing Big data in the 3V dimensions, CSPs face a fundamental challenge: how to explore and harness Big data Value (BDV).

A HOLISTIC 5C PROCESS TO HARNESS BDV

Rising to the challenges and leveraging the opportunity in Big data, CSPs need to harness BDV with predictive models to provide deeper insight into customer intelligence from the profiles, behaviours and preferences hidden in Big data of vast volume and broad variety, and to deliver superior personalized experience with fast velocity in real time throughout the entire customer journey.

In the past decade, most CSPs have invested a significant amount of effort in the implementation of complex CRM systems to manage customer experience. While those CRM systems bring efficiency in helping CSPs deliver on "what" to do in managing historical transactions, they lack the crucial capability of defining "how" to act in time with the most relevant interaction to maximize value for the customer.

CSPs now need to look beyond what CRM has to offer and dive deeper to cover "how" to do things right for the customer: capturing the customer's subjective sentiment in a particular interaction, turning the resulting insight into a prediction of what customers demand from CSPs, and triggering proactive action to satisfy their needs, which is more likely to lead to customer delight and, ultimately, revenues.

To do so, CSPs need to execute a holistic 5C process, i.e., collect, converge, correlate, collaborate and control, in extracting BDV (Fig. 2).

Figure 2: Harness BDV with a Holistic 5C Process (collect, converge, correlate, collaborate and control across product, promotion, customer, order and service). Source: Infosys Research

The holistic 5C process will help CSPs aggregate the whole interaction with a customer across time and channels, supported by a large volume and broad variety of data covering promotion, product, order and services, and relate interactions to the customer's preferences. The context of the customer's relationship with the CSP, and the actual and potential value that she derives, in particular determine the likelihood that the consumer will take particular actions based on real-time intelligence. Big data can help the CSP correlate the customer's needs with product, promotion, order and service, and deliver the right offer at the right time in the appropriate context that she is most likely to respond to.

AN OVERARCHING 3M FRAMEWORK TO EXTRACT BDV

To execute a holistic 5C process for Big data, CSPs need to implement an overarching framework that integrates the various pools of customer-related data residing in the CSP's enterprise systems, creates an actionable customer profile, delivers insight based on that profile during real-time customer interaction events, and effectively matches sales and service resources to take proactive actions, so as to monetize ultimate value on the fly.
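One way to read the 5C process is as a pipeline that every customer interaction event passes through. The sketch below is a schematic interpretation with placeholder logic and invented field names; it is not an implementation of the framework or of any specific CSP product.

# Schematic 5C pipeline over a single (hypothetical) interaction event.

def collect(event):
    # Gather the raw interaction from whatever channel produced it.
    return {"raw": event}

def converge(item):
    # Merge the interaction with profile data held in enterprise systems.
    item["profile"] = {"segment": "postpaid", "tenure_months": 26}
    return item

def correlate(item):
    # Relate the interaction to products, promotions, orders and services.
    item["related_offer"] = "data_pack_upgrade"
    return item

def collaborate(item):
    # Decide which sales/service resource should act on the insight.
    long_tenure = item["profile"]["tenure_months"] > 24
    item["route_to"] = "retention_desk" if long_tenure else "self_service"
    return item

def control(item):
    # Track the outcome so the next cycle can be tuned.
    item["status"] = "offer_presented"
    return item

event = {"channel": "web", "action": "viewed_cancellation_page"}
result = control(collaborate(correlate(converge(collect(event)))))
print(result)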

The overarching framework needs to incorporate 3M modules, i.e., Model, Monitor and Mobilize.

■■ Model Profile: It models the customer profile based on all the transactions, which helps CSPs gain insight at the individual-customer level. Such a profile requires not only integration of all customer-facing systems and enterprise systems, but also integration with all the customer interactions such as email, mobile, online and social in enterprise systems such as OMS, CMS, IMS and ERP, in parallel with the CRM paradigm, to model an actionable customer profile and be able to effectively deploy resources for a distinct customer experience.

■■ Monitor Pattern: It monitors customer interaction events from multiple touch points in real time, dynamically senses and triggers matching patterns of events against the defined policies and set models, and makes suitable recommendations and offers at the right time through an appropriate channel. It enables CSPs to quickly respond to changes in the marketplace (a seasonal change in demand, for example) and bundle offerings that will appeal to a particular customer, across a particular channel, at a particular time.

■■ Mobilize Process: It mobilizes a set of automations that allows customers to enjoy a personalized, engaging journey in real time that spans outbound and inbound communications, sales, orders, service and help intervention, and fulfils the customer's next immediate demand.

The 3M framework needs to be based on an event-driven architecture (EDA) incorporating an Enterprise Service Bus (ESB) and Business Process Management (BPM), and should be application and technology agnostic. It needs to interact with multiple channels using events; match patterns of a set of events with pre-defined policies, rules and analytical models; and deliver a set of automations to fulfil a personalized experience that spans the complete customer lifecycle.

Furthermore, the 3M framework needs to be supported with key high-level functional components, which include:

■■ Customer Intelligence from Big data: A typical implementation of customer intelligence from Big data is the combination of a Data Warehouse and real-time customer intelligence analytics. It requires aggregation of customer and product data from the CSP's various data sources in BSS/OSS, leveraging the CSP's existing investments in data models, workflows, decision tables, user interfaces, etc. It also integrates with the key modules in the CSP's enterprise landscape, covering:

■■ Customer Management: A complete customer relationship management solution combines a 360-degree view of the customer with intelligent guidance and seamless back-office integration to increase first contact resolution and operational efficiency.

■■ Offer Management: CSP-specific specialization and re-use capabilities that define new services, products, bundles, fulfilment processes and dependencies, and rapidly capitalize on new market opportunities and improve customer experience.

■■ Order Management: Configurable best practices for creating and maintaining a holistic order journey that is critical to the success of such product-intensive functions as account opening, quote generation, ordering, contract generation, product fulfilment and service delivery.

■■ Service Management: Case-based work automation and a complete view of each case enable effective management of every case throughout its lifecycle.

■■ Event Driven Process Automation: A dynamic process automation engine empowered with EDA leverages the context of the interaction to orchestrate the flow of activities, guiding customer service representatives (CSRs) and self-service customers through every step in their inbound and outbound interactions, in particular for Campaign Management and Retention Management.

■■ Campaign Management: Outbound interactions are typically used to target products and services at particular customer segments based on analysis of customer data through appropriate channels. It uncovers relevant, timely and actionable consumer and network insights to enable intelligently driven marketing campaigns to develop, define and refine marketing messages, target the customer with a more effective plan, and meet customers at the touch points of their choosing through optimized display and search results, while generating demand via automated email creation, delivery and results tracking.

■■ Retention Management: Customers offer their attention, either intrusively or non-intrusively, to look for the products and services that meet their needs through the channels of their choice. It dynamically captures consumer data from highly active and relevant outlets such as social media, websites and other social sources, and enables CSPs to quickly respond to customer needs and proactively deliver relevant offers for upgrades and product bundles that take into account each customer's personal preference.

■■ Experience Personalization: It provides the customer with a personalized, relevant experience, enabled by business process automation that connects people, processes and systems in real time and eliminates product, process and channel silos. It helps CSPs extend predictive targeting beyond basic cross-sells to automate more of their cross-channel strategies and gain valuable insights from hidden consumer interaction patterns.
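The "Monitor Pattern" and the event-driven automation described above amount to matching incoming interaction events against pre-defined rules and triggering an action when a pattern completes. The toy event matcher below illustrates the idea with invented events and rules; a production system would sit on an ESB/BPM stack rather than an in-memory list.

from collections import defaultdict

# Hypothetical pattern: a dropped call followed by a billing complaint from the
# same customer triggers a proactive retention offer.
PATTERN = ("dropped_call", "billing_complaint")

events = [
    {"customer": "C42", "type": "dropped_call"},
    {"customer": "C17", "type": "web_login"},
    {"customer": "C42", "type": "billing_complaint"},
]

history = defaultdict(list)

def trigger_action(customer):
    # In a real deployment this would push a next-best-action to the
    # appropriate channel; here we just print it.
    print(f"offer retention bundle to {customer}")

def on_event(event):
    seen = history[event["customer"]]
    seen.append(event["type"])
    # Check whether the tail of this customer's history matches the pattern.
    if tuple(seen[-len(PATTERN):]) == PATTERN:
        trigger_action(event["customer"])

for e in events:
    on_event(e)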

Overall, the 3M framework will empower a BDV solution for the CSP to execute on real-time decisions that align individual needs with business objectives and dynamically fulfil the next best action or offer that will increase the value of each personalized interaction.

BDV IN ACTION: CUSTOMER EXPERIENCE OPTIMIZATION

By implementing the proposed BDV solution, CSPs can optimize the customer experience, delivering the right interaction with each customer at the right time so as to build strong relationships, reduce churn, and increase customer value to the business.

■■ From the Customer Experience Perspective: It provides the CSP with real-time, end-to-end visibility into all the customer interaction events taking place across multiple channels by correlating and analyzing these events using a set of business rules, and automatically takes proactive actions which ultimately lead to customer experience optimization. It helps the CSP turn multi-channel contacts with customers into cohesive, integrated interaction patterns, allowing them to better segment their customers and ultimately to take full advantage of that segmentation, delivering personalized experiences that are dynamically tailored to each customer while dramatically improving interaction effectiveness and efficiency.

■■ From the CSP's Perspective: It helps CSPs quickly weed out underperforming campaigns and learn more about their customers and their needs. From retail store to contact centre to Web to social media, it helps CSPs deliver a new standard of branded, consistent customer experiences that build deeper, more profitable and lasting relationships. It enables CSPs to maximize productivity by handling customer interactions as fast as possible in the most profitable channel.

At every point in the customer lifecycle, from marketing campaigns, offers and orders to servicing and retention efforts, BDV helps the CSP inform its interactions with the customer's preferences, the context of her relationship with the business, and her actual and potential value, and enables CSPs to focus on creating personalized experiences that balance the customer's needs with business value.

■■ Campaign Management: BDV delivers campaigns focused on the customer, with predictive modelling and cost-effective campaign automation that consistently distinguishes the brand and supports personalized communications with prospects and customers.

■■ Offer Management: BDV dynamically generates offers that account for such factors as the current interaction with the customer, the individual's total value across product lines, past interactions, and likelihood of defecting. It helps deliver optimal value and increases the effectiveness of propositions with next-best-action recommendations tailored to the individual customer.

■■ Order Management: BDV enables unified process automation applicable to multiple product lines, with agile and flexible workflow, rules and process orchestration that accounts for individual needs in product pricing, configuration, processing, payment scheduling and delivery.

■■ Service Management: BDV empowers customer service representatives to act based on the unique needs and behaviours of each customer, using real-time intelligence combined with holistic customer content and context.

■■ Retention Management: BDV helps CSPs retain more high-value customers with targeted next-best-action dialogues. It consistently turns customer interactions into sales opportunities by automatically prompting customer service representatives to proactively deliver relevant offers that satisfy each customer's unique need.

CONCLUSION

Today's increasingly sophisticated digital consumers expect CSPs to deliver product, service and interaction experiences designed "just for me at this moment." To take on the challenge, CSPs need to deliver customer experience optimization powered by BDV in real time.

By implementing an overarching 3M BDV framework to execute a holistic 5C process, new products can be brought to market with faster velocity and with the ability to easily adapt common services to accommodate unique customer and channel needs.

Suffice it to say that BDV will enable the CSP to deliver a customer-focused experience that matches responses to specific individual demands; provide real-time intelligent guidance that streamlines complex interactions; and automate interactions from end to end. The result is an optimized customer experience that helps CSPs substantially increase customer satisfaction, retention and profitability, and consequently empowers CSPs to evolve into the experience-centric Tomorrow's Enterprise.

REFERENCES
1. IBM Big data solutions deliver insight and relevance for digital media, Solution Brief, June 2012. Available at www-05.ibm.com/fr/events/netezzaDM.../Solutions_Big_Data.pdf.
2. Oracle Big data Premier, Presentation (May 2012). Available at http://premiere.digitalmedianet.com/articles/viewarticle.jsp?id=1962030.
3. SAP HANA™ for Next-Generation Business Applications and Real-Time Analytics (July 2012). Available at http://www.saphana.com/docs/DOC-1507.
4. SAS® High-Performance Analytics (June 2012). Available at http://www.sas.com/reg/gen/uk/hpa?gclid=CJKpvvCJiLQCFbMbtAodpj4Aaw.
5. Transform the Customer Experience with Pega-CRM (2012). Available at http://www.pega.com/sites/default/files/private/Transform-Customer-Experience-with-Pega-CRM-WP-Apr2012.pdf.
6. The Forrester Wave™: Enterprise Hadoop Solutions for Big data, Feb 2012. Available at http://center.uoregon.edu/AIM/uploads/INFOTEC2012/HANDOUTS/KEY_2413506/Infotec2012BigDataPresentationFinal.pdf.
7. Shah, S. (2012), Top 5 Reasons Communications Service Providers Need Operational Intelligence. Available at http://blog.vitria.com/bid/88402/Top-5-Reasons-Communications-Service-Providers-Need-Operational-Intelligence.
8. Connolly, S. and Wooledge, S. (2012), Harnessing the Value of Big data Analytics. Available at http://www.asterdata.com/wc-0217-harnessing-value-bigdata/.


Liquidity Risk Management and Big Data: A New Challenge for Banks By Abhishek Kumar Sinha

Implement a Big Data framework and manage your liquidity risk better

During the 2008 financial crisis, banks faced an enormous challenge in managing liquidity and remaining solvent. As many financial institutions failed, those who survived the crisis fully understood the importance of liquidity risk management. Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). Banks must have reliable data on daily positions and other liquidity measures, and these have to be monitored continuously. At signs of stress, like changes in the liquidity of various asset classes and unfavorable market conditions, banks need to react to these changes in order to remain credible in the market. In banking, liquidity risk and reputation are so heavily linked that even a single liquidity event can lead to catastrophic funding problems for a bank.

MISMANAGEMENT OF LIQUIDITY RISK: SOME EXAMPLES OF FAILURES

Northern Rock was a star-performing UK bank until the 2007 crisis struck. Its source of funding was mostly wholesale funding and capital market funding. Hence in the 2008 crisis, when these funding avenues dried up across the globe, it was unable to fund its operations. During the crisis, the bank's stock fell 32%, along with a depositor run on the bank. The central bank had to intervene and support the bank in the form of deposit protection and money market operations. Later the Government took the ultimate step of nationalizing the bank.

Lehman Brothers had $600 billion in assets before its eventual collapse. The bank's stress testing omitted its riskiest asset -- the commercial real estate portfolio -- which in turn led to misleading stress test results. The liquidity of the bank was very low compared to the balance sheet size and the risks it had taken. The bank had used deposits with clearing banks as assets in its liquidity buffer, which was not in compliance with the regulatory guidelines. The bank lost 73% of its share price during the first half of 2008, and filed for bankruptcy in September 2008.

The 2008 financial crisis has shown that the current liquidity risk management (LRM) approach is highly unreliable in a changing and difficult macroeconomic atmosphere. The need of the hour is to improve operational liquidity management on a priority basis.

THE CURRENT LRM APPROACH AND ITS PAIN POINTS

Compliance/Regulation
Across global regulators, LRM principles have become stricter and more complex in nature. The regulatory focus is mainly on areas like risk governance, measurement, monitoring and disclosure. Hence, the biggest challenge for financial institutions worldwide is to react to these regulatory measures in an appropriate and timely manner. Current systems are not equipped to handle these changes. For example, LRM protocols for stress testing and contingency funding planning (CFP) focus more on the inputs to the scenario analysis and on new stress testing scenarios. These complex inputs need to be very clearly selected, and hence this poses a great challenge for the financial institution.

Siloed Approach to Data Management
Many banks use a spreadsheet-based LRM approach that gets data from different sources which are neither uniform nor comparable. This leads to a great amount of risk in manual processes and to data quality issues. In such a scenario, it becomes impossible to collate an enterprise-wide liquidity position, and the risk remains undetectable.

Lack of Robust LRM Infrastructure
There is a clear lack of a robust system which can incorporate real-time data and generate the necessary actions in time. The various liquidity parameters can be changing funding costs, counterparty risks, balance sheet obligations, and the quality of liquidity in capital markets.

THE NEED FOR A READY-MADE SOLUTION

In a recent SWIFT survey, 91% of respondents indicated that there is a lack of ready-made liquidity risk analytics and business intelligence applications to complement risk integration processes. Since regulations around the globe in the form of Basel III, Solvency II, CRD IV, etc., are taking shape, there is an opportunity to standardize the liquidity reporting process. A solution that can do this can be of great help to banks, as it would save them both effort and time, as well as increase the efficiency of reporting. Banks could then focus solely on the more complex aspects, like the inputs to the stress testing process and the business and strategy to control liquidity risk. Even though there can be differences in the approaches of various banks to managing liquidity, these changes can be incorporated in the solution as per the requirements.

CHALLENGES/SCOPE OF REQUIREMENTS FOR LRM

The scope of requirements for LRM ranges from concentration analysis of liquidity exposures, calculation of the average daily peak of liquidity usage, historical and future views of liquidity flows (both contractual and behavioral in nature), collateral management, stress testing and scenario analysis, generation of regulatory reports, liquidity gaps across buckets, contingency fund planning, net interest income analysis and fund transfer pricing, to capital allocation. All these liquidity measures are monitored and alerts are generated in case thresholds are breached.
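Monitoring these measures against thresholds is straightforward to express in code. The sketch below checks a few invented liquidity metrics against limits and raises alerts; the actual metric definitions and limits would come from the bank's risk policy, not from this example.

# Illustrative liquidity metrics for a reporting date (values are invented).
metrics = {
    "intraday_peak_usage_pct": 87.0,       # % of available intraday liquidity used
    "top10_funding_concentration": 0.34,   # share of funding from 10 largest providers
    "30d_liquidity_gap_musd": -120.0,      # cumulative gap in the 30-day bucket
}

# Thresholds as they might be set by the ALCO / risk committee (hypothetical).
thresholds = {
    "intraday_peak_usage_pct": ("max", 85.0),
    "top10_funding_concentration": ("max", 0.30),
    "30d_liquidity_gap_musd": ("min", -100.0),
}

def check(metrics, thresholds):
    alerts = []
    for name, value in metrics.items():
        direction, limit = thresholds[name]
        breached = value > limit if direction == "max" else value < limit
        if breached:
            alerts.append(f"ALERT: {name} = {value} breaches {direction} limit {limit}")
    return alerts

for alert in check(metrics, thresholds):
    print(alert)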

Concentration analysis of liquidity exposures shows whether the assets or liabilities of the institution are dependent on a certain customer or on a product like asset- or mortgage-backed securities. It also tries to see if the concentration is region-wise, country-wise, or by any other parameter that can be used to detect a concentration in the overall funding and liquidity situation.

Calculation of the average daily peak of liquidity usage gives a fair idea of the maximum intraday liquidity demand, so that the firm can take the necessary steps to manage liquidity in an ideal way. The idea is to detect patterns and, in times of high, low or medium liquidity, utilize the available liquidity buffer in the most optimized way.

Collateral management is very important, as the need for collateral and its value after applying the required haircuts have to be monitored on a daily basis. In case of unfavorable margin calls, the amount of collateral needs to be adjusted to avoid default on various outstanding positions.

Stress testing and scenario analysis are like a self-evaluation for the banks, in which they need to see how bad things can get in case of high-stress events. Internal stress testing is very important to see the amount of loss in case of unfavorable events. For systemically important institutions, regulators have devised some stress scenarios based on past crisis events. These scenarios need to be given as an input to the stress tests and the results have to be given to the regulators. Proper stress testing ensures that the institution is aware of what risk it is taking and what the consequences of the same can be.

Regulatory liquidity reports include Basel III liquidity ratios like the liquidity coverage ratio (LCR) and the net stable funding ratio (NSFR), FSA and Fed 4G guidelines, early warning indicators, funding concentration, liquid asset/collateral reports, and stress testing analysis. Timely completion of these reports in the prescribed format is important for financial institutions to remain compliant with the norms.

Net interest income analysis (NIIA), FTP and capital allocation are performance indicators for an institution that raises money from deposits or other avenues and lends it to customers, or invests it to achieve a rate of return. The NII is the difference between the cost of funds and the interest rate achieved by lending or investing the same. The implementation of FTP links the liquidity risk/market risk to the performance management of the business units. The NII analysis helps in predicting the future state of the P/L statement and balance sheet of the bank.

Contingency fund planning consists of wholesale, retail and other funding reports covering both secured and unsecured funds, so that in case these funding avenues dry up, banks can look for other alternatives. It states the reserve funding avenues, like the use of credit lines, repo transactions, unsecured loans, etc., that can be accessed in a timely manner and at a reasonable cost in a liquidity crisis situation.

Intra-group borrowing and lending reports show the liquidity position across group companies. Derivatives reports related to market value, collateral and cash flows are very important for efficient derivatives portfolio management.
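To give a flavour of what the regulatory ratios mentioned above involve, the snippet below computes a liquidity coverage ratio in its simplest textbook form: the stock of high-quality liquid assets divided by net cash outflows over a 30-day stress horizon. It deliberately ignores the per-asset haircuts and run-off factors of the actual Basel III rules, so treat it purely as an illustration with invented numbers.

# Illustrative balance-sheet inputs in millions (invented numbers).
hqla = {"level_1": 4200.0, "level_2": 800.0}

expected_outflows_30d = 5600.0   # stressed outflows over the next 30 days
expected_inflows_30d = 1900.0    # stressed inflows over the same horizon

def liquidity_coverage_ratio(hqla, outflows, inflows):
    # Basel III caps recognisable inflows at 75% of outflows; other refinements
    # (haircuts per asset level, run-off factors per liability type) are omitted.
    net_outflows = outflows - min(inflows, 0.75 * outflows)
    return sum(hqla.values()) / net_outflows

lcr = liquidity_coverage_ratio(hqla, expected_outflows_30d, expected_inflows_30d)
print(f"LCR = {lcr:.0%} (regulatory minimum is 100%)")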

Bucket-wise and cumulative liquidity gaps under business-as-usual and stress scenarios give a fair idea of varying liquidity across time buckets. Both contractual and behavioral cash flows are tracked to get the final inflow and outflow picture. This is done over different time periods, from 30 days to 3 years, to get a long-term as well as a short-term view of liquidity. Historic cash flows are tracked as they help in modeling future behavioral cash flows; historical assumptions plus current market scenarios are very important in the dynamic analysis of behavioral cash flows. Other important reports relate to the available pool of unencumbered assets and non-marketable assets.
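A bucket-wise liquidity gap report of the kind described above can be assembled by netting the cash flows per maturity bucket and accumulating them. The sketch below uses a handful of invented cash flows and assumes behavioural adjustments are already reflected in the amounts.

from collections import OrderedDict

# Hypothetical cash flows: (bucket, amount) with inflows positive, outflows negative.
cash_flows = [
    ("0-30d", 1500.0), ("0-30d", -2100.0),
    ("31-90d", 900.0), ("31-90d", -400.0),
    ("91d-1y", 2500.0), ("91d-1y", -1800.0),
    ("1y-3y", 3200.0), ("1y-3y", -2900.0),
]

buckets = OrderedDict((b, 0.0) for b in ["0-30d", "31-90d", "91d-1y", "1y-3y"])
for bucket, amount in cash_flows:
    buckets[bucket] += amount

cumulative = 0.0
for bucket, gap in buckets.items():
    cumulative += gap
    print(f"{bucket:>7}: gap {gap:8.1f}  cumulative {cumulative:8.1f}")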

All the scoped requirements can only be satisfied when the firm has a framework in place to take the necessary decisions related to liquidity risk. Hence, we next look at an LRM framework as well as a data governance framework for managing liquidity risk data.

LRM FRAMEWORK

A separate group for LRM, constituted of members from the asset liability committee, the risk committee and top management, needs to be formed. This group must function independently of the other groups in the firm and must have the autonomy to take liquidity decisions. Strategic level planning helps in defining the liquidity risk policy in a clear manner, related to the overall business strategy of the firm. The risk appetite of the firm needs to be stated in measurable terms and communicated to all the stakeholders in the firm. Liquidity risks across the business need to be identified, and the key risk indicators and metrics decided. Risk indicators are to be monitored on a regular basis, so that preemptive steps can be taken in the case of an upcoming stress scenario. Monitoring and reporting are to be done for internal control as well as for regulatory compliance. Finally, there has to be a periodic analysis of the whole system in order to identify possible gaps in it; the frequency of review has to be at least once a year, and more frequent in case of extreme market scenarios.

Figure 1: Iterative framework for effective liquidity risk management (Corporate Governance and Strategic Level Planning; Identify & Assess Liquidity Risk; Monitor & Report; Take Corrective Measures; Periodic Analysis for Possible Gaps). Source: Infosys Research

To satisfy the scoped-out requirements, data from various sources is used to form a liquidity data warehouse and datamart, which act as inputs to the analytical engines. The engines contain the business rules and logic based on which the key liquidity parameters are calculated. All the analysis is presented in report and dashboard form for regulatory compliance and internal risk management, as well as for decision-making purposes.

Figure 2: LRM data governance framework for Analytics and BI with Big data capabilities. Data sources (market data, reference data, collateral, deposits, loans, securities, general ledger, external data) feed a Big data staging and data-quality layer; ETL loads a data warehouse and datamart; analytical engines cover asset liability management, fund transfer pricing, and liquidity risk and capital calculation; and a Reporting/BI layer produces regulatory reports (Basel ratios such as NSFR and LCR, FED 4G, FSA, stress testing, regulatory capital allocation) and internal liquidity reports (NII analysis, ALM, FTP and liquidity costs, funding concentration, liquid assets, capital allocation and planning, internal stress tests, key risk indicators). Source: Infosys Research

Some Uses of Big data Application in LRM

1. Staging Area Creation for the Data Warehouse: A Big data application can store huge volumes of data and perform some analysis on it, along with aggregating data for further analysis. Due to its fast processing of large amounts of data, it can be used as a loader to load data into the data warehouse, as well as to facilitate the extract-transform-load (ETL) processes (see the sketch after this list).

2. Preliminary Data Analysis: Data can be moved in from various sources and a visual analytics tool then used to create a picture of what data is available and how it can be used.

3. Making the Full Enterprise Data Available for High-performance Analytics: Analytics at large firms was often limited to the sample set of records on which the analytical engines would run and provide certain results; but as a Big data application provides distributed parallel processing capacity, the limitation on the number of records is non-existent now. Billions of records can now be processed at increasingly amazing speeds.
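As referenced in the first item above, one way to sketch a Big data engine being used as a staging and ETL layer is shown below. It assumes PySpark is available and uses invented file paths and column names; an actual implementation would add data-quality checks and reconciliation against the general ledger.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lrm-staging").getOrCreate()

# Land raw position feeds (paths are placeholders).
deposits = spark.read.option("header", True).csv("/staging/raw/deposits.csv")
loans = spark.read.option("header", True).csv("/staging/raw/loans.csv")

# Basic cleansing in the staging layer: drop records without an account id
# and normalise the balance column to a numeric type.
def cleanse(df):
    return (df.where(F.col("account_id").isNotNull())
              .withColumn("balance", F.col("balance").cast("double")))

positions = cleanse(deposits).unionByName(cleanse(loans))

# Aggregate to the grain expected by the liquidity data warehouse and load it.
daily_positions = positions.groupBy("as_of_date", "product_type").agg(
    F.sum("balance").alias("total_balance"))

daily_positions.write.mode("overwrite").parquet("/warehouse/liquidity/daily_positions")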

HOW BIG DATA CAN HELP IN LRM ANALYTICS AND BI

■■ Operational efficiency and swiftness is an area where high-performance analytics can help achieve faster decision making, because all the required analysis is obtained much faster.

■■ Liquidity risk is a killer in today's financial world and is most difficult to track, as large banks have diverse instruments and a large number of scenarios need to be analyzed, such as changes in interest rates, exchange rates, and liquidity and depth in markets worldwide; for such dynamic analysis Big data analytics is a must.

■■ Stress testing and scenario analysis both require intensive computing, as a lot of data is involved; faster scenario analysis means quicker action in case of stressed market conditions. With Big data capabilities, scenarios that would otherwise take hours to run can now be run in minutes, aiding quick decision making and action.

■■ Efficient product pricing can be achieved by implementing a real-time fund transfer pricing system and profitability calculations. This ensures the best possible pricing of market risks along with adjustments like the liquidity premium across the business units.
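The claim that scenarios which took hours can be turned around in minutes rests on running them in parallel over partitioned data. The toy example below fans a set of invented stress scenarios out over local worker processes; a real implementation would distribute the revaluation jobs over a cluster rather than local cores.

from concurrent.futures import ProcessPoolExecutor

# Hypothetical portfolio value and stress scenarios (shock = relative loss on assets).
PORTFOLIO_VALUE = 10_000.0
SCENARIOS = {
    "rates_up_200bp": 0.04,
    "fx_shock": 0.06,
    "market_liquidity_dry_up": 0.11,
    "combined_severe": 0.18,
}

def run_scenario(item):
    name, shock = item
    # Stand-in for a heavy revaluation job over the full position data set.
    stressed_value = PORTFOLIO_VALUE * (1.0 - shock)
    return name, PORTFOLIO_VALUE - stressed_value

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for name, loss in pool.map(run_scenario, SCENARIOS.items()):
            print(f"{name:>25}: loss {loss:8.1f}")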

CONCLUSION

The LRM system is key for a financial institution to survive in competitive and highly unpredictable financial markets. The whole idea of managing liquidity risk is to know the truth and be ready for the worst market scenarios. This predictability is what is needed, and it can save a bank in times like the 2008 crisis. Even at the business level, a proper LRM system can help in better product pricing using FTP, so that pricing can be logical and transparent.

Traditionally, data has been a headache for banks and is seen more as a compliance and regulation requirement, but going forward there are going to be even more stringent regulations and reporting standards across the globe. After the crisis of 2008, new Basel III liquidity reporting standards and newer scenarios for stress testing have been issued that require extensive data analysis, which can only be done in a timely manner with Big data applications. All in the banking industry know that the future is uncertain and high margins will always be a challenge, so efficient data management along with Big data capabilities needs to be in place. This will add value to the bank's profile through a clear focus on new opportunities for banks and bring predictability to their overall businesses.

Successful banks in the future will be the ones who take LRM initiatives seriously and implement the system successfully. Banks with an efficient LRM system will build a strong brand and reputation in the eyes of investors, customers, and regulators around the world.

REFERENCES
1. Banking on Analytics: How High-Performance Analytics Tackle Big data Challenges in Banking (2012), SAS white paper. Available at http://www.sas.com/resources/whitepaper/wp_42594.pdf.
2. New regime, rules and requirements: welcome to the new liquidity, Basel III: implementing liquidity requirements, Ernst & Young (2011).
3. Leveraging Technology to Shape the Future of Liquidity Risk Management, Sybase, Aite Group study, July 2010.
4. Managing liquidity risk: Collaborative solutions to improve position management and analytics (2011), SWIFT white paper.
5. Principles for Sound Liquidity Risk Management and Supervision, BIS Document (2008).
6. Technology Economics: The Cost of Data, Howard Rubin, Wall Street and Technology website. Available at http://www.wallstreetandtech.com/data-management/231500503.


Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor By Anil Radhakrishnan and Kiran Kalmadi

Diagnose, customize and administer health care on real time using BDMEiC

Imagine a world where the day-to-day data about an individual's health is tracked, transmitted, stored and analyzed on a real-time basis; where, worldwide, diseases are diagnosed at an early stage without the need to visit a doctor; and, lastly, a world where every individual has a 'life certificate' that contains all their health information, updated on a real-time basis. This is the world to which Big data can lead us.

Given the amount of data generated every day by the human body, e.g., body vitals, blood samples, etc., it is a haven for generating Big data. Analyzing this Big data in healthcare is of prime importance. Big data analytics can play a significant role in the early detection and advanced diagnosis of fatal diseases, which can reduce healthcare cost and improve quality.

Hospitals, medical universities, researchers and insurers will be positively impacted by applying analytics to this Big data. However, the principal beneficiaries of analyzing this Big data will be the Government, patients and therapeutic companies.

RAMPANT HEALTHCARE COSTS

A look at the healthcare expenditure of countries like the US and UK automatically explains the burden that healthcare places on the economy. As per data released by the Centers for Medicare and Medicaid Services, health expenditure in the US is estimated to have reached $2.7 trillion, or over $8,000 per person [1]. By 2020, this is expected to balloon to $4.5 trillion [2]. These costs will have a huge bearing on an economy that is struggling to get up on its feet, having just come out of a recession.

According to the Office for National Statistics in the UK, healthcare expenditure in the UK amounted to £140.8 billion in 2010, up from £136.6 billion in 2009 [3]. With rising healthcare costs, countries like Spain have already pledged to save €7 billion by slashing health spending, while also charging more for drugs [5]. Middle-income earners will now have to pay more for drugs.

This increase in healthcare costs is not isolated to a few countries alone. According to World Health Organization statistics released

in 2011, per capita total expenditure on health jumped from US$ 566 to US$ 899 between 2000 and 2008, an alarming increase of 58% [4]. This huge increase is testimony to the fact that, far from increasing steadily, healthcare costs have been increasing exponentially.

While healthcare costs have been increasing, the data generated through body vitals, lab reports, prescriptions, etc. has also been increasing significantly. Analysis of this data will lead to better and more advanced diagnosis, early detection and more effective drugs, which in turn will result in a significant reduction in healthcare costs.

HOW BIG DATA ANALYTICS CAN HELP REDUCE HEALTHCARE COSTS

Analysis of the 'Big data' generated from various real-time patient records possesses a lot of potential for creating quality healthcare at reduced costs. Real time refers to data like body temperature, blood pressure, pulse/heart rate, and respiratory rate that can be generated every 2-3 minutes. This data collected across individuals provides the volume of data at a high velocity, while also providing the required variety since it is obtained across geographies. The analysis of this data can help in reducing costs by enabling real-time diagnosis, analysis and medication, which offers

■■ Improved insights into drug effectiveness
■■ Insights for early detection of diseases
■■ Improved insights into the origins of various diseases
■■ Insights to create personalized drugs.

These insights that Big data analytics provides are unparalleled and go a long way in reducing the cost of healthcare.

USING BIG DATA ANALYTICS FOR PERSONALIZING DRUGS

The patents of many high-profile drugs are ending by 2014. Hence, therapeutic companies need to examine the response of patients to these drugs to help create personalized drugs. Personalized drugs are those that are tailored to an individual patient. Real-time data collected from various patients will help generate Big data, the analysis of which will help identify how individual patients reacted to the drugs administered to them. Through this analysis, therapeutic companies will be able to create personalized drugs custom-made for an individual. The personalized drug is one of the important solutions that Big data analytics will have the power to offer. Imagine a situation where analytics will help determine the exact amount and type of medicine that an individual would require, even without them having to visit a doctor. That is the direction in which Big data analytics in healthcare has to move. In addition, the analytics of this data can also significantly reduce healthcare costs that run into billions of dollars every year.

BIG DATA ANALYTICS FOR REAL TIME DIAGNOSIS USING BIG DATA MEDICAL ENGINE IN THE CLOUD (BDMEIC)

Big data analytics for real-time diagnosis is characterized by real-time Big data analytics systems. These systems contain a closed-loop feedback system, where insights from the application of the solution serve as feedback for further analysis (refer Figure 1). Access to real-time data provides a quick way to accumulate and create Big data. The closed-loop feedback system is important because it helps the system in building its intelligence.

[Figure 1: Real Time Big Data Analytics System. A closed loop: real-time medical data flows into a real-time Big data analytics system; analysis of the real-time data yields newer insights and new solutions, which feed back into the system. Source: Infosys Research]

These systems can not only help to monitor patients in real time but can also be used to provide diagnosis, detect diseases early and deliver medication in real time. This can be achieved through a Big data Medical Engine in the Cloud (BDMEiC) [Fig. 2]. This solution would consist of:

■■ Two medical patches (arm and thigh)
■■ Analytics engine
■■ Smartphone
■■ Data Center.

As depicted above, the BDMEiC solution consists of the following:

1. Arm and thigh based electronic medical patch: An arm based electronic medical patch (these patches are thin, lightweight, elastic and have embedded sensors) is strapped to the arm of an individual. It reads vitals like body temperature, blood pressure, pulse/heart rate, and respiratory rate to monitor brain, heart, muscle activity, etc. The patch then transmits this real-time data to the individual's smartphone, which is synced with the patch. The extraction of the data happens at regular intervals (every 2-3 minutes). The smartphone transmits the real-time data to the data center in the medical engine. The thigh based electronic medical patch is used for providing medication. The patch comes with a drug cartridge (pre-loaded drugs) that can be inserted into a slot in the patch. When it receives data from the smartphone, the device can provide the required medication to the patient through auto-injectors that are a part of the drug cartridge.

2. Data Center: The data center is the Big data cloud storage that receives real-time data from the medical patch and stores it. This data center will be a repository of real-time data received from different individuals across geographies. This data is then transmitted to the Big data analytics engine.

3. Big Data Analytics Engine: The Big data analytics engine performs three major functions: analyzing data, sharing analyzed data with organizations, and transmitting medication instructions back to the smartphone.

• Analyzing Data: It analyzes the data (body temperature, blood pressure, pulse/heart rate, respiratory rate, etc.) received from the data center using its inbuilt medical intelligence, across individuals. As the system keeps analyzing this data it also keeps building on its intelligence.

• Sharing Analyzed Data: The analytics engine also transmits its analysis to various universities, medical centers, therapeutic companies and other related organizations for further research.

• Transmitting Medication Instructions: The analytics engine can also transmit medication instructions to an individual's smartphone, which in turn transmits data to the thigh patch whenever medication has to be provided.

The BDMEiC solution can act as a real-time doctor that diagnoses, analyzes, and provides personalized medication to individuals. Such a solution that harnesses the potential of Big data provides manifold benefits to various beneficiaries.
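The analyze-and-respond loop of the analytics engine can be pictured with a small, purely illustrative sketch. The thresholds, field names and medication instruction format below are invented for illustration only and are not the BDMEiC's actual medical intelligence, which would be far richer than a static range check.

# Illustrative sketch of the analyze-and-respond loop (assumed field names and
# thresholds; not actual medical logic).
PERMISSIBLE_LIMITS = {
    "body_temperature_c": (36.1, 37.8),
    "systolic_bp_mmhg": (90, 140),
    "heart_rate_bpm": (50, 110),
    "respiratory_rate_pm": (10, 22),
}

def analyze_reading(reading):
    """Return the vitals in this reading that fall outside permissible limits."""
    out_of_range = {}
    for vital, (low, high) in PERMISSIBLE_LIMITS.items():
        value = reading.get(vital)
        if value is not None and not (low <= value <= high):
            out_of_range[vital] = value
    return out_of_range

def medication_instruction(patient_id, out_of_range):
    """Build a (hypothetical) instruction to be sent back to the smartphone/thigh patch."""
    return {
        "patient_id": patient_id,
        "alert": list(out_of_range),
        "action": "notify_doctor_and_dispense_preloaded_dose",
    }

# One simulated reading arriving from the arm patch every 2-3 minutes.
reading = {"patient_id": "p-001", "systolic_bp_mmhg": 168, "heart_rate_bpm": 96}
flagged = analyze_reading(reading)
if flagged:
    print(medication_instruction(reading["patient_id"], flagged))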

BENEFITS AND BENEFICIARIES OF BDMEIC

The BDMEiC solution, if adopted in a large scale manner, can offer a multitude of benefits, a few of which are listed below.

[Figure 2: Big Data Medical Engine in the Cloud (BDMEiC): patient patches and smartphone feed the Data Center (2) and Analytics Engine (3) in the medical engine; analysis is shared with organizations (medical labs, medical universities, medical research centers, therapeutic companies) and real-time medication is delivered back to the patient. Source: Infosys Research]

Real Time Medication
With the analytics engine monitoring patient data in real time, the diagnosis and treatment of patients in real time is possible. With the data being shared with top research facilities and medical institutions in the world, the diagnosis and treatment would be more effective and accurate.

Specific Instances: Blood pressure data can be monitored in real time and stored in the data center. The analysis of this data by the analytics engine can keep the patient as well as the doctor updated in real time if the blood pressure moves beyond permissible limits.

Beneficiaries: Patients, medical institutions and research facilities.

Convenience
The BDMEiC solution offers convenience to patients, who would not always be in a position to visit a doctor.

Specific Instances: Body vitals can be measured and analyzed with the patient being at home. This especially helps in the case of senior citizens and busy executives, who can now be diagnosed and treated right at home or while on the move.

Beneficiaries: Patients.

Insights into Drug Effectiveness
The system allows doctors, researchers and therapeutic companies to understand the impact of their drugs in real time. This helps them to create better drugs in the future.

Specific Instances: The patents of many high profile drugs are ending by 2014.

Therapeutic companies can use BDMEiC to perform real-time Big data analysis to understand their existing drugs better, so that they can create better drugs in the future.

Beneficiaries: Doctors, researchers and therapeutic companies.

Early Detection of Diseases
As BDMEiC monitors, stores, and analyzes data in real time, it allows medical researchers, doctors and medical labs to detect diseases at an early stage. This allows them to provide an early cure.

Specific Instances: Early detection of diseases like cancer, childhood pneumonia, etc., using BDMEiC can help provide medication at an early stage, thereby increasing the survival rate.

Beneficiaries: Researchers, medical labs and patients.

Improved Insights into Origins of Various Diseases
With BDMEiC storing and analyzing real-time data, researchers get to know the cause and symptoms of a disease much better and at an early stage.

Specific Instances: Newer strains of viruses can be monitored and researched in real time.

Beneficiaries: Researchers and medical labs.

Insights to Create Personalized Drugs
Real-time data collected from BDMEiC will help doctors administer the right dose of drugs to patients.

Specific Instances: Instead of a standard pill, patients can be given the right amount of drugs, customized according to their needs.

Beneficiaries: Patients and doctors.

Reduced Costs
Real-time data collected from BDMEiC assists in the early detection of diseases, thereby reducing the cost of treatment.

Specific Instances: Early detection of cancer and other life threatening diseases can lead to lesser spending on healthcare.

Beneficiaries: Government and patients.

CONCLUSION

The present state of the healthcare system leaves a lot to be desired. Healthcare costs are spiraling and forecasts suggest that they are not poised to come down any time soon. In such a situation, organizations the world over, including governments, should look to harness the potential of real-time Big data analytics to provide high quality and cost effective healthcare. The solution proposed in this paper tries to utilize this potential to bridge the gap between medical research and the final delivery of the medicine.

REFERENCES
1. US Food and Drug Administration, 2012.
2. National Health Expenditure Projections 2011-2021 (January 2012), Centers for Medicare & Medicaid Services, Office of the Actuary. Available at http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/Proj2011PDF.pdf.
3. Jurd, A. (2012), Expenditure on healthcare in the UK 1997-2010, Office for National Statistics. Available at http://www.ons.gov.uk/ons/dcp171766_264293.pdf.

4. World Health Statistics 2011, World Health Organization. Available at http://www.who.int/whosis/whostat/EN_WHS2011_Full.pdf.
5. The Ministry of Health, Social Policy and Equality, Spain. Available at http://www.msssi.gob.es/ssi/violenciaGenero/publicaciones/comic/docs/PilladaIngles.pdf.

Infosys Labs Briefings VOL 11 NO 1 2013

Big Data Powered Extreme Content Hub By Sudheeshchandran Narayanan and Ajay Sadhu

Taming the Big content explosion and providing contextual and relevant information is the need of the day

Content is getting bigger by the minute and smarter by the second [5]. As content grows in size and becomes varied in structure, discovery of valuable and relevant content becomes a challenge. Existing Enterprise Content Management (ECM) products are limited by scalability, variety, rigid schemas, and limited indexing and processing capability. Content enrichment is often an external activity and not often deployed. The content manager is more like a content repository and is used primarily for search and retrieval of the published content. Existing content management solutions can handle only a few data formats and provide very limited capability with respect to content discovery and enrichment.

With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. As the next generation of users will rely heavily on new modes of interacting with the content, e.g., mobile devices and tablets, there is a need to re-look at traditional content management strategies. Artificial intelligence will now play a key role in information retrieval, information classification and usage for these sophisticated users. To facilitate the usage of Artificial Intelligence on this Big Content, knowledge about entities, domains, etc., needs to be captured, processed, reused and interpreted by the computer. This has resulted in the formal specification and capture of the structure of the domain, called ontologies, the classification of these entities within the domain into predefined categories, called taxonomy, and the inter-relating of them to create the semantic web (web of data).

The new breed of content management solutions needs to bring in elastic indexing, distributed content storage and low latency to address these changes. But the story does not end there.

The ease of deploying technologies like natural language text analytics and machine learning now takes this new breed of content management to the next level of maturity. Time is of the essence for everyone today. Contextual filtering of the content based on relevance is an immediate need. There is a need to organize content, create new taxonomy, and create new links and relationships beyond what is specified. The next generation of content management solutions should leverage ontologies, the semantic web and linked data to derive the context of the content and enrich the content metadata with this context. Then, leveraging this context, the system should provide real-time alerts as the content arrives.

In this paper, we discuss the details of the extreme content hub and its implementation semantics, technology viewpoint and use cases.

THE BIG CONTENT PROBLEM IN TODAY'S ENTERPRISES

Legacy Content Management Systems (CMS) have focused on addressing the fundamental problems in content management, i.e., content organization, indexing, and searching. With the internet evolution, these CMSs added Content Publishing Lifecycle Management (CPLM) and workflow capabilities to the overall offering. The focus of these ECM products was towards providing a solution for enterprise customers to easily store and retrieve various documents and provide a simplified search interface. Some of these solutions evolved to address the web publishing problem. These existing content management solutions have constantly shown performance and scalability concerns. Enterprises have invested in high end servers and hired performance engineering experts to address this. But will this last long?

[Figure 1: Augmented Capabilities of Extreme Content Hub Manager: Heterogeneous Content Ingestion, Automated Content Discovery, Content Enrichment, Unified Intelligent Content Access and Insights, and a Highly Available Elastic Scalable System around core features (indexing, search, workflow, metadata repository, content versioning). Source: Infosys Research]

With the arrival of Big data (volume, variety and velocity), these problems have amplified further and the need for next generation capabilities for content management has evolved further. Requirements and demand have gone beyond just storing, searching and indexing traditional documents. Enterprises need to store a wide variety of content ranging from documents, videos, social media feeds, blog posts, podcasts, images, etc. Extraction, enrichment, organization and management of semi-structured, unstructured and multi-structured content and media are a big challenge today. Enterprises are under tremendous competitive pressure to derive meaningful insights from these piles of information assets and derive business value from this Big data. Enterprises are looking for contextual and relevant information at lightning speed. The ECM solution must address all of the above technical and business requirements.

EXTREME CONTENT HUB: KEY CAPABILITIES

Key capabilities required for the Extreme Content Hub (ECH), apart from the traditional indexing, storage and search capabilities, can be classified in the following five dimensions (Fig. 2).

Heterogeneous Content Ingestion that provides input adapters to bring a wide variety of content (documents, videos, images, blogs, feeds, etc.) into the content hub seamlessly. The next generation content management system needs to support real-time content ingestion for RSS feeds, news feeds, etc. and support streams of events to be ingested as one of the key capabilities for content ingestion.

Automated Content Discovery that extracts the metadata and classifies the incoming content seamlessly against pre-defined ontologies and taxonomies.

Scalable, Fault-tolerant Elastic System that can seamlessly expand to the demands of volume, velocity and variety growth of the content.

Content Enrichment services that leverage machine learning and text analytics technologies to enrich the context of the incoming content.

Unified Intelligent Content Access that provides a set of content access services that are context aware and based on information relevance through user modeling and personalization.

To realize the ECH, there is a need to augment the existing search and indexing technologies with the next generation of machine learning and text analytics to bring in a cohesive platform. The existing content management solution still provides quite a good list of features that cannot be ignored.

BIG DATA TECHNOLOGIES: RELEVANCE FOR THE CONTENT HUB

With the advent of Big data, the technology landscape has made a significant shift. Distributed computing has now become a key enabler for large scale data processing, and with open source contributions this has received a significant boost in recent years. Year 2012 has been the year of large scale Big data technology adoption. The other significant advancement has been in NoSQL (Not Only SQL) technology, which complements existing RDBMS systems for scalability and flexibility.

[Figure 2: Extreme Content Hub: heterogeneous ingestion (social feeds, log feeds, existing enterprise content, news/alerts/RSS feeds) into a metadata driven augmented CM processing framework with content classification, metadata extraction, machine learning based auto-classification and recommendation, a rule engine, index and link storage (HBase) on a distributed file system (Hadoop), content processing workflows, and unified enterprise content access through alerts, content API services, dashboards and search services. Source: Reference [12]]

Scalable near real-time access provided by these systems has boosted the adoption of distributed computing for real-time data storage and indexing needs. Scalable and elastic deployments enabled by the advancement in private and public cloud deployments have accelerated the adoption of distributed computing in enterprises. Overall, there is a significant change from our earlier approaches of solving the ever increasing data and performance problem by throwing more hardware at it. Today, the ability to deploy a scalable distributed computing infrastructure that not only addresses the velocity, variety and volume problem but also provides a cost effective alternative using open source technologies makes the business case for building the ECH. The solution to the problem is to augment the existing content management solution with the processing capabilities of Big data technologies to create a comprehensive platform that brings in the best of both worlds.

REALIZATION OF THE ECH

The ECH requires a scalable, fault tolerant, elastic system that provides scalability on storage, compute and network infrastructure. Distributed processing technologies like Hadoop provide the foundation platform for this. A private cloud based deployment model will provide the on-demand elasticity and scale that is required to set up such a platform.

A metadata model driven ingestion framework could ingest a wide variety of feeds into the hub seamlessly. Content ingestion could deploy content security tagging during the ingestion process to ensure that the content stored inside the hub is secured and authorized before access.

NoSQL technologies like HBase and MongoDB could provide the scalable metadata repository needs for the system.
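A minimal sketch of what metadata model driven ingestion with security tagging could look like is given below. It assumes a locally running MongoDB instance reachable through pymongo, and it invents the feed item schema, the tagging rule and the database/collection names; it is meant only to make the ingestion idea concrete, not to describe an actual ECH implementation.

# Sketch: ingest a feed item, derive basic metadata, attach security tags and
# persist the metadata in a NoSQL store (MongoDB via pymongo is assumed here).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
metadata_repo = client["ech"]["content_metadata"]   # invented database/collection names

def ingest(item):
    """item: dict with 'source', 'content_type', 'body' (invented feed schema)."""
    metadata = {
        "source": item["source"],
        "content_type": item["content_type"],
        "ingested_at": datetime.now(timezone.utc),
        "size_bytes": len(item["body"].encode("utf-8")),
        # Security tagging during ingestion: a simple illustrative rule.
        "security_tags": ["internal-only"] if item["source"] == "log-feed" else ["public"],
    }
    metadata_repo.insert_one(metadata)
    return metadata

ingest({"source": "rss", "content_type": "news", "body": "Quarterly results announced ..."})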

Search and indexing technologies have matured to the next level after the advent of Web 2.0, and deploying a scalable indexing service like Solr, Elastic Search, etc., provides the much needed scalable indexing and search capability required for the platform.

Deploying machine learning algorithms leveraging Mahout and R on this platform can bring in auto-discovery of the content metadata and auto-classification for content enrichment.
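The paper points to Mahout and R for this; as a stand-in, the sketch below uses Python and scikit-learn to show the same auto-classification idea, with a classifier trained on a handful of labelled snippets assigning an incoming document to a taxonomy node. The taxonomy labels and training snippets are invented, and a production classifier would obviously need a real training corpus.

# Stand-in sketch for auto-classification of incoming content against a taxonomy
# (the paper suggests Mahout/R; scikit-learn is used here purely for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: (snippet, taxonomy category).
training = [
    ("quarterly earnings revenue guidance", "finance"),
    ("merger acquisition shareholder vote", "finance"),
    ("clinical trial drug approval dosage", "healthcare"),
    ("patient outcomes hospital treatment", "healthcare"),
    ("cluster nodes distributed storage latency", "technology"),
    ("machine learning model training data", "technology"),
]
texts, labels = zip(*training)

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(texts, labels)

incoming = "new drug shows improved patient outcomes in phase three trial"
print(classifier.predict([incoming])[0])   # expected: 'healthcare'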

De-duplication and other value added services can be seamlessly deployed as batch frameworks on the Hadoop infrastructure to bring value added context to the content. Machine learning and text analytics technologies can be further leveraged to provide recommendation and contextualization of the user interactions, delivering unified context aware services.

BENEFITS OF ECH

The ECH is at the center of enterprise knowledge management and innovation. Serving contextual and relevant information to users will be one of the fundamental usages of the ECH. Auto-indexing will help discover multiple facets of the content and help in discovering new patterns and relationships between the various entities that would have otherwise gone unnoticed in the legacy world. The integrated metadata view of the content will help in building a 360 degree view of a particular domain or entity from the various sources.

The ECH could enable discovery of user tastes and likings based on the content searched and viewed. This could serve real-time recommendations to users through content hub services. This could help the enterprise in specific user behavior modeling. Emerging trends in the various domains can be discovered as content gets ingested into the hub.

The ECH could extend as an analytics platform for video and text analytics. Real-time information discovery can be facilitated using pre-defined alerts/rules which could get triggered as new content arrives in the hub. The derived metadata and context could be pushed to the existing content management solution, deriving benefit from the investments made in the existing products and platforms while augmenting their processing and analytics capabilities with new technologies. The ECH will now be able to handle large volumes and a wide variety of content formats, and bring in deep insights leveraging the power of machine learning. These solutions will be very cost effective and will also leverage existing investment in the current CMS.

CONCLUSION

There is a need to take a platform centric approach to this Big content problem rather than a standalone content management solution. There is a need to look at it strategically and adopt a scalable architecture platform to address this. However, such an initiative doesn't need to replace the existing content management solutions, but rather to augment their capabilities to fill in the required white spaces. The approach discussed in this paper provides one such implementation of the augmented content hub leveraging the current advancement in Big data technologies. Such an approach will provide the enterprise with a competitive edge in years to come.

REFERENCES
1. Agichtein, E., Brill, E. and Dumais, S. (2006), Improving web search ranking by incorporating user behavior. Available at http://research.microsoft.com/en-us/um/people/sdumais/.

2. Dumais, S. (2011), Temporal Dynamics and Information Retrieval. Available at http://research.microsoft.com/en-us/um/people/sdumais/.
3. Reamy, T. (2012), Taxonomy and Enterprise Content Management. Available at http://www.kapsgroup.com/presentations.shtml.
4. Reamy, T. (2012), Enterprise Content Categorization – How to Successfully Choose, Develop and Implement a Semantic Strategy. Available at http://www.kapsgroup.com/presentations/ContentCategorization-Development.pdf.
5. Barroca, E. (2012), Big data's Big Challenges for Content Management, TechNewsWorld. Available at http://www.technewsworld.com/story/74243.html.

Infosys Labs Briefings VOL 11 NO 1 2013

Complex Events Processing: Unburdening Big Data Complexities By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur

Analyze, crunch and detect unforeseen conditions in real time through CEP of Big Data

A study by The Economist revealed that 1.27 zettabytes was the amount of information in existence in 2010 as household data [1]. The Wall Street Journal reported Big data as the new boss in all key sectors such as education, retail and finance. But on the other side, an average Fortune 500 enterprise is estimated to have around 10 years' worth of customer data, with more than two-thirds of it unusable. How can enterprises make such an explosion of data usable and relevant? Not trillions but quadrillions of data points await analysis overall, and the volume is expected to increase exponentially, which evidently impacts businesses worldwide. Additionally, the problem is one of providing speedier results, and results are expected to get slower with more data to analyze unless technologies innovate at the same pace.

Any function or business, whether it is road traffic control, high frequency trading, auto adjudication of insurance claims or controlling the supply chain logistics of electronics manufacturing, requires huge data sets to be analyzed as well as timely processing and decision making. Any delay, even in seconds or milliseconds, affects the outcome. Significantly, technology should be capable of interpreting historical patterns, applying them to current situations and taking accurate decisions with minimal human interference.

Big data is about the strategy to deal with vast chunks of incomprehensible data sets. There is now awareness across industries that traditional methods of data stores and processing power like databases, files, mainframes or even mundane caching cannot be used as a solution for Big data. Still, the existing models do not address capabilities of processing and analysis of data, integration with events, and real time analytics, all in split second intervals.

On the other hand, Complex Event Processing (CEP) has evolved to provide solutions that utilize in-memory data grids for analyzing trends, patterns and events in real time, with assessments in a matter of milliseconds.

However, Event Clouds, a byproduct of using CEP techniques, can be further leveraged to monitor for unforeseen conditions birthing, or even the emergence of an unknown-unknown, creating early awareness and a potential first mover advantage for the savvy organization.

To set the context of the paper we attempt to highlight how CEP with in-memory data grid technologies helps in pattern detection, matching, analysis, processing and decision making in split seconds with the usage of Big data. This model should serve any industry function where time is of the essence, Big data is at the core and CEP acts as the mantle. Later, we propose treating an Event Cloud as more than just an event collection bucket used for event pattern matching, or as simply the immediate memory store of an exo-cortex for machine learning; an Event Cloud is also a robust corpus with its own intrinsic characteristics that can be measured, quantified, and leveraged for advantage. For example, by automating the detection of a shift away from an Event Cloud's steady state, the emergence of a previously unconsidered situation may be observed. It is this application, programmatically discerning the shift away from an Event Cloud's normative state, which is explored in this paper.

CEP AS A REAL TIME MODEL FOR BIG DATA: SOME RELEVANT CASES

In current times, traffic updates are integrated with cities' traffic control systems as well as many global positioning service (GPS) electronic receivers used quite commonly by drivers. These receivers automatically adjust and reroute in case the normal route is traffic ridden. This helps, but the solution is reactionary. Many technology companies are investing in pursuit of the holy grail of a solution to detect and predict traffic blockages and take proactive action to control the traffic itself and even avoid mishaps. For this there is a need to analyze traffic data over different parameters such as rush hour, accidents, seasonal impacts of snow, thunderstorms, etc., and come up with predictable patterns over years and decades. Second is the application of this pattern to input conditions. All this requires huge data crunching and analyses and, on top of it, a real time application such as CEP.

Big data has already taken on importance in the financial market, particularly in high frequency trading. Since the 2008 economic downturn and its rippling effects on the stock market, the volume of trade has come down at all the top exchanges such as New York, London, Singapore, Hong Kong or Mumbai. But the contrasting factor is the rise in High Frequency Trading (HFT). It is claimed that around 70% of all equity trades were accounted for by HFT in 2010 versus 10% in 2000. HFT is 100% dependent on technology and the trading strategies are developed out of complex algorithms. Only those traders that have developed a better strategy and can crunch more data in less time will have a better win ratio. This is where CEP could be useful.

The healthcare industry in the USA is set to undergo a rapid change with the Affordable Care Act. Healthcare insurers are expected to see an increase in their costs due to the increased risks of covering more individuals, and legally cannot deny insurance for pre-conditions. Hospitals are expected to see more patient data, which means increased analyses, and pharmaceutical companies need better integration with the insurers and consumers to have speedier and more accurate settlements. Even though most of these transactions can be performed on a non-real time basis, technology still needs both Big data and complex processing for a scalable solution.

In India the outstanding cases in the various judicial courts touch 32 million.

In the USA, family based cases and immigration related ones are piling up waiting for a hearing. Judicial pendency has left no country untouched. Scanning through various federal, state and local law points, past rulings, class suits, individual profiles, evidence details, etc., is required to put forward the cases for the parties involved, and the winner is the one who is able to present a better analysis of the available facts. Can technology help in addressing such problems across nations?

All of these cases across such diverse industries showcase the importance of processing gigantic amounts of data and also the need to have the relevant information churned out at the right time.

WHY AND WHERE BIG DATA

Big data has evolved due to the existing limitations of current technologies. Two-tier or multi-tier architecture, with even a high performing database at one end, is not enough to analyze and crunch such colossal information in the desired time frames. The fastest databases today are benchmarked at terabytes of information, as noted by the Transaction Processing Council. Volumes of exa- and zettabytes of data need a different technology. Analysis of unstructured data is another criterion for the evolution of Big data. Information available as part of health records, geo maps and multimedia (audio, video and picture) is essential for many businesses, and mining such unstructured sets requires storage power as well as transaction processing power. Add to this the variety of sources such as social media, legacy systems, vendor systems, localized data, and mechanical and sensor data. Finally there is the critical component of speed to get the data through the steps of Unstructured → Structured → Storage → Mine → Analyze → Process → Crunch → Customize → Present.

BIG DATA METHODOLOGIES: SOME EXAMPLES

The Apache™ Hadoop™ project [2] and its relatives such as Avro™, ZooKeeper™, Cassandra™ and Pig™ provided the non-database form of technology as the way to solve problems with massive data. It used distributed architecture as the foundation to remove the constraints of traditional constructs. Both data (storage, transportation) and processing (analysis, conversion, formatting) are distributed in this architecture. Figure 1 and Figure 2 compare the traditional vs. distributed architecture.

[Figure 1: Conventional Multi-Tier Architecture (client tier, middle tier with validation, enrichment, transformation, standardization, routing and operation, and a server tier). Source: Infosys Research]

[Figure 2: Distributed Multi-Nodal Architecture (client tier with distributed data nodes and processing nodes). Source: Infosys Research]

A key advantage of distributed architecture is scalability. Nodes can be added without affecting the design of the underlying data structures and processing units. IBM has even gone a step ahead with Watson [5], the famous artificial intelligence computer which can learn as it gets more information and patterns for decision making. Similarly IBM [6], Oracle [7], Teradata [8] and many leading software providers have created Big data methodologies as an impetus to help enterprise information management.

VELOCITY PROBLEM IN BIG DATA

Even though we clearly see the benefits of Big data, and its architecture can easily be applied to any industry, there are some limitations that are not easily perceivable. A few pointers:

■■ Can Big data help a trader arrive at the best win scenarios based on millions and even billions of computations of multiple trading parameters in real time?

■■ Can Big data forecast traffic scenarios based on sensor data, vehicle data, seasonal change and major public events, and provide alternate paths to drivers through their GPS devices in real time, helping both city officials and drivers to save time?

■■ Can Big data detect fraud scenarios by running through multiple shopping patterns of a user in historical data and matching them with the current transaction in real time?

■■ Can Big data provide real time analytical solutions out of the box and support predictive analytics?

There are multiple business scenarios in which data has to be analyzed in real time. These data are created, updated and transferred because of real time business or system level events. Since the data is in the form of real time events, this requires a paradigm shift in the methodology of how data is viewed and analyzed. Real time data analysis in such cases means that data has to be analyzed before the data hits the disk. The difference between 'event' and 'data' just vanishes. In such cases across the industry, where Big data is unequivocally needed to manage the data but, to use this data effectively, integrate with real time events and provide the business with express results, a complementary technology is required, and that's where CEP can fit in.

VELOCITY PROBLEM: CEP AS A SOLUTION

The need here is the analysis of data arriving in the form of real time event streams and the identification of patterns or trends based on vast historical data. Adding to the complexity are other real time events. The vastness is solved with Big data, and real time analysis of multiple events, pattern detection and appropriate matching and crunching is solved by CEP. Real time event analysis ensures that duplicates and synchronization issues are avoided, as data is still in flight and storage is still a step away. Similarly it facilitates predictive analysis of data by means of pattern matching and trending. This enables the enterprise to provide early warning signals and take corrective measures in real time itself. The reference architecture of traditional CEP is shown in Figure 3.

[Figure 3: Complex Events Processing Reference Architecture: event generation and capture (event catalogs, originator and domain object models), event modeling and management, event pre-filtering and preprocessing, an event processing engine (refine, aggregate and correlate, apply patterns and domain specific algorithms, actions, visualization), CEP languages and user tools, monitoring and administration, security and authentication, scalability, memory management, storage and persistence options, a metadata repository, and event consumers. Source: Infosys Research]

CEP's original objective was to provide processing capability similar to Big data, with distributed architecture and in-memory grid computing. The difference was that CEP was to handle multiple, seemingly unrelated events and correlate them to provide a desired and meaningful output. The backbone of CEP, though, can be traditional architectures such as multi-tier technologies, with CEP usually in the middle tier.

Figure 4 shows how CEP on Big data solves the velocity problem and complements the overall information management strategy for any enterprise that aims to use Big data. CEP can utilize Big data particularly through highly scalable in-memory data grids to store the raw feeds, events of interest and detected events, and analyze this data in real time by correlating it with other in-flight events. Fraud detection is a very apt example, where historic data of the customer's transactions, his usage profile, location, etc., is stored in the in-memory data grid and every new event (transaction) from the customer is analyzed by the CEP engine by correlating and applying patterns on the event data against the historic data stored in the memory grid.

There are multiple scenarios, some of them outlined through this paper, where CEP complements Big data and other offline analytical approaches to accomplish an active and dynamic event analytics solution.

EVENT CLOUDS AND DETECTION TECHNIQUES

CEP and Event Clouds
A linearly ordered sequence of events is called an event stream [9]. An event stream may contain many different types of events, but there must be some aspect of the events in the event stream that allows for a specific ordering. This is typically an ordering via timestamp.

[Figure 4: CEP on Big Data: the CEP reference architecture augmented with a query agent, an in-memory DB or data grid, and a write connector to Big data storage, feeding dashboards and event consumers. Source: Infosys Research]

By watching for event patterns of interest in an event stream, such as multiple usages of the same credit card at a gas station within a 10 minute window, systems can respond with predefined business driven behaviors, such as placing a fraud alert on the suspect credit card.
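A rules based engine watching for that particular pattern might boil down to something like the sketch below: a per-card sliding window over gas-station swipes that raises an alert once a threshold is crossed within ten minutes. The event fields and the threshold of three usages are assumptions made for illustration only.

# Sketch: detect repeated use of the same credit card at gas stations within a
# 10 minute window (field names and the threshold are illustrative assumptions).
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60
THRESHOLD = 3                       # swipes within the window that trigger an alert

recent_swipes = defaultdict(deque)  # card_id -> timestamps of recent gas-station swipes

def on_event(event):
    """event: {'card_id': str, 'merchant_type': str, 'timestamp': float (epoch seconds)}"""
    if event["merchant_type"] != "gas_station":
        return None
    window = recent_swipes[event["card_id"]]
    window.append(event["timestamp"])
    # Drop swipes that have fallen out of the 10 minute window.
    while window and event["timestamp"] - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= THRESHOLD:
        return {"action": "place_fraud_alert", "card_id": event["card_id"]}
    return None

print(on_event({"card_id": "c-42", "merchant_type": "gas_station", "timestamp": 100.0}))
print(on_event({"card_id": "c-42", "merchant_type": "gas_station", "timestamp": 220.0}))
print(on_event({"card_id": "c-42", "merchant_type": "gas_station", "timestamp": 400.0}))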

An Event Cloud is "a partially ordered set of events (POSET), either bounded or unbounded, where the partial orders are imposed by the causal, timing and other relationships between events" [10]. As such, it is a collection of events within which the ordering of events may not be possible. Further, there may or may not be an affinity of the events within a given Event Cloud. If there is an affinity, it may be as broad as "all events of interest to our company" or as specific as "all events from the emitters located at the back of the building." Event Clouds and event streams may contain events from sources outside of an organization, such as stock market trades or tweets from a particular Twitter user. Event Clouds and event streams may have business events, operational events, or both. Strictly speaking, an event stream is an Event Cloud, but an Event Cloud may or may not be an event stream, as dictated by the ordering requirement.

Typically, a landscape with CEP capabilities will include three logical units: (i) emitters that serve as sources of events, (ii) a CEP engine, and (iii) targets to be notified under certain event conditions. Sources can be anything from an application to a sensor to even the CEP engine itself. CEP engines, which are the heart of the system, are implemented in one of two fundamental ways. Some follow the paradigm of being rules based, matching on explicitly stated event patterns using algorithms like Rete, while other CEP engines use the more sophisticated event analytics approach, looking for probabilities of event patterns emerging using techniques like Bayesian classifiers [11].

In either case, rules or analytics, some consideration of what is of interest must be identified up front. Targets can be anything from dashboards to applications to the CEP engine itself.

Users of the system, using the tools provided by the CEP provider, articulate events and patterns of events that they are interested in exploring, observing, and/or responding to. For example, a business user may indicate to the system that for every sequence wherein a customer asks about a product three times but does not invoke an action that results in a buy, the system is to provide some promotional material to the customer in real time. As another example, a technical operations department may issue event queries to the CEP engine, in real time, asking about the number of server instances being brought online and the probability that there may be a deficit in persistence storage to support the servers.

Focusing on events, while extraordinarily powerful, biases what can be cognized. That is, what you can think of, you can explore. What you can think of, you can respond to. However, by adding the Event Cloud, or event stream, to the pool of elements being observed, emergent patterns not previously considered can be brought to light. This is the crux of this paper: using the Event Cloud as a porthole into unconsidered situations emerging.

EVENT CLOUDS HAVE FORM

As represented in Figure 5, there is a point wherein events flowing through a CEP engine are unprocessed. This point is an Event Cloud, which may or may not be physically located within a CEP engine's memory space. This Event Cloud has events entering its logical space and leaving it. The only bias to the events travelling through the CEP engine's Event Cloud is based on which event sources are serving as inputs to the particular CEP engine. For environments wherein all events, regardless of source, are sent to a common CEP engine, there is no bias of events within the Event Cloud.

There are a number of attributes of the Event Cloud that can be captured, depending upon a particular CEP's implementation.

[Figure 5: CEP Engine Components: input adapters feed filtered events into the Event Cloud, where they are unioned, correlated, matched and have rules applied before reaching output adapters. Source: Infosys Research]

For example, if an Event Cloud is managed in memory and is based on a time window, e.g., events of interest only stay within consideration by the engine for a period of time, then the number of events contained within an Event Cloud can be counted. If the structure holding an Event Cloud expands and contracts with the events it is funneling, then the memory footprint of the Event Cloud can be measured. In addition to the number of events and the memory size of the containing unit, the counts of the event types themselves that happen to be present at a particular time within the Event Cloud become a measurable characteristic. These properties, viz., memory size, event counts, and event types, can serve as measurable characteristics describing an Event Cloud, giving it a size and shape (Figure 6).

[Figure 6: Event Cloud. The events traversing an Event Cloud at any particular moment give it shape and size. Source: Infosys Research]
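Those three properties can be captured with a very small amount of code. The snapshot below measures an in-memory Event Cloud held as a plain Python list; the event structure is assumed, and sys.getsizeof is only a rough stand-in for a real memory footprint measurement.

# Sketch: measure the "shape" of an Event Cloud instance: total event count,
# counts per event type, and an approximate memory footprint.
import sys
from collections import Counter

def event_cloud_shape(events):
    """events: list of dicts, each assumed to carry an 'event_type' field."""
    return {
        "event_count": len(events),
        "type_counts": Counter(e["event_type"] for e in events),
        # Rough proxy for the memory size of the containing structure.
        "approx_size_bytes": sys.getsizeof(events) + sum(sys.getsizeof(e) for e in events),
    }

cloud = [{"event_type": "Ask"}, {"event_type": "Ask"}, {"event_type": "Buy"},
         {"event_type": "Look"}, {"event_type": "Ask"}]
print(event_cloud_shape(cloud))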

EVENT CLOUD STEADY STATE

The properties of an Event Cloud that give it form can be used to measure its state. By collecting its state over time, a normative operating behavior can be identified and its steady state can be determined. This steady state is critical when watching for unpredicted patterns. When a new flow pattern of events causes an Event Cloud's shape to shift away from its steady state, a situation change has occurred (Figure 7). When these steady state deviations happen, and no new matching patterns or rules are being invoked, then an unknown-unknown may have emerged. That is, something significant enough to adjust your system's operating characteristics has occurred, yet it isn't being acknowledged in some way. Either it has been predicted but determined to not be important, or it was simply not considered.

[Figure 7: Event Cloud Shift. The shape shifts as new patterns occur. Source: Infosys Research]

ANOMALY DETECTION APPLIED TO EVENT CLOUD STEADY STATE SHIFTS

Finding patterns in data that do not match a baseline pattern is the realm of anomaly detection. As such, by using the steady state of an Event Cloud as the baseline, we can apply anomaly detection techniques to discern a shift. Table 1 presents a catalog of various anomaly detection techniques that are applicable to Event Cloud shift discernment. This list isn't meant to serve as an exhaustive compilation, but rather to showcase the variety of possibilities.

Table 1: Applicability of Anomaly Detection Techniques to Event Cloud Steady State Shifts (Source: Derived from Anomaly Detection: A Survey [12])

Classification Based | Example constituent techniques: Neural Networks, Bayesian Networks, Support Vector Machines, Rule based | Challenge: accurately labeled training data for the classifiers is difficult to obtain
Nearest Neighbour Based / Clustering Based | Example constituent techniques: Distance to kth Nearest Neighbour, Relative Density | Challenge: defining meaningful distance measures is difficult
Statistical | Example constituent techniques: Parametric, Non-Parametric | Challenge: histogram approaches miss unique combinations
Spectral | Example constituent techniques: Low Variance PCA, Eigenspace Based | Challenge: high computational complexity

Each algorithm has its own set of strengths, such as simplicity, speed of computation, and certainty scores. Each algorithm, likewise, has weaknesses, including computational demands, blind spots in data deviations, and difficulty in establishing a baseline for comparison. All of these factors must be considered when selecting an appropriate algorithm.

Using the three properties defined for an Event Cloud's shape (event counts, event types, and Event Cloud size) combined with time properties, we have a multivariate data instance with three continuous types, viz., counts, sizes, and time, and one categorical type, viz., event types. These four dimensions, and their characteristics, become a constraint on which anomaly detection algorithms can be applied [13].

The anomaly type being detected is also a constraint. In this case, the Event Cloud deviations are being classified as a collective anomaly. It is a collective anomaly, as opposed to a point anomaly or context anomaly, because we are comparing a collection of data instances that form the Event Cloud shape with a broader set of all data instances that formed the Event Cloud steady state shape.

Statistical algorithms lend themselves well to anomaly detection when analyzing continuous and categorical data instances. Further, knowing an Event Cloud's steady state shape a priori isn't assumed, so the use of a non-parametric statistical model is appropriate [13]. Therefore, the technique of statistical profiling using histograms is explored as an example implementation approach for catching a steady state shift.

One basic approach to trap the moment of an Event Cloud's steady state shift is to leverage a histogram based on each event type, with the number of times a particular count of an event type shows up in a given Event Cloud instance becoming a basis for comparison. The histogram generated over time would then serve as the baseline steady state picture of normative behavior. Individual instances of an Event Cloud's shape could then be compared to the Event Cloud's steady state histogram to discern if a deviation has occurred. That is, does the particular Event Cloud instance contain counts of events that have rarely, or never, appeared in the Event Cloud's history?

Figure 8 represents the case, with a steady state histogram on the left and the Event Cloud comparison instance on the right. In this depiction the histogram shows, as an example, that three Ask Events were contained within an Event Cloud instance exactly once in the history of this Event Cloud.

The Event Cloud instance on the right, which will be compared, shows that the instance has six Ask Events in its snapshot state.

[Figure 8: Event Cloud Histogram and Instance Comparison: an Event Cloud steady state histogram of Look, Ask and Buy event counts alongside an Event Cloud comparison instance. Source: Infosys Research]

An anomaly score for each event type is calculated by comparing each Event Cloud instance event type count to the event type quantity occurrence bins within the Event Cloud steady state histogram, and then these individual scores are combined into an aggregate score [13]. This aggregate score then becomes the basis upon which a judgment is made regarding whether a deviation has occurred or not.
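A minimal version of that histogram comparison is sketched below. For each event type it records, over history, how often each per-instance count has been seen; a new instance is then scored by how rare its per-type counts are, and the per-type scores are summed into the aggregate score. The scoring formula and the deviation threshold are illustrative choices, not the ones prescribed by [13].

# Sketch: histogram-based steady state profile of an Event Cloud and a simple
# aggregate anomaly score for a new instance (formula and threshold are
# illustrative assumptions).
from collections import defaultdict

class SteadyStateHistogram:
    def __init__(self):
        # histogram[event_type][count] = number of historical instances in which
        # exactly `count` events of that type were present.
        self.histogram = defaultdict(lambda: defaultdict(int))
        self.instances_seen = 0

    def add_instance(self, type_counts):
        self.instances_seen += 1
        for event_type, count in type_counts.items():
            self.histogram[event_type][count] += 1

    def anomaly_score(self, type_counts):
        """Higher score = the per-type counts were rarely (or never) seen before."""
        score = 0.0
        for event_type, count in type_counts.items():
            seen = self.histogram[event_type].get(count, 0)
            score += 1.0 - (seen / max(self.instances_seen, 1))
        return score

baseline = SteadyStateHistogram()
for snapshot in [{"Ask": 3, "Buy": 1}, {"Ask": 2, "Buy": 1, "Look": 1}, {"Ask": 3, "Buy": 2}]:
    baseline.add_instance(snapshot)

new_instance = {"Ask": 6, "Buy": 1}          # six Ask events, never seen before
score = baseline.anomaly_score(new_instance)
print(score, "deviation" if score > 1.0 else "steady state")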

While simple to implement, the primary weakness of using the histogram based approach is that a rare combination of events in an Event Cloud would not be detected if the quantities of the individual events present were in their normal or frequent quantities.

LIMITATIONS OF EVENT CLOUD SHIFTS

Anomaly detection algorithms have blind spots, or situations where they cannot discern an Event Cloud shift. This implies that it is possible for an Event Cloud to shift undetected, under just the right circumstances. However, following the lead suggested by Okamoto and Ishida with immunity-based anomaly detection systems [13], rather than having a single observer detecting when an Event Cloud deviates from steady state, a system could have multiple observers, each with their own techniques and approaches applied. Their individual results could then be aggregated, with varying weights applied to each technique, to render a composite Event Cloud steady state shift score. This will help remove the chances of missing a state change shift.
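One way to read that suggestion in code is that each observer contributes its own shift score and the composite is a weighted combination; the observer names, weights and 0-to-1 score convention below are assumptions made for illustration.

# Sketch: combine several observers' Event Cloud shift scores into one composite
# score using per-technique weights (names, weights and scale are illustrative).
def composite_shift_score(observer_scores, weights):
    """observer_scores and weights: dicts keyed by observer name; scores in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * observer_scores.get(name, 0.0)
               for name in weights) / total_weight

scores = {"histogram": 0.8, "nearest_neighbour": 0.4, "spectral": 0.1}
weights = {"histogram": 0.5, "nearest_neighbour": 0.3, "spectral": 0.2}
print(composite_shift_score(scores, weights))   # 0.54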

With the approach outlined by this paper, the scope of indicators is such that you get an early indicator that something new is emerging, and nothing more. Noticing an Event Cloud shift only indicates that a situational change has occurred; it does not identify or highlight what the root cause of the change is, nor does it fully explain what is happening. Analysis is still required to determine what initiated the shift along with what opportunities for exploitation may be present.

FURTHER RESEARCH

Many enterprise CEP implementations are architected in layers, wherein event abstraction hierarchies, event pattern maps and event processing networks are used in concert to increase the visibility aspects of the system [14] as well as to help with overall performance by allowing for the segmenting of event flows. In general, each layer going up the hierarchy is an aggregation of multiple events from its immediate child layer. With the lowest layer containing the finest grained events and the highest layer containing the coarsest grained events, the Event Clouds that manifest at each layer are likewise of varying granularity (Figure 9). Therefore a noted Event Cloud steady state shift at the lowest layer represents the finest granularity shift that can be observed.

An Event Cloud's steady state shifts at the highest layer represent the coarsest steady state shifts that can be observed. Techniques for interleaving individual layer Event Cloud steady state shifts, along with the opportunities and consequences of their mixed granularity, can be explored.

[Figure 9: Event Hierarchies: CEP in layers, with Event Clouds of varying granularity manifesting at each layer. Source: Infosys Research]

The technique presented in this paper is designed to capture the beginnings of a situational change not explicitly coded for. With the recognition of a new situation emerging, the immediate task is to discern what is happening and why, while it is unfolding. Further research can be done to discern which elements available from the steady state shift automated analysis would be of value to help an analyst, business or technical, unravel the genesis of the situation change. By discovering what change information is of value, not only can an automated alert be sent to interested parties, but it can contain helpful clues on where to start the analysis.

CONCLUSION

It would be an understatement to say that without the right set of systems, methodologies, controls, checks and balances on data, no enterprise can survive. Big data solves the problem of the vastness and multiplicity of the ever rising information in this information age. What Big data does not fulfill is the complexity associated with real time data analysis. CEP, though designed purely for events, complements the Big data strategy of any enterprise.

The Event Cloud, a constituent component of CEP, can be used for more than its typical application. By treating it as a first class citizen of indicators, and not just a collection point computing construct, a company can gain insight into the early emergence of something new, something previously not considered, and potentially the birthing of an unknown-unknown.

With organizations growing in their usage of Big data, and the desire to move closer to real time response, companies will inevitably leverage the CEP paradigm. The question will be, do they use it as everyone else does, triggering off of conceived patterns, or will they exploit it for unforeseen situation emergence? When the situation changes, the capability is present and the data is present, but are you?

REFERENCES
1. WSJ article on Big data. Available at http://online.wsj.com/article/SB10000872396390443890304578006252019616768.html.
2. Transaction Processing Council Benchmark comparison of leading databases. Available at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp.
3. Transaction Processing Council Benchmark comparison of leading databases. Available at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp.
4. Apache Hadoop project site. Available at http://hadoop.apache.org/.
5. IBM Watson – Artificial intelligent super computer's home page. Available at http://www-03.ibm.com/innovation/us/watson/.

6. IBM's Big data initiative. Available at http://www-01.ibm.com/software/data/bigdata/.
7. Oracle's Big data initiative. Available at http://www.oracle.com/us/technologies/big-data/index.html.
8. Teradata Big data Analytics offerings. Available at http://www.teradata.com/business-needs/Big-Data-Analytics/.
9. Luckham, D. and Schulte, R. (2011), Event Processing Glossary – Version 2.0. Available at http://www.complexevents.com/2011/08/23/event-processing-glossary-version-2-0/.
10. Bass, T. (2007), What is Complex Event Processing? TIBCO Software Inc.
11. Bass, T. (2010), Orwellian Event Processing. Available at http://www.thecepblog.com/2010/02/28/orwellian-event-processing/.
12. Chandola, V., Banerjee, A. and Kumar, V. (2009), Anomaly Detection: A Survey, ACM Computing Surveys.
13. Okamoto, T. and Ishida, Y. (2009), An Immunity-Based Anomaly Detection System with Sensor Agents, Sensors, ISSN 1424-8220.
14. Luckham, D. (2002), The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems, Addison Wesley, Boston.
15. Vincent, P. (2011), ACM Overview of BI Technology misleads on CEP. Available at http://www.thetibcoblog.com/2011/07/28/acm-overview-of-bi-technology-misleads-on-cep/.
16. About Esper and NEsper FAQ. Available at http://esper.codehaus.org/tutorials/faq_esper/faq.html#what-algorithms.
17. Ide, T. and Kashima, H. (2004), Eigenspace-based Anomaly Detection in Computer Systems, Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August, pp. 22-25.

Infosys Labs Briefings VOL 11 NO 1 2013

Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja

Validate data quality by employing a structured testing technique

Testing Big data is one of the biggest challenges faced by organizations because of a lack of knowledge on what to test and how much data to test. Organizations have been facing challenges in defining test strategies for structured and unstructured data validation, setting up an optimal test environment, working with non-relational databases and performing non-functional testing. These challenges result in poor quality of data in production, delayed implementation and increased cost. A robust testing approach needs to be defined for validating structured and unstructured data, and testing needs to start early to identify possible defects early in the implementation life cycle and to reduce the overall cost and time to market.

Different testing types like functional and non-functional testing are required, along with strong test data and test environment management, to ensure that the data from varied sources is processed error free and is of good quality to perform analysis. Functional testing activities like validation of the map reduce process, structured and unstructured data validation, and data storage validation are important to ensure that the data is correct and is of good quality. Apart from functional validations, non-functional testing like performance and failover testing plays a key role in ensuring that the whole process is scalable and happens within the specified SLA.

Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop map reduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes.
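To make the Map/Reduce idea above concrete, here is a small, self-contained Python sketch of the map, shuffle and reduce phases for a word count. It is purely illustrative: it runs in a single process and is not Hadoop, Pig or Hive code, and the sample input fragments are invented for the example.

from collections import defaultdict

def map_phase(fragment):
    """Map: emit (word, 1) pairs for one small fragment of the input."""
    for word in fragment.split():
        yield word.lower(), 1

def shuffle(mapped_pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key."""
    return key, sum(values)

# The job is divided into small fragments of work; on a real cluster each
# fragment could be executed (or re-executed on failure) on any node.
fragments = ["big data testing", "testing big data quality", "data quality"]
mapped = [pair for frag in fragments for pair in map_phase(frag)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)   # {'big': 2, 'data': 3, 'testing': 2, 'quality': 2}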

Figure 1 shows the step by step process of how Big data is processed using the Hadoop ecosystem.

[Figure 1: Big Data Testing Focus Areas. Step 1: load source data files into HDFS; Step 2: perform Map Reduce operations; Step 3: extract the output results from HDFS. Source: Infosys Research]

The first step, loading source data into HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data and tools like Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing map reduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves extracting the data output generated in the second step and loading it into downstream systems, which can be an enterprise data warehouse for generating analytical reports or any of the transactional systems for further processing.

BIG DATA TESTING APPROACH
As we are dealing with huge data and executing on multiple nodes, there are high chances of having bad data and data quality issues at each stage of the process. Data functional testing is performed to identify data issues arising from coding errors or node configuration errors. Testing should be performed at each of the three phases of Big data processing to ensure that the data is getting processed without any errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of the Hadoop Map Reduce process data output; and (iii) validation of the data extract and load into the EDW. Apart from these functional validations, non-functional testing, including performance testing and failover testing, needs to be performed.

Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.

Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data, etc., is extracted based on the requirements and loaded into HDFS before processing it further.

[Figure 2: Big Data architecture, with testing focus areas highlighted. Source data (web logs, streaming data, social data, transactional RDBMS data) is loaded into HDFS, for example using Sqoop (focus area 1: pre-Hadoop process validation); jobs are executed with Map Reduce, Pig, Hive and HBase (NoSQL DB) (focus area 2: Map-Reduce process validation); processed data moves through the ETL process into the Enterprise Data Warehouse (focus area 3: ETL process validation); Big Data Analytics reports are produced using BI tools (focus area 4: reports testing); non-functional testing (performance, failover) spans the stack. Source: Infosys Research]

Issues: Some of the issues that we face during this phase of the data moving from source systems to Hadoop are incorrect data captured from source systems, incorrect storage of data, and incomplete or incorrect replication.

Validations: Some high level scenarios that need to be validated during this phase include:
1. Comparing the input data files against source systems data to ensure the data is extracted correctly,
2. Validating the data requirements and ensuring the right data is extracted,
3. Validating that the files are loaded into HDFS correctly, and
4. Validating that the input files are split, moved and replicated across different data nodes.
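A minimal sketch of the kind of reconciliation behind the first and third checks above: comparing record counts and per-record key fingerprints between a source extract and the files loaded into HDFS. It assumes the HDFS files have already been pulled to the test machine (for example with the standard hdfs dfs -get command); the file names and the key column are invented for the illustration.

import csv, glob, hashlib

def row_fingerprints(paths, key_field="customer_id"):
    """Count rows and hash each record's key so source and HDFS copies can be reconciled."""
    count, digests = 0, set()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                count += 1
                digests.add(hashlib.md5(row[key_field].encode()).hexdigest())
    return count, digests

# Source extract vs. the (locally copied) HDFS part files it was split into.
src_count, src_keys = row_fingerprints(["source_extract.csv"])
hdfs_count, hdfs_keys = row_fingerprints(glob.glob("hdfs_copy/part-*.csv"))

print("row counts match:", src_count == hdfs_count)
print("keys missing in HDFS:", len(src_keys - hdfs_keys))
print("unexpected keys in HDFS:", len(hdfs_keys - src_keys))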

Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.

Issues: Some issues that we face during this phase of data processing are coding issues in map-reduce jobs, jobs working correctly when run on a standalone node but incorrectly when run on multiple nodes, incorrect aggregations, node configuration issues, and incorrect output format.

Validations: Some high level scenarios that need to be validated during this phase include:
1. Validating that data processing is completed and the output file is generated,

2. Validating the business logic on a standalone node and then validating it after running against multiple nodes,
3. Validating the map reduce process to verify that key-value pairs are generated correctly,
4. Validating the aggregation and consolidation of data after the reduce process,
5. Validating the output data against the source files and ensuring the data processing is completed correctly, and
6. Validating the output data file format and ensuring that the format is as per the requirement.
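One way to automate checks 4 and 5 above is to recompute the expected aggregates directly from the source files with an independent script and compare them against the job's consolidated output. The sketch below does this for a simple sum-by-key output; the file names, field names and the tab-separated output format are assumptions made for the illustration.

import csv
from collections import defaultdict

def expected_totals(source_csv, key_field="store_id", amount_field="amount"):
    """Independently recompute the aggregation the map reduce job is supposed to produce."""
    totals = defaultdict(float)
    with open(source_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[key_field]] += float(row[amount_field])
    return dict(totals)

def job_output_totals(output_file):
    """Parse the consolidated reducer output, assumed to be 'key<TAB>total' lines."""
    totals = {}
    with open(output_file) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            totals[key] = float(value)
    return totals

expected = expected_totals("source_transactions.csv")
actual = job_output_totals("mapreduce_output/part-r-00000")
mismatches = {k: (expected.get(k), actual.get(k))
              for k in set(expected) | set(actual)
              if abs(expected.get(k, 0.0) - actual.get(k, 0.0)) > 0.01}
print("aggregation mismatches:", mismatches or "none")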

Validation of Data Extract and Load into EDW
Once the map-reduce process is completed and the data output files are generated, this processed data is moved to the enterprise data warehouse or any other transactional system, depending on the requirement.

Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect load of HDFS files into the EDW, and incomplete data extract from Hadoop HDFS.

Validations: Some high level scenarios that need to be validated during this phase include:
1. Validating that transformation rules are applied correctly,
2. Validating that there is no data corruption, by comparing target table data against HDFS file data,
3. Validating the data load in the target system,
4. Validating the aggregation of data, and
5. Validating the data integrity in the target system.

Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from the EDW or by running queries on Hive.

Issues: Some of the issues faced while generating reports are report definitions not set as per the requirement, report data issues, and layout and format issues.

Validations: Some high level validations performed during this phase include:

Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.

Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing would involve ensuring all objects are rendered properly and that the resources on the webpage are current. The data fetched from the various web parts is validated against the databases.

VOLUME, VARIETY AND VELOCITY: HOW TO TEST?
In the earlier sections we have seen step by step details of what needs to be tested at each phase of Big data processing. During these phases, the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.

Volume: The amount of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year [3]. Huge volumes of data flow from multiple systems and need to be processed and analyzed. When it comes to validation, it is a big challenge to ensure that the whole processed data set is correct. Manually validating the whole data set is a tedious task, so compare scripts should be used to validate the data. As data stored in HDFS is in file format, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools it takes a lot of time to do a 100% data comparison. To reduce the execution time we can either run all the comparison scripts in parallel on multiple nodes, just as the data itself is processed by the Hadoop map-reduce process, or sample the data while ensuring maximum scenarios are covered.

Figure 3 shows the approach for comparing voluminous amounts of data. Data is converted into the expected result format and then compared with the actual data using compare tools. This is a faster approach but involves initial scripting time, and it reduces subsequent regression testing cycle time. When there is not enough time to validate the complete data, sampling can be done for validation.
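A minimal Python sketch of the comparison flow just described and shown in Figure 3: job output is normalized into the same layout as a pre-built expected-results file, compared file by file, and the differences are written to a discrepancy report. The file names, the tab-separated layout and the directory structure are assumptions for the illustration.

import glob, os

def normalize(path):
    """Load a tab-separated file into a canonical, sorted list of records for comparison."""
    with open(path) as f:
        return sorted(tuple(line.rstrip("\n").split("\t")) for line in f if line.strip())

def compare_files(expected_path, actual_path, report):
    expected, actual = set(normalize(expected_path)), set(normalize(actual_path))
    for rec in sorted(expected - actual):
        report.write(f"MISSING IN ACTUAL\t{os.path.basename(actual_path)}\t{rec}\n")
    for rec in sorted(actual - expected):
        report.write(f"UNEXPECTED IN ACTUAL\t{os.path.basename(actual_path)}\t{rec}\n")

# File-by-file comparison of expected results against the job's output files.
with open("discrepancy_report.txt", "w") as report:
    for expected_file in sorted(glob.glob("expected_results/*.tsv")):
        actual_file = os.path.join("actual_results", os.path.basename(expected_file))
        compare_files(expected_file, actual_file, report)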

Variety: The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data.

Structured data is data in a defined format coming from different RDBMS tables or from structured files. Data that is of a transactional nature can be handled in files or tables for validation purposes.

[Figure 3: Approach for High Volume Data Validation. Map reduce jobs run in the test environment to generate output data files; custom scripts convert unstructured data to structured data and convert raw data into the expected-results format; a file comparison tool then performs a file-by-file comparison of expected results against actual results and produces a discrepancy report. Source: Infosys Research]

Semi-structured data does not have any defined format, but structure can be derived from the multiple patterns in the data. An example is data extracted by crawling through different websites for analysis purposes. For validation, the data needs to be transformed into a structured format first, using custom built scripts: the pattern needs to be identified, a copy book or pattern outline needs to be prepared, and this copy book is then used in scripts to convert the incoming data into a structured format, after which validations are performed using compare tools.

Unstructured data is data that does not have any format and is stored in documents, web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting, such as Pig scripting, as shown in Figure 3. But the overall coverage achieved through automation will be very low because of the unexpected behavior of the data: input data can be in any form and changes every time a new test is performed. We therefore need to deploy a business scenario validation strategy for unstructured data. In this strategy we identify the different scenarios that can occur in our day to day unstructured data analysis, and test data is set up based on these scenarios and executed.
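As an illustration of the copy book/pattern-outline idea described above, the sketch below uses a small dictionary of regular-expression patterns to turn semi-structured web-log lines into structured records that a compare tool can work with. The patterns and sample lines are invented for the example; a real copy book would be derived from the actual source feeds, and the conversion could equally be written in Pig, as the paper suggests.

import csv, re

# "Copy book": the named patterns that give the incoming feed its structure (illustrative only).
COPY_BOOK = {
    "timestamp": r"\[(?P<timestamp>[^\]]+)\]",
    "user":      r"user=(?P<user>\w+)",
    "action":    r"action=(?P<action>\w+)",
    "status":    r"status=(?P<status>\d{3})",
}

def to_structured(line):
    """Apply every pattern in the copy book; unmatched fields are left empty."""
    record = {}
    for field, pattern in COPY_BOOK.items():
        match = re.search(pattern, line)
        record[field] = match.group(field) if match else ""
    return record

raw_lines = [
    "[2013-01-15 10:02:11] user=amit action=login status=200",
    "[2013-01-15 10:02:15] user=priya action=search status=500 extra=ignored",
]

with open("structured_weblog.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(COPY_BOOK))
    writer.writeheader()
    writer.writerows(to_structured(line) for line in raw_lines)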

Velocity: The speed at which new data is being created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance in order to overcome performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and in verifying that the system can handle high velocity streaming data.

NON-FUNCTIONAL TESTING
In the earlier sections we have seen how functional testing is performed at each phase of Big data processing; those tests are performed to identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.

Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance is degraded. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost. Hence, performance testing plays a key role in any Big data project due to the huge volume of data and the complex architecture.

Some of the areas where performance issues can occur are imbalance in input splits, redundant shuffles and sorts, and aggregation computations left to the reduce process that could instead be done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and doing performance tests to identify the bottlenecks.

Performance testing is conducted by setting up a huge volume of data and an infrastructure similar to production. Utilities like a Hadoop performance monitoring tool can be used to capture the performance metrics and identify issues.

Performance metrics like job completion time, throughput, and system level metrics like memory utilization, etc., are captured as part of performance testing.
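A small sketch of how such metrics might be captured around a test run and checked against an SLA. The job command, record count and SLA figures are invented for the illustration; in practice these numbers would come from the Hadoop job counters and the monitoring utilities mentioned above.

import subprocess, time

SLA = {"max_completion_seconds": 1800, "min_records_per_second": 5000}   # assumed targets

def run_and_measure(job_command, records_processed):
    """Time one test execution of a job and derive simple throughput metrics."""
    start = time.time()
    subprocess.run(job_command, shell=True, check=True)   # e.g. a hadoop/pig/hive invocation
    elapsed = time.time() - start
    return {
        "completion_seconds": round(elapsed, 1),
        "records_per_second": round(records_processed / elapsed, 1),
    }

metrics = run_and_measure("sh run_sample_aggregation_job.sh", records_processed=10_000_000)
breaches = []
if metrics["completion_seconds"] > SLA["max_completion_seconds"]:
    breaches.append("job completion time above SLA")
if metrics["records_per_second"] < SLA["min_records_per_second"]:
    breaches.append("throughput below SLA")
print(metrics, breaches or "within SLA")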

Failover Testing: The Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, all of them connected. There are chances of node failure in which some of the HDFS components become non-functional; the failures can be name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover to proceed with the processing.

Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing continues seamlessly when it is switched to other data nodes. Some validations that need to be performed during failover testing are: validating that checkpoints of the edit logs and FsImage of the name node happen at defined intervals; recovery of the edit logs and FsImage files of the name node; no data corruption because of a name node failure; data recovery when a data node fails; and validating that replication is initiated when one of the data nodes fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing.

TEST ENVIRONMENT SETUP
As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud will also help in optimizing the infrastructure and achieving a faster time to market. The key steps involved in setting up the environment on the cloud are [6]:

A. Big data test infrastructure requirement assessment
1. Assess the Big data processing requirements
2. Evaluate the number of data nodes required in the QA environment
3. Understand the data privacy requirements to evaluate private or public cloud
4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, file system to be used, NoSQL DBs, etc.)

B. Big data test infrastructure design
1. Document the high level cloud test infrastructure design (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the SLAs, communication plan, maintenance plan and environment refresh plan
4. Document the data security plan
5. Document the high level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third party tools required.

C. Big data test infrastructure implementation and maintenance
■■ Create a cloud instance of the Big data test environment
■■ Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design
■■ Perform a smoke test on the environment by processing sample map reduce and Pig/Hive jobs
■■ Deploy the code to perform testing.

BEST PRACTICES
Data Quality: It is very important to establish the data quality requirements for the different forms of data, like traditional data sources, data from social media, data from sensors, etc. If the data quality is ascertained, the transformation logic alone can be tested, by executing tests against all possible data sets.

Data Sampling: Data sampling gains significance in a Big data implementation, and it becomes the testers' job to identify suitable sampling techniques that include all critical business scenarios and the right test data set.

Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated, hence an automated regression test suite should be built for use after each release. This will save a lot of time during Big data validations.

CONCLUSION
Data quality challenges can be countered by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve the testing quality, which will help in identifying defects early and reduce the overall cost of the implementation. It is necessary that organizations invest in building skillsets both in development and in testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skillset, including coding, white-box testing skills and data analysis skills, for it to perform a better job of identifying quality issues in the data.

REFERENCES
1. Big data overview, Wikipedia.org. Available at http://en.wikipedia.org/wiki/Big_data.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big data: Hadoop, Business Analytics and Beyond, A Big data Manifesto from the Wikibon Community, Mar 2012. Available at http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at https://community.informatica.com/solutions/1998.
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 annual technical conference, June 2009. Available at http://static.usenix.org/event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at http://www.infosys.com/IT-services/independent-validation-testing-services/Pages/cloud-based-QA-environments.aspx.

Infosys Labs Briefings VOL 11 NO 1 2013

Nature Inspired Visualization of Unstructured Big Data By Aaditya Prakash

Reconstruct self-organizing maps as spider graphs for better visual interpretation of large unstructured datasets

Exponential growth of data capturing devices has led to an explosion of available data. Unfortunately, not all available data is in a database friendly format. Data which cannot be easily categorized, classified or imported into a database is termed unstructured data. Unstructured data is ubiquitous and is assumed to be around 80% of all data generated [1]. While tremendous advancements have taken place in analyzing, mining and visualizing structured data, the field of unstructured data, especially unstructured Big data, is still at a nascent stage.

The lack of recognizable structure and the huge size make it very challenging to work with large unstructured datasets. Classical visualization methods limit the amount of information presented and are asymptotically slow with rising dimensions of the data. We present here a model to mitigate these problems and allow efficient visualization of very large unstructured datasets.

A novel approach in unsupervised machine learning is the Self-Organizing Map (SOM). Along with classification, SOMs have the added benefit of dimensionality reduction. SOMs are also used for visualizing multidimensional data as a 2D planar diffusion map. This achieves data reduction, thus enabling visualization of large datasets. Present models used to visualize SOM maps lack deductive ability, which may be defeating the power of SOM. We introduce a better restructuring of SOM trained data for more meaningful interpretation of very large data sets.

Taking inspiration from nature, we model the large unstructured dataset as spider cobweb type graphs. This has the benefit of allowing multivariate analysis, as different variables can be presented in one spider graph and their inter-variable relations can be projected, which cannot be done with classical SOM maps.

UNSTRUCTURED DATA
Unstructured data comes in different formats and sizes. Broadly, textual data, sound, video, images, webpages, logs, emails, etc., are categorized as unstructured data. In some cases even a bundle of numeric data can be collectively unstructured, e.g., the health records of a patient: while a table of the cholesterol levels of all patients is fairly structured, all the biostats of a single patient are largely unstructured.

Unstructured data can be of any form and can contain any number of independent variables. Labeling, as is done in machine learning, is only possible with data where information about the variables, such as size, length, dependency, precision, etc., is known. Even extraction of the underlying information in a cluster of unstructured data is very challenging, because it is not known what is to be extracted [2]. Yet the hidden analytics potential within large unstructured datasets could be a valuable asset to any business or research entity. Consider the case of the Enron emails (collected and prepared by the CALO project). Emails are primarily unstructured, mostly because people often reply above the last email even when the new email's content and purpose might be different. Most organizations therefore do not analyze emails or logs, but several researchers analyzed the Enron emails, and their results show that a lot of predictive and analytical information can be obtained from them [3, 4, 5].

SELF ORGANIZING MAPS
The ability to harness increased computing power has been a great boon to business. From traditional business analytics to machine learning, the knowledge we get from data is invaluable. With computing forecast to get faster, perhaps through quantum computing someday, data promises to play an even greater role. While there has been a lot of effort to bring some structure into unstructured data [6], the cost of doing so has been the hindrance. With larger datasets it is an even greater problem, as they entail more randomness and unpredictability in the data.

Self-Organizing Maps (SOM) are a class of artificial neural networks proposed by Teuvo Kohonen [7] that transform the input dataset into a two dimensional lattice, also called a Kohonen Map.

Structure
All the points of the input layer are mapped onto a two dimensional lattice, called the Kohonen Network. Each point in the Kohonen Network is potentially a neuron.

[Figure 1: Kohonen Network. Source: Infosys Research]

Competition of Neurons
Once the Kohonen Network is completed, the neurons of the network compete according to the weights assigned from the input layer.

The function used to declare the winning neuron is the simple Euclidean distance between an input point and the weight assigned to it by each neuron. This function, called the discriminant function, is represented as

d_j(x) = Σ_i (x_i − w_ij)²

where
x = point on the input layer,
w = weight of the input point (x),
i = all the input points,
j = all the neurons on the lattice, and
d = the Euclidean distance.

Simply put, the winning neuron is the one whose weight is closest (in distance on the lattice) to the input point. This process effectively discretizes the output layer.
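A minimal Python sketch of this winner-takes-all step, finding the best matching neuron by the discriminant function above. The tiny weight lattice and input vector are invented for the illustration; a real SOM would train the weights iteratively, which this sketch does not do.

def discriminant(x, w):
    """d_j(x) = sum_i (x_i - w_ij)^2 for one neuron's weight vector w."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def winning_neuron(x, lattice_weights):
    """Return the lattice position whose weight vector is closest to input point x."""
    return min(lattice_weights, key=lambda pos: discriminant(x, lattice_weights[pos]))

# A 2x2 Kohonen lattice over 3-dimensional inputs (weights chosen arbitrarily).
lattice_weights = {
    (0, 0): [0.1, 0.2, 0.9],
    (0, 1): [0.8, 0.7, 0.1],
    (1, 0): [0.4, 0.4, 0.5],
    (1, 1): [0.9, 0.1, 0.3],
}
x = [0.7, 0.6, 0.2]
print("winning neuron:", winning_neuron(x, lattice_weights))   # -> (0, 1)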

Cooperation of Neighboring Neurons
Once the winning neuron is found, the topological structure can be determined. Similar to the behavior of human brain cells (neurons), the winning neuron also excites its neighbors. Thus the topological structure is determined by the cooperative weights of the winning neuron and its neighbors.

Self-Organization
The process of selecting winning neurons and forming the topological structure is adaptive. The process runs multiple times to converge on the best mapping of the given input layer. SOM is better than other clustering algorithms in that it requires very few repetitions to reach a stable structure.

Parallel SOM for Large Datasets
Among all classifying machine learning algorithms, the convergence speed of the SOM has been found to be the fastest [8]. This implies that for large data sets SOM is the most viable model. Since the formation of the topological structuring is independent of the input points, it can easily be parallelized. Carpenter et al. have demonstrated the ability of SOM to work under massively parallel processing [9]. Kohonen himself has shown that even where the input data may not be in vector form, as found in some unstructured data, large scale SOM can be run nonetheless [10].

SOM PLOTS
SOM plots are a two dimensional representation of the topological structure obtained after training the neural nets for a given number of repetitions and with a given radius. The SOM can be visualized as a complete 2-D topological structure [Figure 2].

[Figure 2: SOM Visualization using RapidMiner (AGPL Open Source). Source: Infosys Research]

Figure 2 shows the overall topological structure obtained after dimensionality reduction of a multivariate dataset. While this graph may be useful for outlier detection or general categorization, it is not very useful for analysis of individual variables. Another option for visualizing SOM is to plot different variables in a grid format. One can use the R programming language (GNU Open Source) to plot the SOM results.

[Figure 3: SOM Visualization in R using the package 'kohonen'. Source: Infosys Research]
[Figure 4: SOM Visualization in R using the package 'som'. Source: Infosys Research]

Note on the Running Example
All the plots presented henceforth have been obtained using the R programming language. The dataset used is the SPAM Email Database, which is in the public domain and freely available for research at the UCI Machine Learning Repository. It contains 266,858 word instances from 4,601 spam emails. Emails are a good example of unstructured data. Using the public packages in R, we obtain the SOM plots.

Figure 3 is the plot of the SOM trained result using the package 'kohonen' [11]. This plot gives inter-variable analysis, the variables in this case being four of the most used words in the spam database, viz. 'order', 'credit', 'free' and 'money'. While this plot is better than the topological plot given in Figure 2, it is still difficult to interpret the result in a canonical sense.

Figure 4 is again the SOM plot of the above four most common words in the spam database, but this one uses the package called 'som' [12]. While this plot is numerical and gives the strength of the inter-variable relationship, it does not help in giving us an analytical picture. The information obtained is not actionable.

SPIDER PLOTS OF SOM
As we have seen in Figures 2, 3 and 4, the current visualization of SOM output could be improved for more analytical ability. We introduce a new method to plot SOM output especially designed for large datasets. A sketch of the construction follows the algorithm below.

Algorithm
1. Filter the results of the SOM.
2. Make a polygon with as many sides as there are variables in the input.
3. Make the radius of the polygon the maximum of the values in the dataset.
4. Draw the grid for the polygon.
5. Make segments inside the polygon if the strength of the two variables inside the segment is greater than the specified threshold.
6. Loop step 5 for every variable against every other variable.
7. Color the segments based on the frequency of the variables.
8. Color the line segments based on the threshold of each variable pair plotted.
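The following is a minimal, self-contained sketch of the spider/cobweb construction in the algorithm above. The paper's plots were produced in R; this stand-in uses Python with matplotlib purely to illustrate the idea, and the variable names, pairwise strengths and threshold are invented for the example.

import math
import matplotlib.pyplot as plt

variables = ["order", "credit", "free", "money", "need", "contact"]
# Invented pairwise strengths (e.g., co-occurrence weights taken from a trained SOM).
strength = {("credit", "money"): 0.9, ("free", "credit"): 0.7,
            ("need", "contact"): 0.6, ("order", "money"): 0.3}
threshold = 0.5

n = len(variables)
angles = [2 * math.pi * k / n for k in range(n)]          # one polygon vertex per variable
xy = {v: (math.cos(a), math.sin(a)) for v, a in zip(variables, angles)}

fig, ax = plt.subplots(figsize=(5, 5))
# Polygon grid: outline connecting the variable vertices.
outline = [xy[v] for v in variables] + [xy[variables[0]]]
ax.plot([p[0] for p in outline], [p[1] for p in outline], color="lightgray")

# Draw a segment between two variables only if their strength exceeds the threshold;
# line width encodes the strength (standing in for the color coding in the paper).
for (a, b), s in strength.items():
    if s > threshold:
        (x1, y1), (x2, y2) = xy[a], xy[b]
        ax.plot([x1, x2], [y1, y2], linewidth=1 + 4 * s, color="steelblue")

for v, (x, y) in xy.items():
    ax.text(1.1 * x, 1.1 * y, v, ha="center", va="center")
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("spider_plot.png", dpi=150)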

[Figure 5: SOM Visualization in R using the above algorithm, showing segments, i.e., inter-variable dependency. Source: Infosys Research]
[Figure 6: SOM Visualization in R using the above algorithm, showing threads, i.e., inter-variable strength. Source: Infosys Research]
[Figure 7: Spider plot showing 25 sampled words from the spam database. Source: Infosys Research]

Plots
As we can see, this plot is more meaningful than the SOM visualization plots obtained before. From the figure we can easily deduce that the words 'free' and 'order' do not have a relation similar to that of 'credit' and 'money'. Understandably so, because if a spam email is selling something it will probably contain the word 'order', and conversely, if it is advertising a product or software for 'free' download, it will probably not contain the word 'order'. The high relationship between 'credit' and 'money' signifies spam emails advertising 'credit score' programs and other marketing traps.

Figure 6 shows the relationship of each variable, in this case four popular recurring words in the spam database. The number of threads between one variable and another shows the probability of the second variable given the first variable. The several threads between 'free' and 'credit' suggest that spam emails offering 'free credit' (disguised in other forms through fees or deferred interest) are among the most popular.

Using these spider plots we can analyze several variables at once. This may cause the graph to be messy, but sometimes we need to see the complete picture in order to make canonical decisions about the dataset. From Figure 7 we can see that even though the figure shows 25 variables, it is not as cluttered as a scatter plot or bar chart would be if plotted with 25 variables.

[Figure 8: Uncolored representation of threads for six variables. Source: Infosys Research]

Figure 8 shows the different levels of strength between different variables. While the 'contact' variable is strong with 'need' but not strong enough with 'help', it is no surprise that 'you' and 'need' are strong. The idea here is only to present the visualization technique and not an analysis of the spam dataset; for more on spam filtering and spam analysis, one may refer to several independent works on the subject [13, 14].

ADVANTAGES
There are several visual and non-visual advantages of using this new plot over the existing plots. This plot has been designed to handle Big data. Most of the existing plots mentioned above are limited in their capacity to scale: principally, if the range of the data is large, most existing plots tend to get skewed and important information is lost. By normalizing the data, this new plot prevents that issue. Allowing multiple dimensions to be incorporated also allows for the recognition of indirect relationships.

CONCLUSION
While unstructured data is abundant, free and hidden with information, the tools for analyzing it are still nascent and the cost of converting it to structured form is very high. Machine learning is used to classify unstructured data but comes with issues of speed and space constraints. SOMs are the fastest of these machine learning algorithms, but their visualization powers are limited. We have presented a naturally intuitive method to visualize SOM outputs which facilitates multi-variable analysis and is also highly scalable.

REFERENCES
1. Grimes, S., Unstructured data and the 80 percent rule. Retrieved from http://clarabridge.com/default.aspx?tabid=137.
2. Doan, A., Naughton, J. F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F. and Vuong, B. Q. (2009), Information extraction challenges in managing unstructured data, ACM SIGMOD Record, vol. 37, no. 4, pp. 14-20.
3. Diesner, J., Frantz, T. L. and Carley, K. M. (2005), Communication networks from the Enron email corpus "It's always about the people. Enron is no different", Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 201-228.
4. Chapanond, A., Krishnamoorthy, M. S. and Yener, B. (2005), Graph theoretic and spectral analysis of Enron email data, Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 265-281.
5. Peterson, K., Hohensee, M. and Xia, F. (2011), Email formality in the workplace: A case study on the Enron corpus, Proceedings of the Workshop on Languages in Social Media, pp. 86-95, Association for Computational Linguistics.
6. Buneman, P., Davidson, S., Fernandez, M. and Suciu, D. (1997), Adding structure to unstructured data, Database Theory, ICDT '97, pp. 336-350.
7. Kohonen, T. (1990), The self-organizing map, Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480.
8. Waller, N. G., Kaiser, H. A., Illian, J. B. and Manry, M. (1998), A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms, Psychometrika, vol. 63, no. 1, pp. 5-22.

9. Carpenter, G. A. and Grossberg, S. (1987), A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp. 54-115.
10. Kohonen, T. and Somervuo, P. (2002), How to make large self-organizing maps for non-vectorial data, Neural Networks, vol. 15, no. 8, pp. 945-952.
11. Wehrens, R. and Buydens, L. M. C. (2007), Self- and Super-organizing Maps in R: The Kohonen Package, Journal of Statistical Software, vol. 21, no. 5, pp. 1-19.
12. Yan, J. (2012), Self-Organizing Map (with application in gene clustering) in R. Available at http://cran.r-project.org/web/packages/som/som.pdf.
13. Dasgupta, A., Gurevich, M. and Punera, K. (2011), Enhanced email spam filtering through combining similarity graphs, Proceedings of the fourth ACM international conference on Web search and data mining, pp. 785-794.
14. Cormack, G. V. (2007), Email spam filtering: A systematic review, Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335-455.

Index
Automated Content Discovery 48, 49
Big Data: Analytics 4-8, 19, 24, 40-43, 45, 67; Lifecycle 21; Management; Medical Engine 42-44; Value, also BDV 27, 29
Business Process, also BPM 30
Campaign Management 31, 32
Common Warehouse Meta-Model, also CWM 7
Communication Service Providers, also CSPs 27
Complex Event Processing, also CEP 53-63
Content: Processing Workflows 50; Publishing Lifecycle Management, also CPLM 48; Management System, also CMS 30, 48, 51
Contingency Funding Planning, also CFP 36
Customer: Dynamics 19-21, 25; Relationship, also CRM 28-30
Data Warehouse 4-5, 30, 38-39, 66, 68
Enterprise Service Bus, also ESB 30
Event Driven: Process Automation; Architecture, also EDA 30-31
Experience Personalization 31
Extreme Content Hub, also ECH 47-51
Global Positioning Service, also GPS 10, 13, 17, 54, 56
Information 3, 56-57
Liquidity Risk, also LRM 35-40
Master Data 5-6
Metadata: Discovery 6-7; Extractor 50; Governance 6-7; Management 3-8
Net Interest Income Analysis, also NIIA 37
Offer 32
Order 30
Predictive: Intelligence 19; Modeling 32; Analytics 54
Retention 31, 32
Service Management 31, 33
Supply Chain Planning 9-12, 53
Un-Structured Content Extractor 50
Web Analytics 21

Infosys Labs Briefings BUSINESS INNOVATION through TECHNOLOGY

Editor Praveen B Malla PhD

Editorial Office: Infosys Labs Briefings, B-19, Infosys Ltd. Electronics City, Hosur Road, Bangalore 560100, India Email: [emailprotected] http://www.infosys.com/infosyslabsbriefings

Deputy Editor Yogesh Dandawate Graphics & Web Editor Rakesh Subramanian Chethana M G Vivek Karkera IP Manager K V R S Sarma Marketing Manager Gayatri Hazarika Online Marketing Sanjay Sahay Production Manager Sudarshan Kumar V S Database Manager Ramesh Ramachandran Distribution Managers Santhosh Shenoy Suresh Kumar V H

Infosys Labs Briefings is a journal published by Infosys Labs with the objective of offering fresh perspectives on boardroom business technology. The publication aims at becoming the most sought after source for thought leading, strategic and experiential insights on business technology management. Infosys Labs is an important part of Infosys' commitment to leadership in innovation using technology. Infosys Labs anticipates and assesses the evolution of technology and its impact on businesses, and enables Infosys to constantly synthesize what it learns and catalyze technology enabled business transformation, thus assuming leadership in providing best of breed solutions to clients across the globe. This is achieved through research supported by state-of-the-art labs and collaboration with industry leaders.

About Infosys
Many of the world's most successful organizations rely on Infosys to deliver measurable business value. Infosys provides business consulting, technology, engineering and outsourcing services to help clients in over 32 countries build tomorrow's enterprise.

How to Reach Us: Email: [emailprotected]

For more information about Infosys (NASDAQ:INFY), visit www.infosys.com

Phone: +91 40 44290563 Post: Infosys Labs Briefings, B-19, Infosys Ltd. Electronics City, Hosur Road, Bangalore 560100, India Subscription: [emailprotected] Rights, Permission, Licensing and Reprints: [emailprotected]

© Infosys Limited, 2013 Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained herein or to any derived results obtained by the recipient from the use of the information in this document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.



Since Big data systems are expected to help with the analysis of structured and unstructured data, they are drawing huge investments. Analysts have estimated that enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, large storage and search technologies. Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In the first issue of 2013, we bring you papers that discuss how Big data analytics can make a significant impact on several industry verticals like medical, retail and IT, and how enterprises can harness the value of Big data. As always, do let us know your feedback about the issue. Happy Reading,

Yogesh Dandawate
Deputy Editor
[emailprotected]

BILL PEER is a Principal Technology Architect with Infosys Labs. He can be reached at [emailprotected].
AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys. He can be contacted at [emailprotected].
ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [emailprotected].
GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at [emailprotected].
KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted at [emailprotected].
MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at [emailprotected].
NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at [emailprotected].
NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting and Systems Integration wing of the FSI business unit of Infosys. He can be reached at [emailprotected].
NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at [emailprotected].
PERUMAL BABU is a Senior Technology Architect with the RCL business unit of Infosys. He can be reached at [emailprotected].
PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be contacted at [emailprotected].
PRASANNA RAJARAMAN is a Senior Project Manager with the RCL business unit of Infosys. He can be reached at [emailprotected].
SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys' Retail & Logistics Consulting Group. He can be contacted at [emailprotected].
SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at [emailprotected].
SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice under the Cloud Unit of Infosys. He can be reached at [emailprotected].
ZHONG LI PhD is a Principal Architect with the Consulting and System Integration Unit of Infosys. He can be contacted at [emailprotected].

