Background and Approach
Amidst the hype about Big Data and the capabilities and benefits it promises, the open-source Hadoop software has attracted much attention. Hadoop is now a key component of the analytical ecosystem for many organisations, but for the majority there is still some uncertainty about what it can be used for. This paper summarises research findings on identifiable customer use cases for Hadoop, to highlight its practical deployment, provide an understanding of how it is actually being used, and describe the benefits delivered.
The research examined customer case studies from the three main Hadoop distributors: Cloudera, Hortonworks, and MapR. In total, close to 200 case studies were included, although the focus has been on those with identifiable organisations that have deployed Hadoop – excluding, for example, all anonymised case studies. The remaining 177 reference customers have been analysed across a range of dimensions to understand their industry alignment, the common use cases, and the key benefits that organisations have derived from adopting Hadoop.
The result is an insightful summary for organisations considering whether Hadoop has a place in their architecture. As the analytical technology landscape becomes more complex, it is also harder to distinguish hype and vendor marketing claims from tangible reality. This paper summarises the reality of Hadoop usage – albeit via the case studies highlighted by vendors. As a result there are some cautions in relying solely on these use cases; these are highlighted alongside each area the research covered, together with some concluding observations.
Hadoop has been adopted by a wide range of industries, but there is a reported concentration in a small number of sectors. Media & Entertainment businesses account for almost a third of reported use cases (31%); a large proportion of these are dot-com businesses and organisations that support dot-coms, for example advertising platforms (Comcast, Rubicon), recommendation engines (Beats Music, Quantium) and marketing agencies (Kenshoo, Rapp). Technology is the next most significant sector, with 44 use cases, 24% of the total. These are a mixture of businesses and use cases, including manufacturing log data (HP, Western Digital), device and service usage log data (Symantec, Odyssey), security services (Dell, Terbium), customer experience and personalisation (Nokia, Gravity, Webtrends), and marketing optimisation (Cisco, Rocket Internet).
The Financial Services [n22] and Telco [n16] use cases together represent 21% of customer case studies. Amongst these, the majority are service companies for the sector, rather than end-customer facing. In Financial Services, service companies include Cardlytics (spend analytics) and Experian; in Telco, Razorsight (predictive analytics) and JiWire (location-based advertising). The customer-facing businesses often have limited detail on their actual usage (e.g. Commonwealth Bank, VISA), though some provide useful detail (Barclays, CapitalOne, Rogers, TMobile).
Healthcare is the final significant stand-alone sector, with 8% of use cases [n15]. These include healthcare providers (Mercy, UC Irvine Health), insurers (Zirmed), and clinical studies and health research (Children’s Healthcare, National Cancer Institute).
These five sectors represent 83% of use cases; the remaining sectors have only a handful of use cases each, often too small a sample to make useful generalisations.
The skew of case studies towards these five sectors can be attributed to a range of factors. The tech, media and dot-com sectors have been early adopters of open source and new analytical technologies, so it is not unexpected that documented case studies are biased towards them. In addition, for many companies in these sectors data is at the heart of the business, and promoting what they are doing in analytics is good publicity. For other industry sectors, especially financial services and telcos, a couple of factors lie behind the lower number of use cases. They are not such early adopters of technology as tech and media companies, so they are a little behind the curve. They also tend to be more reticent about publicity: when a technology becomes mainstream, the leaders like to demonstrate their superior capability, but while it is new there is a tendency to keep it low-key – partly to exploit a first-mover position for competitive advantage, and partly to avoid adverse publicity at a time of heightened awareness of data and security threats. There is also a lag between organisations concluding an implementation and vendors getting it written up and published.
Companies in these less represented sectors should therefore not be overly concerned that the number of use cases is lower; instead they should look at the kinds of use cases and drivers. The range of use cases demonstrates that Hadoop can be applicable to a wide range of sectors and organisations.
The second stage of research identified the most commonly occurring usages of Hadoop, drawn from the reference use cases. The numbers behind the 73 use cases identified as having definitive ‘usage types’ are illustrated in the graph on the next page.
In recent years, ETL-offload has often been referenced as a common use case for Hadoop adoption, especially for organisations with an existing data warehouse under pressure on storage capacity, processing capacity, or on completing ETL processing during available windows. The review of use cases identified a number of instances of organisations using Hadoop in this way, with some specifically highlighting this as their initial use case before widening their scope of use. For example Telkomsel (an Indonesian telco) initially adopted Hadoop “to offload extract, transform, and load (ETL) operations from the data warehouse for more cost-effective data processing and faster time to realize insights across its business.”
Similarly BT, in the UK, adopted Hadoop to solve constraints in its overnight batch process around business customer data; it recognised this as “basically a data velocity problem. We need to process that data faster and increase the volume”. BT then moved on to use Hadoop for detailed network analyses, and is now planning to use it in support of maintaining ‘quality of service’ for its TV services. Expedia started with Hadoop by “first just solving the basic problems of ETL; pushing some of the ETL off the enterprise data warehouse”. Expedia, like many organisations, found the ETL challenge just an initial start-point: “the real interesting problems to solve, are the ones where we are taking data and actioning it” – turning that data into meaningful information for decision making.
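In practice, an offloaded ETL job is typically a batch transform-and-aggregate step run on the Hadoop cluster, so that only the summarised result is loaded into the warehouse. The following is a minimal, framework-free Python sketch of that pattern; the record layout and field names are invented for illustration, not taken from any of the case studies:

```python
from collections import defaultdict

# Hypothetical raw usage records, as they might land in HDFS.
# Format (invented): date,customer,service,units
raw_records = [
    "2024-05-01,alice,voice,120",
    "2024-05-01,alice,data,4096",
    "2024-05-01,bob,voice,300",
    "2024-05-02,alice,voice,60",
]

def transform(line):
    """Parse and clean one raw record (the 'T' in ETL)."""
    date, customer, service, units = line.split(",")
    return (date, customer), int(units)

# Aggregate on the cluster; only this small summary would be
# loaded into the data warehouse, relieving its batch window.
summary = defaultdict(int)
for line in raw_records:
    key, units = transform(line)
    summary[key] += units

warehouse_rows = sorted((d, c, u) for (d, c), u in summary.items())
```

At scale the same shape of job would run as MapReduce or a similar distributed framework across the cluster, but the transform-then-aggregate logic is unchanged.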
Other organisations have focussed less on ingest and more on the provision of a long-term archive, to store more data history. Neustar, a telco services company, is a prime example: it “used to capture less than 10% of its network data and retain it for 60 days.” With Hadoop, it now can “capture 100% and retain it for two years.” Similarly, IDEXX Laboratories recognised that its customers, veterinary practices, want “data analyzed in different ways, including more historical data, which means having to retain data longer”. And MediaHub Australia has used Hadoop “to develop an innovative archive system” for its growing content repository.
The most common application for Hadoop is advanced predictive capabilities. Examples include Razorsight, which has used Hadoop to launch new predictive analytics solutions for telecoms, and predictive media specialist Kenshoo, which through Hadoop can “perform new kinds of algorithms, and run more complicated models”. Noble Energy uses Hadoop to “predict and prevent down time in their infrastructure”, and Climate Corporation uses Hadoop for machine learning to predict the weather for agribusiness. These four examples all illustrate how organisations can provide advanced analytical services to other businesses. Hadoop can also provide analytics that support end-users; Beats Music, for example, can “analyse the high volume of data from our users and make music recommendations immediately personalized to them.” Similarly, Hadoop is enabling Webtrends to “provide tailored predictive, behavioral analysis and ROI insights to its customers across many different verticals”.
A wide range of use cases detail the way Hadoop has enabled organisations to undertake more granular behavioural analytics; the Webtrends example highlights that behavioural analytics is often used to support advanced predictive analytics. The two elements are often (though not always) connected.
Cardlytics is “innovating how enterprises can leverage consumer spend behaviour”. Marks & Spencer is using Hadoop to provide “a better understanding of consumer behaviour in a multichannel environment and improved device attribution modelling.” Eastern Bank is using Hadoop to “go beyond what a customer has with the Bank and giving visibility into what a customer does with the Bank through behavioral analytics”.
Many organisations have used Hadoop as a data lake, gathering together a more comprehensive collection of data – including non-relational data, external data and open data – and thereby providing a more comprehensive view of customers, processes, etc. Cardinal Health has used the Hadoop platform to “enrich its existing data with freely available public datasets.” FINRA is using Hadoop to find “evidence of market manipulation by assembling 50 billion market events into a holistic picture of the U.S. securities market every day.” TMobile created a data lake and “was able to quickly glean new customer insights that previously could not be seen from small sets of data.” Harte Hanks has used Hadoop to “enhance the performance, scalability and flexibility of their solutions so clients can more easily and quickly integrate, analyze and store massive quantities of data without impacting performance, … to integrate all kinds of digital data, survey data, reference points and more”.
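The enrichment pattern Cardinal Health describes – joining internal records against freely available public datasets inside the lake – can be sketched in a few lines. All names, fields and values below are hypothetical, chosen only to show the shape of the join:

```python
# Hypothetical internal customer records held in the data lake.
customers = [
    {"id": 1, "name": "Acme Corp", "postcode": "SW1A"},
    {"id": 2, "name": "Globex", "postcode": "M1"},
]

# Hypothetical open dataset, e.g. public regional reference data,
# landed alongside the internal data and keyed by postcode.
open_data = {
    "SW1A": {"region": "London"},
    "M1": {"region": "Manchester"},
}

# Enrich: left-join the internal records with the public dataset
# on the shared key, defaulting when no public record matches.
enriched = [
    {**c, **open_data.get(c["postcode"], {"region": "unknown"})}
    for c in customers
]
```

In a real deployment the same join would run as a distributed job (e.g. Hive or Pig over HDFS files), but the value comes from co-locating internal and external data in one store so joins like this become routine.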
The final areas to highlight are case studies that reference analytics against specific types of data. Analytics against log or machine data, for example, illustrate the use of Hadoop for non-relational datasets. Western Digital captures all manufacturing sensor data in Hadoop (previously it captured just a subset), and this provides “continuous improvement in its manufacturing process, which lowers costs and improves customer satisfaction.” Orbitz travel websites process “millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day”. Storing these logs in Hadoop has enabled applications such as improving hotel search results: Orbitz has looked to “identify consumer preferences in order to determine the best performing hotels to display to users, thus leading to more bookings”. MillennialMedia collects “roughly 3-4 terabytes of log data per day from their ad servers”. This data is used “primarily as an input to the optimization systems for their mobile ad campaigns, where they can place ads for clients and evaluate the effectiveness of those campaigns by analyzing media and site performance as well as consumer behaviors.”
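Log analytics of this kind generally reduces to parsing semi-structured text lines and ranking the aggregates. A hedged sketch of the idea follows; the log format, field names and values are invented, not taken from Orbitz:

```python
import re
from collections import Counter

# Invented search-log lines of the sort a travel site might emit.
log_lines = [
    '2024-05-01T10:00:01 SEARCH hotel="Grand Plaza" city="Paris"',
    '2024-05-01T10:00:02 SEARCH hotel="Grand Plaza" city="Paris"',
    '2024-05-01T10:00:03 SEARCH hotel="Hotel Lux" city="Paris"',
]

# Extract the hotel name from each search event.
pattern = re.compile(r'SEARCH hotel="([^"]+)"')

# Count searches per hotel, then rank – a simple proxy for the
# 'best performing hotels to display to users'.
counts = Counter(
    m.group(1) for line in log_lines if (m := pattern.search(line))
)
top_hotels = counts.most_common(2)
```

At hundreds of gigabytes per day the parse-and-count step is what Hadoop parallelises across the cluster; the per-line logic stays this simple.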
Organisations are also exploring text files. Progressive Insurance has been using Hadoop for “mining through claims notes”. LinkSmart provides content-linking solutions and has deployed a Hadoop environment for text files that “makes sense of the grammar and sentiment of those words”. Patterns and Predictions is “a predictive analytics firm with a core technology that provides unstructured and linguistics driven prediction.”
Terbium offers continuous, proactive monitoring of critical data, and used Hadoop as the foundation of its solution: by “registering fingerprints of companies’ most valuable data and comparing them to ones gathered from across the Internet, Terbium’s Matchlight system can discover and alert companies immediately and automatically if their data appears in unexpected places on the internet, including the dark web.”
These examples illustrate the growing range of applications in which Hadoop is being utilised, but they also show the concentration in some specific areas of analytics. Many organisations that don’t have Hadoop are already doing similar kinds of analytics on existing platforms, especially on relational data types. What Hadoop provides is the capacity to conduct such analytics at greater scale – the next section on benefits expands on this capability.
The final element of the research identified the common benefits highlighted in case studies of organisations that have adopted Hadoop. Amongst the 177 identifiable customers, there were a number that had insufficient detail of benefits to be included in this element of the research. Removing these resulted in 116 specific customer case studies where there was an indication of one or more specific areas of benefit. The summary findings are shown in the chart below.
Although there is some similarity in categories between this section and the ‘use cases’ section (e.g. new types of data), the research included case studies in this ‘benefits’ section only where the published text clearly highlights the element in question as a recognised benefit. For the usage section, use cases were included where the element was described as an objective of the implementation.
The most reported benefit was Scalability, with 65% of case studies specifically highlighting this as a benefit of deploying Hadoop. The next most reported was Speed of Analytics, with 57% of use cases indicating that deploying Hadoop resulted in faster execution of their analytics, either in comparison to alternative choices or, more often, in comparison to previous technologies. An element of caution is needed here: the research is primarily based on vendor-produced case studies, and all vendors like to highlight x-times speed improvements. The problem is that these are not like-for-like comparisons; any ‘new’ technology will be faster than ‘old’ technology, even newer versions of the same hardware or software. As a result the x-times-faster metric is often fairly meaningless and very difficult to compare against anything useful. It is nevertheless insightful that, for many organisations, getting analytics executed faster, and with predictability, is a recurring theme.
Alongside this, 17% of use cases highlighted Speed of Deployment – the way Hadoop has helped them accelerate the deployment of new analytics solutions. This speed-to-implementation, alongside speed of execution, is of key value: creating new analytics faster, then knowing they run consistently faster, is a huge benefit.
Next, and perhaps surprisingly for some, the third (and not first) most reported element was Cost. Hadoop was highlighted by 39% of case studies [significant, but interestingly less than 50%] as providing a significant cost advantage over alternative choices or previous technology. The low cost of Hadoop, influenced by its open-source foundation and further enhanced by its use of commodity hardware, has been a significant driver for organisations looking at options for deploying it – it is interesting that scalability and analytical speed are mentioned ahead of pure cost (and both occur in more than half of use cases).
The next common factor was the ability of Hadoop to accommodate New Types of Data, not typically (or historically) used for analytics. 28% of case studies highlighted that adopting Hadoop specifically enabled them to incorporate new data types into analytics. These included sources such as text files, web logs, and machine and sensor data – data that is often non-relational in format, or of a volume that could not be accommodated in their existing analytical environments.
Two factors equally held the next spot in ranking, each with 26 case studies (22% of total):
Single Data Platform: many use cases highlighted the capability of Hadoop to capture and store all types of data – and therefore to become a single analytical repository for the organisation. This ties to two other factors, the scalability of Hadoop and its capability to bring in new forms of data, including non-relational, and emphasises the fact that Hadoop is a file store, not a database. Interestingly, there were also many case studies that were very specific in highlighting that their Hadoop environment did not (and would not) replace their traditional data warehouse environment, and that the two worked in harmony. This is an area where organisations need to evaluate options and understand the choices, based on their specific standpoint.
Real-Time Analytics: a similar number of case studies highlighted the use of Hadoop to provide analytics in near-real-time, again highlighting the scale, performance and predictability of the software.
The final two elements were:
New Products: the ability of Hadoop to support the provision of new products (or services) by an organisation, or the use of analytics to highlight or identify a new product or requirement. Over one in eight case studies identified this as a specific benefit.
New Analytics: for many, the reality of practical big data (as opposed to hype) is a combination of two factors: the ability to analyse a broader array of data types, including external data and non-relational elements (covered above), and the ability to undertake new types of analytics not previously undertaken – new algorithms, new techniques (e.g. text or sentiment analysis).
The above benefit factors can be looked at from two angles. Firstly, they highlight that adoption of Hadoop is clearly delivering these benefits to organisations, providing evidence that Hadoop has real, tangible value. Secondly, they provide organisations with a checklist of factors to use in assessing the current state of their analytical environments, and to help prioritise what they want to achieve in enhancing their analytical capability.
Is it scalable? Does it provide suitable performance levels? Is it considered to be cost effective? Does it support all required data types, and all required analytical techniques? Understanding which of these provides constraint helps to prioritise approach and evolution.
The above benefit factors were specifically highlighted during the review of use cases; in addition there were a number of recurring factors that were not specifically quantified:
- The Flexibility of Hadoop was commonly referenced. This covers a range of elements, including its ability to store a wide range of data sources, not just relational, and its late-binding nature: data does not need to be loaded into a pre-defined schema; instead the schema can be defined at query time (schema on read, not schema on write). This prioritises speed of data loading over optimised data access, and provides perceived flexibility.
- Ease of deployment was a very common highlight. Hadoop, especially when provided via one of the supported packages, is very easy to deploy; organisations clearly expected the implementation to be longer and more complex than it was. It is a clear sign of how Hadoop is maturing.
- In a similar vein there was widespread comment on the stability of Hadoop distributions, commonly referenced in terms such as availability, reliability, stability and resilience. Many customers also highlighted how little management time it required, and how simple that management was.
- A final recurring theme was comment regarding the security capabilities of Hadoop: the confidence organisations had in its data controls, and its ability to fulfil data protection requirements in addition to defined security requirements.
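The schema-on-read flexibility noted above can be sketched in a few lines: raw records are stored exactly as they arrive, and a schema is applied only when a query runs. The record contents and field names below are hypothetical:

```python
import json

# Raw JSON lines stored as-is in the lake: no schema is imposed
# at load time, and records may carry different fields.
raw = [
    '{"user": "alice", "event": "click", "ts": 1714550400}',
    '{"user": "bob", "event": "view", "ts": 1714550401, "page": "/home"}',
]

def query(lines, fields):
    """Apply a schema at read time: project only the requested fields,
    treating anything absent from a record as null."""
    for line in lines:
        rec = json.loads(line)
        yield tuple(rec.get(f) for f in fields)

# Two different 'schemas' over the same stored data, each defined
# only at query time (schema on read, not schema on write).
clicks = list(query(raw, ["user", "event"]))
pages = list(query(raw, ["user", "page"]))
```

The trade-off is exactly the one described above: loading is trivial because nothing is validated or restructured up front, while each query pays the cost of parsing and projecting at read time.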
This paper highlights the recurring themes of the published case studies for organisations that have adopted Hadoop. It therefore demonstrates which industries are pioneering the use of Hadoop to expand analytical capability, and acts as a guide to other sectors as to the use cases and benefits that specific organisations have reported. It should provide organisations contemplating how to evolve their analytical ecosystem with insight into the factors they should be considering, if Hadoop is an element under consideration.
Kevin Long, Independent Consultant, wide ranging experience in data and analytics.