The concept of big data has been around for years; most organizations now understand that if they capture all the data that streams into their businesses, they can apply analytics and get significant value from it. But even in the 1950s, decades before anyone uttered the term “big data,” businesses were using basic analytics (essentially numbers in a spreadsheet that were manually examined) to uncover insights and trends.
The new benefits that big data analytics brings to the table, however, are speed and efficiency. Whereas a few years ago a business would have gathered information, run analytics and unearthed information that could be used for future decisions, today that business can identify insights for immediate decisions. The ability to work faster – and stay agile – gives organizations a competitive edge they didn’t have before.
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data – plus they can identify more efficient ways of doing business.
Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses are able to analyze information immediately – and make decisions based on what they’ve learned.
New products and services. With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers’ needs.
No single business trend in the last decade has had as much potential impact on incumbent IT investments as big data. Indeed, big data promises (or threatens, depending on how you view it) to upend legacy technologies at many big companies. As IT modernization initiatives gain traction and the accompanying cost savings hit the bottom line, executives in both line-of-business and IT organizations are getting serious about the technology solutions that are tied to big data.
Companies are not only replacing legacy technologies in favour of open source solutions like Apache Hadoop, they are also replacing proprietary hardware with commodity hardware, custom-written applications with packaged solutions, and decades-old business intelligence tools with data visualization. This new combination of big data platforms, projects, and tools is driving new business innovations, from faster product time-to-market to an authoritative—finally!—single view of the customer to custom-packaged product bundles and beyond.
Big Data Stack
As with all strategic technology trends, big data introduces highly specialized features that set it apart from legacy systems. Each component of the stack is optimized around the large, unstructured and semi-structured nature of big data. Working together, these moving parts comprise a holistic solution that’s fine-tuned for specialized, high-performance processing and storage.
Storing large and diverse amounts of data on disk is becoming more cost-effective as the disk technologies become more commoditized and efficient. Companies like EMC sell storage solutions that allow disks to be added quickly and cheaply, thereby scaling storage in lock step with growing data volumes. Indeed, many big company executives see Hadoop as a low-cost alternative for the archival and quick retrieval of large amounts of historical data.
The big data “platform” is typically the collection of functions that make up high-performance processing of big data. The platform includes capabilities to integrate, manage, and apply sophisticated computational processing to the data. Typically, big data platforms include a Hadoop (or similar open-source project) foundation. Hadoop was designed and built to optimize complex manipulation of large amounts of data while vastly exceeding the price/performance of traditional databases. Hadoop is a unified storage and processing environment that is highly scalable to large and complex data volumes. You can think of it as big data’s execution engine.
In the new world of big data, open source projects like Hadoop have become the de facto processing platform for big data. Indeed, the rise of big data technologies has meant that the conversation around analytics solutions has fundamentally changed. Companies unencumbered with legacy data warehouses can now leverage a single Hadoop platform to segregate complex workloads and support a variety of usage scenarios from complex mathematical computations to ad hoc visualizations.
The expanse of big data is as broad and complex as the applications for it. Big data can mean human genome sequences, oil well sensors, cancer cell behaviors, locations of products on pallets, social media interactions, or patient vital signs, to name a few examples. The data layer in the stack implies that data is a separate asset, warranting discrete management and governance.
A survey of data management professionals found that of the 339 companies responding, 71 percent admitted that they “have yet to begin planning” their big data strategies. The respondents cited concerns about data quality, reconciliation, timeliness, and security as significant barriers to big data adoption.
Application Code, Functions, and Services:
Just as big data varies with the business application, the code used to manipulate and process the data can vary. Hadoop uses a processing engine called MapReduce to not only distribute data across the disks, but to apply complex computational instructions to that data. In keeping with the high-performance capabilities of the platform, MapReduce instructions are processed in parallel across various nodes on the big data platform, and then quickly assembled to provide a new data structure or answer set.
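The map, group and reduce flow described above can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop's own API: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, a thread pool stands in for the nodes of a cluster, and the word-count task is an assumed example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(mapped_chunks):
    """Group every emitted value under its key, between the two phases."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records, workers=4):
    # Map step: records are independent, so they can be processed in parallel.
    # The thread pool is a stand-in for separate nodes (an assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, records))
    grouped = shuffle(mapped)
    # Reduce step: one call per key assembles the final answer set.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


records = ["big data big value", "data drives value"]
word_counts = run_job(records)
print(word_counts)
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread across as many workers as are available, which is the property the paragraph above relies on.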
An example of a big data application in Hadoop might be to “calculate all the customers who like us on social media.” A text mining application might crunch through social media transactions, searching for words such as “fan,” “love,” “bought,” or “awesome” and consolidate a list of key influencer customers.
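A toy version of that text mining step might look like the following. It is a sketch under stated assumptions: the post format (customer, text), the find_influencers function and the min_mentions threshold are all hypothetical, and a real application would need far more careful language handling than simple word matching.

```python
import re
from collections import Counter

# Words taken from the example above; treated here as signs of a positive mention.
POSITIVE_WORDS = {"fan", "love", "bought", "awesome"}


def find_influencers(posts, min_mentions=2):
    """Return customers whose posts contain at least min_mentions positive words."""
    mentions = Counter()
    for customer, text in posts:
        words = re.findall(r"[a-z']+", text.lower())
        hits = sum(1 for word in words if word in POSITIVE_WORDS)
        if hits:
            mentions[customer] += hits
    return sorted(name for name, count in mentions.items() if count >= min_mentions)


# Hypothetical sample data for illustration only.
posts = [
    ("alice", "Huge fan of this brand, just bought another one!"),
    ("bob", "Delivery was late again."),
    ("carol", "Awesome service."),
    ("alice", "Love it."),
]
influencers = find_influencers(posts)
print(influencers)
```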
Depending on the big data application, additional processing via MapReduce or custom Java code might be used to construct an intermediate data structure, such as a statistical model, a flat file, a relational table, or a cube. The resulting structure may be intended for additional analysis, or to be queried by a traditional SQL-based query tool. This business view ensures that big data is more consumable by the tools and the knowledge workers that already exist in an organization.
One Hadoop project called “Hive” enables raw data to be re-structured into relational tables that can be accessed via SQL and incumbent SQL-based toolsets, capitalizing on the skills that a company may already have in-house.
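The idea of giving raw data a relational shape so that ordinary SQL can query it can be illustrated without a Hadoop cluster. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; it is not Hive, and the pipe-delimited log format, the events table and its column names are assumptions made for the example.

```python
import sqlite3

# Hypothetical raw, delimited log lines: date|customer|action|amount
raw_lines = [
    "2016-09-01|alice|purchase|19.99",
    "2016-09-01|bob|view|0",
    "2016-09-02|alice|purchase|5.01",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_date TEXT, customer TEXT, action TEXT, amount REAL)"
)

# Impose the relational structure on the raw text.
rows = []
for line in raw_lines:
    event_date, customer, action, amount = line.split("|")
    rows.append((event_date, customer, action, float(amount)))
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Once the data is tabular, a familiar SQL query answers the business question.
totals = conn.execute(
    "SELECT customer, ROUND(SUM(amount), 2) "
    "FROM events WHERE action = 'purchase' "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
conn.close()
```

The point of the design is the last step: anyone who already knows SQL can work with the data without learning a new processing model.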
Presentation and Consumption:
One of the more profound developments in the world of big data is the adoption of so-called data visualization. Unlike the specialized business intelligence technologies and unwieldy spreadsheets of yesterday, data visualization tools allow the average business person to view information in an intuitive, graphical way.
For instance, if a wireless carrier wants better information on its network, looking in particular for patterns in dropped calls, it could assemble a complicated spreadsheet full of different columns and figures. Alternatively, it could deploy an easy-to-consume, graphical trend report to its field service staff.
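As a rough illustration of the dropped-call example, the sketch below reduces individual call records to a daily dropped-call rate and prints it as a simple text bar chart. The record layout (day, dropped flag), the function names and the sample figures are all hypothetical; a real trend report would come from a visualization tool rather than printed characters.

```python
from collections import defaultdict


def dropped_call_rates(call_records):
    """Return the share of dropped calls per day, ordered by date."""
    totals = defaultdict(lambda: [0, 0])  # day -> [dropped calls, all calls]
    for day, dropped in call_records:
        totals[day][1] += 1
        if dropped:
            totals[day][0] += 1
    return {day: d / n for day, (d, n) in sorted(totals.items())}


def text_chart(rates, width=20):
    """Render one bar per day; a full-width bar means every call was dropped."""
    lines = []
    for day, rate in rates.items():
        bar = "#" * round(rate * width)
        lines.append(f"{day} {bar} {rate:.0%}")
    return "\n".join(lines)


# Hypothetical call records: (ISO date, was the call dropped?)
calls = [
    ("2016-09-01", False), ("2016-09-01", True),
    ("2016-09-01", False), ("2016-09-01", False),
    ("2016-09-02", True), ("2016-09-02", True),
    ("2016-09-02", False), ("2016-09-02", False),
]
rates = dropped_call_rates(calls)
chart = text_chart(rates)
print(chart)
```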
Data visualizations, although normally highly appealing to managerial users, are more difficult to create when the primary output is a multivariate predictive model; humans have difficulty understanding visualizations in more than two dimensions. Some data visualization tools now select the most appropriate visual display for the type of data and number of variables. In any case, if the primary output of a big data analysis is an automated decision, there is no need for visualization.
Changing the Vocabulary
The brave new world of big data not only upends the traditional technology stack when it comes to high-performance analytics, it challenges traditional ways of using and accessing data. At the very least, it starts to change the company’s vocabulary about information and its role in analytics.
Big Data’s Value Proposition
As we engaged big company executives in conversations about big data for this report, they all agreed that they considered big data to be an evolutionary set of capabilities that would have new and sometimes unanticipated uses over time. But every one of these executives conceded that they couldn’t afford to make big data a mere academic exercise. It needed to drive value, and sooner rather than later.
Very few companies have taken the steps to rigorously quantify the return on investment for their big data efforts. The reality is that the proof points about big data often transcend hard dollars represented by cost savings or revenue generation. This suggests that senior management is betting on big data for the long-term, a point corroborated by several of the executives we interviewed.
But initial comparisons of big data ROI are more than promising. In 2011, Wikibon, an open-source knowledge-sharing community, published a case study that compared the financial return of two analytics environments. The first environment was a high-speed data warehouse appliance employing traditional ETL and data provisioning processes. The second environment was a big data deployment running on a newer technology with massively parallel processing (MPP) hardware.
Beyond Customers: Big Data’s Infinite Promise
Go to any industry conference or read a vendor brochure, and it seems as if everyone’s talking about big data driving new customer insights. Yes, you can troll through years’ worth of customer data and quickly see which high-value customers have a propensity to churn. You can calculate lifetime customer value. You can look at purchase patterns to see what a business customer might buy next. And you can develop micro-segments and corresponding microsites for key customer clusters, and communicate with them in increasingly relevant ways.
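To make one of these calculations concrete, the sketch below estimates customer lifetime value with a deliberately simple formula: average order value times orders per year times expected years as a customer. This is only one common simplification, chosen here as an assumption; it ignores margins, discounting and churn probability, and the customer names and figures are invented for illustration.

```python
def lifetime_value(avg_order_value, orders_per_year, expected_years):
    """Simplified lifetime value: revenue per order x frequency x duration."""
    return avg_order_value * orders_per_year * expected_years


# Hypothetical customers and figures.
customers = {
    "acme": {"avg_order_value": 200.0, "orders_per_year": 6, "expected_years": 5},
    "globex": {"avg_order_value": 50.0, "orders_per_year": 12, "expected_years": 2},
}

values = {name: lifetime_value(**profile) for name, profile in customers.items()}

# Rank customers from most to least valuable under this simple model.
ranked = sorted(values, key=values.get, reverse=True)
print(values, ranked)
```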
Strictly speaking, you don’t need big data for any of this. This is customer analytics—sometimes called “analytical CRM,” sometimes called “business intelligence.” And companies across industries and market segments have been doing it long before big data was a buzzword and statisticians were sexy.
It’s the consumer’s “digital footprint” from online purchases, in-store kiosk interactions, ATM transactions, and social media commentary that accounts for part of the big data explosion. These interactions make behavioural analytics and targeting that much richer, and more interesting to companies, their advertisers, and third-party data providers.
The consumer who searches the web for camping gear, top-of-the-line fly fishing rods, and family vacation packages may be a better candidate for a zero-interest loan on a four-wheel drive truck than the shopper comparing parkas with faux-fur trim. But depending on other interactions or interests revealed through richer behaviour and preference data, either might be a good candidate for an eco-friendly volunteer vacation. The examples of the power of big data analytics to drive customer loyalty are endless.
The Four V’s of Big Data
‘Big Data’ is the emerging discipline of capturing, storing, processing, analysing and visualising huge quantities of information. The data sets may start at a few terabytes and run to many petabytes – far more than traditional data analysis packages can handle. In 2012 Gartner defined it as, ‘high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.’ This ‘3V’ classification has been built on since (particularly with the addition of veracity), such that Big Data is often described in terms of the following characteristics:
Volume. Terabytes or petabytes of data are analysed. An estimated 2.5 quintillion bytes of data (2.5 billion gigabytes) are created every day, an amount which will only rise in the future. However, the size of the dataset is not the only variable that characterises Big Data.
Variety. The dataset may contain many different forms of data – not simply a large amount of the same type. The profusion of different kinds of mobile device and the variety of content consumed on them on a wide range of platforms, for example, means that companies can harvest data from an enormous array of sources, each telling them a different part of the same picture.
Velocity. Data may change on a constant basis. For example, modern cars may have 100 or so different sensors that continually monitor different aspects of performance. Markets change on a moment-to-moment scale. Data is highly fluid, and snapshots are not always enough.
Veracity. The data acquired may not all be accurate, or much of it may be uncertain or provisional in nature. Data quality is unreliable, especially when there is so much of it. Any system of analysis must take this into account.
In addition to the 4V characteristics, there are also two others to deal with:
Variability. Data capture and volume may be inconsistent, not just inaccurate, so varying quantities and qualities of data will be acquired at different times.
Complexity. Together, these factors mean that managing the data can be an extremely complex process, since there are many data sources with differing types and formats of data, but these need to be correlated and made sense of if they are to be useful.
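One practical consequence of the veracity and variability points is that an analysis usually needs to measure data quality before trusting the data. The sketch below is a minimal, hypothetical check on a batch of sensor readings: the quality_report function, the plausible range and the sample values are all assumptions for illustration.

```python
def quality_report(readings, low=-40.0, high=125.0):
    """Count valid, missing and out-of-range values in one batch of readings."""
    missing = sum(1 for value in readings if value is None)
    out_of_range = sum(
        1 for value in readings
        if value is not None and not (low <= value <= high)
    )
    valid = len(readings) - missing - out_of_range
    return {"valid": valid, "missing": missing, "out_of_range": out_of_range}


# Hypothetical temperature readings; None marks a reading that never arrived.
batch = [21.5, None, 22.0, 999.0, 20.8]
report = quality_report(batch)
print(report)
```

Running the same check on each batch also exposes variability, since the counts can be compared from one capture period to the next.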
Big Data Companies
Due to the nature of Big Data, specialist companies have grown up around it in order to manage the volumes and complexity of information involved.
IBM Big Data Analytics
Like many other big data companies, IBM builds its offerings on Hadoop – so it’s fast, affordable and open source. It allows businesses to capture, manage and analyse structured and unstructured data with its BigInsights product. This is also available on the cloud (BigInsights on Cloud) to give the benefits of outsourcing storage and processing, providing Hadoop as a service. InfoSphere Streams is designed to enable capture and analysis of data in real time for Internet-of-Things applications. IBM’s analytics enable powerful collating and visualisation of data with excellent flexibility for storage and management. You can also find plenty of downloadable documentation and white papers on their site.
HP Big Data
Another well-known name in IT, HP brings a wealth of experience to big data. As well as offering their own platform, they run workshops to assess organisations’ needs. Then, ‘when you’re ready to transform your infrastructure, HP can help you develop an IT architecture that provides the capacity to manage the volume, velocity, variety, voracity, and value of your data.’ The platform itself is based on Hadoop. HP look to add value beyond providing the software alone, and will consult with you to craft a strategy for making the most of the big data you collect, and for doing so as efficiently as possible.
Microsoft
Microsoft’s big data solutions run on Hadoop and can be used either in the cloud or natively on Windows. Business users can use Hadoop to gain insights into their data using standard tools including Excel or Office 365. It can be integrated with core databases to analyse both structured and unstructured data and create sophisticated 3D visualisations. PolyBase is incorporated so users can then easily query and combine relational and non-relational data with the same techniques required for SQL Server. Microsoft’s solution enables you to analyse Hadoop data from within Excel, adding new functionality to a familiar software package.
Intel Big Data
Recognising that making the most of big data means changing your information architecture, Intel takes the approach of enabling enterprises to create a more flexible, open and distributed environment, whilst their big data platform is based on Apache’s Hadoop. They take a thorough approach that does not assume they know what your needs are, but presents a walkthrough to determine how best to help achieve your objectives. Intel’s own industry-standard hardware is at your disposal to optimise the performance of your big data project, offering speed, scalability and a cost-effective approach according to your organisation’s requirements.
Amazon Web Services
Amazon is a huge name in providing web hosting and other services, and the benefits of using them are unparalleled economies of scale and uptime. Amazon tend to offer a basic framework for customers to use, without providing much in the way of customer support. This means they are the ideal choice if you know exactly what you are doing and want to save money. Amazon supports products like Hadoop, Pig, Hive and Spark, enabling you to build your own solution on their platform and create your own big data stack. There are plenty of tutorials, video demos and guides to get you started as quickly and easily as possible.
Dell Big Data Analytics
Another well-known and globally established company, this time in the hardware space, Dell offers its own big data package. Their solution includes an automated facility to load and continuously replicate changes from an Oracle database to a Hadoop cluster to support big data analytics projects, thereby simplifying Oracle and Hadoop data integration. Data can be integrated in near real-time, from a wide range of data stores and applications, and from both on- and off-premises sources. Techniques such as natural language processing, machine learning and sentiment analysis are made accessible through straightforward search and powerful visualisation to enable users to learn relationships between different data streams and leverage these for their businesses.
Google BigQuery
Google is the big daddy of internet search: the outright market leader with the vast majority of search traffic to its name. No other search engine comes close, so perhaps it’s not surprising that Google should offer an analytics package to crunch through the phenomenal amount of data it produces in the course of its day-to-day work for millions of businesses around the world. It already hosts the hugely popular Google Analytics, but BigQuery is designed for a different order of magnitude of data. It puts Google’s impressive infrastructure at your disposal, allowing you to analyse massive datasets in the cloud with fast, SQL-like queries – analysing multi-terabyte datasets in just seconds. Being Google, it’s also very scalable and straightforward to use.
Big data isn’t just an emerging phenomenon. It’s already here and being used by major companies to drive their business forwards. Traditional analytics packages simply aren’t capable of dealing with the quantity, variety and changeability of data that can now be harvested from diverse sources – machine sensors, text documents, structured and unstructured data, social media and more. When these are combined and analysed as a whole, new patterns emerge. The right big data package will allow enterprises to track these trends in real time, spotting them as they occur and enabling businesses to leverage the insights provided.
However, not all big data platforms and software are alike. As ever, which you decide on will depend on a number of factors. These include not just the nature of the data you are working with, but organisational budgets, infrastructure and the skillset of your team, amongst other things. Some solutions are designed to be used off-the-peg, providing powerful visualisations and connecting easily to your data stores. Others are intended to be more flexible but should only be used by those with coding expertise. You should also think about the future, and the long-term implications of being tied to your platform of choice – particularly in terms of open-source vs proprietary software.