Big Data has taken off as an increasingly mainstream technology, with adoption across verticals. Organizations are using analytics to mine nuggets of information from the vast amounts of unstructured data they have amassed. Most of these implementations run on physical hardware; does it make sense to move workloads this large into a private cloud?
The idea of moving resource-intensive analytics workloads into a Cloud-based environment would be unacceptable to purists. The main objections center on the following issues:
- Resources consumed by the Hypervisor
- Interference from other Cloud-based workloads
- Loss of control for the Big Data Administrator
Before addressing these issues, let us look at the merits of hosting workloads in the Cloud. We will use Hadoop as the reference, since it is the dominant platform for Big Data workloads. A recent talk by Richard McDougall (CTO, Storage and Application Services at VMware) addressed the use of Big Data workloads within Cloud environments. Some of the benefits of moving such workloads into a Cloud are:
Moving from a physical to a virtual environment greatly reduces the Time-to-Deploy for servers, a benefit these workloads can leverage as well: a virtual node can be provisioned in minutes, whereas deploying another physical workload can take hours to days, even with stored profiles and automated deployment. Granting additional resources to nodes is also quite simple within the cloud. For example, if Datanode7 needs 2 more cores or 4GB of additional memory (and Datanode3 has surplus resources), the change is a quick reallocation; how easy is this with physical servers?
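The reallocation scenario can be sketched in a few lines. This is an illustrative model only; the node names, sizes, and the `reallocate` helper are assumptions for this example, and in a real cloud the move would be an API call to the virtualization manager rather than a dictionary update.

```python
# Hypothetical sketch: moving surplus cores/memory from one virtual
# datanode to another. Node inventory and helper are illustrative only.

def reallocate(nodes, donor, recipient, cores=0, memory_gb=0):
    """Shift spare capacity from donor to recipient, in place."""
    if nodes[donor]["cores"] - nodes[donor]["used_cores"] < cores:
        raise ValueError("donor lacks surplus cores")
    if nodes[donor]["memory_gb"] - nodes[donor]["used_memory_gb"] < memory_gb:
        raise ValueError("donor lacks surplus memory")
    nodes[donor]["cores"] -= cores
    nodes[donor]["memory_gb"] -= memory_gb
    nodes[recipient]["cores"] += cores
    nodes[recipient]["memory_gb"] += memory_gb

nodes = {
    "Datanode3": {"cores": 8, "used_cores": 2, "memory_gb": 32, "used_memory_gb": 12},
    "Datanode7": {"cores": 4, "used_cores": 4, "memory_gb": 16, "used_memory_gb": 16},
}
reallocate(nodes, donor="Datanode3", recipient="Datanode7", cores=2, memory_gb=4)
print(nodes["Datanode7"])  # Datanode7 now has 6 cores and 20GB
```

The point is not the code but the contrast: in a virtual environment this is a policy decision executed in seconds, while on physical servers it means procurement, racking, or downtime.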
Hadoop combines compute and storage on the data node, which scales I/O throughput but also limits elasticity. Separating compute resources from storage, which is possible in a virtual environment, enables compute elasticity: compute resources can be allocated as needed to optimize performance. In addition, each workload can be scheduled to receive greater resources during specified time windows.
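Time-based scheduling of compute shares can be sketched as below. The window boundaries and share values are assumptions for illustration, not Hadoop or hypervisor configuration.

```python
# Illustrative sketch: a workload receives a larger compute share
# during its reserved window (e.g., a nightly ETL job), and a
# baseline share otherwise. All numbers are assumed for the example.
from datetime import time

def compute_share(now, schedule, default_share=0.25):
    """Return the compute share in effect for a workload at time `now`."""
    for start, end, share in schedule:
        if start <= now < end:
            return share
    return default_share

etl_schedule = [(time(0, 0), time(6, 0), 0.75)]  # 75% of the pool overnight
assert compute_share(time(2, 30), etl_schedule) == 0.75   # inside the window
assert compute_share(time(14, 0), etl_schedule) == 0.25   # baseline share
```

In practice the share would map to hypervisor resource-pool limits or reservations; the sketch only captures the scheduling logic the paragraph describes.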
Compute resources are not shared within physical environments, so unused resources (CPU cycles, memory, etc.) are wasted. Sharing these resources within a Cloud, on the other hand, enables true multi-tenancy and permits mixed workloads as well. This drives up utilization of host resources, as seen in the diagram below. Another benefit is reduced Hadoop cluster sprawl, which arises when a single-purpose cluster is deployed for each workload.
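The utilization argument can be made concrete with back-of-the-envelope arithmetic. The cluster sizes and utilization figures below are illustrative assumptions, not benchmarks.

```python
# Illustrative arithmetic: three single-purpose clusters vs one shared
# pool doing the same work. (hosts, average utilization) are assumed.
clusters = {"etl": (10, 0.30), "reporting": (10, 0.20), "adhoc": (10, 0.15)}

used = sum(hosts * util for hosts, util in clusters.values())  # busy-host equivalents
total = sum(hosts for hosts, _ in clusters.values())
print(f"dedicated clusters: {used / total:.0%} average utilization")   # ~22%

shared_hosts = 12  # same work consolidated onto a shared pool, with headroom
print(f"shared pool: {used / shared_hosts:.0%} average utilization")   # ~54%
```

Even with generous headroom, the shared pool runs far fewer hosts at much higher utilization, which is the sprawl-reduction benefit in numbers.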
A Cloud-based environment already has a number of policies and templates to manage access. These can quickly be applied to a workload, whereas in a physical cluster this is a manual process. Using an existing cluster for another workload during specific hours is difficult and comes with security risks. Tasks such as making a copy of the Production dataset for a Development workload come with their own perils, which are greatly magnified when datasets are shared with partners or external companies in a PCI- or HIPAA-regulated environment.
Ultimately, organizations deploy Big Data workloads to derive insights. The lower the costs, and the sooner these insights are achieved, the higher the ROI. Agility and elasticity increase performance, while increased utilization from multi-tenancy and reduced sprawl drives costs down.
Let us return to the objections mentioned earlier. The Hypervisor might consume 2-3% of CPU cycles, and far smaller fractions of other resources; in return, it recovers the much larger share of resources wasted in single-purpose clusters. Interference from other workloads can be minimized through policy-based resource allocation. Moving to a Cloud does not diminish a Hadoop administrator's role; on the contrary, it increases the administrator's speed and effectiveness in setting up and managing workloads. On balance, the case for running Big Data workloads on Private Clouds is quite compelling.
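To put the overhead objection in perspective, the trade-off reduces to simple arithmetic. The 2-3% overhead figure comes from the discussion above; the dedicated-cluster utilization is an illustrative assumption, not a measurement.

```python
# Sketch of the trade-off: a few percent of CPU paid to the hypervisor
# vs the idle cycles reclaimed from a dedicated cluster. Utilization
# figure is an assumed example; overhead is the 2-3% cited above.
hypervisor_overhead = 0.03      # upper end of the cited hypervisor cost
dedicated_utilization = 0.25    # assumed utilization of a single-purpose cluster

wasted_physical = 1.0 - dedicated_utilization          # idle cycles on bare metal
reclaimable = wasted_physical - hypervisor_overhead    # capacity other tenants can use
print(f"overhead paid: {hypervisor_overhead:.0%}, "
      f"capacity reclaimed for other tenants: {reclaimable:.0%}")
```

Under these assumptions, a 3% overhead buys back roughly 72% of the machine, which is the asymmetry the argument rests on.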