THE HIDDEN COSTS OF CLUSTERS
Global competition puts tremendous pressure on companies, large and small, to innovate and protect their market share. A new set of world-class, low-cost global competitors is compressing product life cycles. Furthermore, consumers are demanding customized products tailored to their specific tastes and sensibilities. Together with increased product complexity, outsourcing, and new regulations, the demands on new product development have never been higher. This change in the business environment forces companies to be more efficient. Business leaders such as GE, Boeing, Citibank, and Wal-Mart are using high-performance computing to attain new levels of productivity. The problem is that sequential processor performance has flattened out, and computer clusters have actually increased the cost of computing, counter to Moore's law, which has been the basis of economic growth in the computer industry for thirty years.
TOTAL COST OF OWNERSHIP
The commodity-based cluster has dramatically changed the HPC landscape. Clusters have disrupted many sectors of the IT market. Cluster vendors promised "faster, better, cheaper" computing. Marketing numbers from IDC seem to support the perception that clusters are delivering on their promise. However, a closer look reveals that some of the "cheaper" promise is delivered by shifting certain costs from the traditional computer vendor to the customer. Purchasing an HPC cluster is analogous to buying low-cost, self-assembled furniture. The pile of flat-packed boxes that you take home is often a far cry from the professionally assembled model in the showroom. You have saved money, but you will be spending time deciphering instructions, hunting down tools, and undoing missteps before you can enjoy your new furniture. Similarly, setting up a cluster and making it productive uncovers many hidden costs that significantly increase the total cost of ownership.
Cluster purchases are often optimized for price: acquire the most raw hardware for the lowest cost. On paper, such procurements seem impressive; however, integration and associated infrastructure costs often escape accounting. These costs frequently increase the total cost of ownership beyond user expectations and budgets. Another misconception, one that extends far beyond HPC clusters, is the notion that openly available software is free and therefore adds no cost to a cluster. While the initial cost of open software might be non-existent, there is a substantial cost associated with software support and integration. In the case of HPC clusters, these costs are, in essence, the responsibility of the customer. Support and infrastructure costs can range from small to substantial outlays depending on the user’s goal. In general, as the number of users increases, the amount of work that they must shoulder increases.
Hidden costs for a cluster can be broken down into five categories:
- Integration: Because the hardware and software components of a cluster come from many different sources, the user is responsible for integration costs. These costs can be substantial and can create high maintenance costs if care is not taken when components are integrated. If not done carefully, the installation of additional MPI libraries or compilers can require custom scripts and settings that are not portable and that can be lost in upgrade procedures. Also, adding and integrating scheduling, profiling, and debugging tools can be tedious and error-prone.
- Validation and Optimization: The cost to validate and optimize the hardware and software is particularly well hidden from the customer. Since there is no single point of contact for the entire cluster system, the user must verify that everything works as expected. In some cases, integrators run system-wide tests, but the wide array of hardware and software choices pushes the ultimate responsibility for correct operation onto the customer. A change or upgrade can introduce or expose severe bugs in the application software, or fail entirely because the solution was not, or could not be, validated before implementation. Validation takes time and should be performed each time a significant change is made to the system.
- Maintenance: Keeping a cluster running is time- and resource-consuming. Some false comfort can be gained by purchasing a hardware maintenance agreement from the vendor. Although the vendor will repair obvious problems (disk drive and power supply failures, for example), it often must defer to another vendor or software project for non-obvious failures such as poor application performance. In these latter cases, the user must invest the time to identify and assign responsibility to a specific vendor and might need to act as negotiator between two vendors.
- Upgrades: Of course, software changes and upgrades can provide better security, more features, and better performance. In many clusters, the software stack comes from many sources and there is often an unknown dependency tree within the installed software. For instance, upgrading to a new distribution of Linux might require rebuilding MPI libraries and other middleware. User applications might also need to be rebuilt with a third-party optimizing compiler that does not yet support the new distribution upgrade. Administrators and users are then required to determine workarounds or fixes that let the users run the new software. Other packages might suffer a similar fate—resulting in lost time and frustration.
- Infrastructure: In addition to the hidden support costs, commodity-based clusters also place a burden on infrastructure. The power and cooling costs for a cluster are often not factored into the price-to-performance numbers. The average dual-core dual-socket cluster node currently requires around 300 watts of power. Cooling and power delivery inefficiencies can double this node-power requirement to 600 watts. Therefore, on an annual basis, a single cluster node can require roughly 5 megawatt-hours (600 watts x 8,760 hours is about 5.3 megawatt-hours). At a nominal cost of $0.10 per kilowatt-hour, the annual power and cooling cost for a single cluster node is approximately $500. These numbers are more striking when the power and cooling cost of the entire cluster is taken into account. For example, the power and cooling budget for a typical 256-node cluster would add up to more than $128,000 per year. Although costs can vary due to market conditions and location, the above analysis illustrates that the three-year power cost can easily reach 30–50% of the hardware purchase price for a typical commodity cluster. Other infrastructure issues can affect cost as well. A typical industrial rack-mount chassis can hold 42 cluster nodes. An average cluster node weighs approximately 45 pounds. Thus, each rack requires a surface capable of supporting about 2,000 pounds (42 x 45 = 1,890) in the space of a single rack-mount enclosure. In a typical data center, rack-mount hardware is a mix of storage devices and servers with many under-populated racks. HPC clusters, on the other hand, represent the densest and heaviest load in the data center. In a 128-node example, the cluster would require support for roughly 6,000 pounds (128 x 45 = 5,760) in a 4 x 8 foot area.
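The power and floor-load arithmetic in the infrastructure item can be reproduced with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the default values (300 watts per node, a 2x overhead factor for cooling and power delivery, $0.10 per kilowatt-hour, 42 nodes per rack, 45 pounds per node) are the nominal figures quoted in the text, not measurements. It also assumes nodes draw full power around the clock for a 365-day year.

```python
# Back-of-the-envelope infrastructure estimates using the nominal
# figures from the text. All function names are illustrative.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def annual_energy_kwh(node_watts=300.0, overhead_factor=2.0):
    """Annual energy for one node in kWh, including cooling and
    power-delivery overhead (assumed to double the node draw)."""
    return node_watts * overhead_factor * HOURS_PER_YEAR / 1000.0


def annual_power_cost(nodes, node_watts=300.0, overhead_factor=2.0,
                      usd_per_kwh=0.10):
    """Annual power and cooling cost in dollars for a cluster of `nodes`."""
    return nodes * annual_energy_kwh(node_watts, overhead_factor) * usd_per_kwh


def rack_load_lb(nodes_per_rack=42, node_weight_lb=45.0):
    """Weight in pounds of the nodes in one fully populated rack
    (the enclosure itself is not counted)."""
    return nodes_per_rack * node_weight_lb


def cluster_weight_lb(nodes, node_weight_lb=45.0):
    """Total node weight in pounds for a cluster of `nodes`."""
    return nodes * node_weight_lb


if __name__ == "__main__":
    print(f"Energy per node:   {annual_energy_kwh():,.0f} kWh/year")
    print(f"Cost per node:     ${annual_power_cost(1):,.0f}/year")
    print(f"Cost, 256 nodes:   ${annual_power_cost(256):,.0f}/year")
    print(f"Load per rack:     {rack_load_lb():,.0f} lb")
    print(f"Weight, 128 nodes: {cluster_weight_lb(128):,.0f} lb")
```

Without rounding, the per-node energy comes to 5,256 kilowatt-hours per year, so the per-node and 256-node costs land slightly above the rounded $500 and $128,000 figures. Changing the electricity rate or the overhead factor shows how sensitive the annual budget is to local conditions.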
The hidden costs described earlier must be resolved before any real production computing can begin. Instead of a domain expert running code on a high-performance computer with a well-defined software and hardware environment, the administrator must understand the details previously handled by the vendor. The initial cost of clusters is lower because the cost of engineering and integration has shifted from the vendor to the user. The actual cost of clusters is higher, a direct consequence of the time invested in software and cluster management and integration.