Software Helping Software

By Matthew Dublin

Researchers at the Karlsruhe Institute of Technology have developed a way to cut down on the trial-and-error process of developing scientific software.

Developers at KIT are now using the PALLADIO simulation software to analyze their code and discover problems early on, instead of wasting time, money, and energy testing flawed code on a local cluster or in the cloud.

Named after the Renaissance-era architect Andrea Palladio, the PALLADIO simulation software analyzes a software architecture to assess non-functional properties such as performance, reliability, maintainability, and costs. The analysis also evaluates the workflows in components and subcomponents, and it reveals the scalability, resource use, and distribution aspects of the software, so that the complete layout is checked before "building" starts.

"In the beginning was our observation that software developers apply a trial-and-error process. This is a rather inefficient method to produce error-free software," says Ralf Reussner, chair of Software Design and Quality at the Karlsruhe Institute of Technology (KIT), Germany. "If you want to build a bridge, you do not simply place a stone on top of a stone, let a truck drive across, and hope that the bridge will survive the load."

At the moment, Reussner and his collaborators are preparing PALLADIO for simulating the integration of existing software into the cloud.

PALLADIO is available through a tech-transfer effort. It began in 2003 as a research project at the University of Oldenburg and is now a tool-supported software architecture simulation approach that has been successfully applied in industry and science. It is actively developed by the Karlsruhe Institute of Technology (KIT), the FZI Research Center for Information Technology, and the University of Paderborn.

Google's GDrive Release Imminent

By Matthew Dublin

Thanks to a quick-thinking social media consultant with a camera phone, the cat was officially out of the bag last September regarding Google's mythical "GDrive" (Google Drive), now allegedly just called "Drive."

Johannes Wigand took a picture of what was assumed to be Drive (codenamed "Platypus") at a Google-sponsored event and posted it on his blog. It turned out that the picture was indeed of Drive and Google employees have been using the virtual disk drive storage service internally for a few years now. Actually, people have been murmuring about Drive's existence since 2006, so it appears that Google really wanted to make sure they got the kinks out before competing with the likes of Dropbox in this space.

The rumor is that Google is getting ready to roll out Drive after a prolonged period in beta mode behind Google's wall — apparently Drive was very buggy until recently. Right now, folks are speculating that this new storage solution will be offered in the same way that Gmail account holders are offered a certain ever-increasing amount of storage space for their email and docs, but with the option to buy more as needed.

The Future of Supercomputing Software Libraries

By Matthew Dublin

In this video, D.K. Panda from Ohio State University presents his talk on the future of supercomputer software libraries.

The video was recorded on Feb 7th at the recent HPC Advisory Council Israel Supercomputing Conference.

In the video, Panda discusses the emergence of exaflop-scale computing, trends in commodity computing clusters, and the challenges associated with scaling software to billions of processors to meet the demands of exascale computing.

DOE Study Says Clouds Can't Replace Supercomputers

By Matthew Dublin

The verdict is in: cloud computing should not replace supercomputers for scientific research. Such was the result of a two-year study conducted by the US Department of Energy on the feasibility of cloud computing for meeting the computational demands of big-data research projects.

The 169-page report says that while commercial clouds might be fine for enterprise applications, big-data research requires more "care and feeding." In other words, the marketing pitch of the cloud as a plug-and-play compute solution does not really hold water.

The DOE team, comprised of the Argonne National Laboratory in Illinois and Lawrence Berkeley National Laboratory in California, executed a range of scientific computing projects on Magellan, a testbed for cloud computing with server farms located at the National Energy Research Scientific Computing Center and the Argonne Leadership Computing Facility, as well as commercial clouds such as Amazon's EC2. The performance, costs, and manageability of the clouds were then compared to a Cray XT4 supercomputer and a Dell cluster system.

“Our analysis shows that DOE centers are often three to four times less expensive than typical commercial offerings,” the authors write in the report. “These cost factors include only the basic, standard services provided by commercial cloud computing, and do not take into consideration the additional services such as user support and training that are provided at supercomputing centers today and are essential for scientific users who deal with complex software stacks and require help with optimizing their codes.”

The study reached the following conclusions:

Scientific applications have special requirements that require cloud solutions that are tailored to these needs.

The scientific applications currently best suited for clouds are those with minimal communication and I/O (input/output).

Clouds can require significant programming and system administration support.

Significant gaps and challenges exist in current open-source virtualized cloud software stacks for production science use.

Clouds expose a different risk model, requiring different security practices and policies.

The MapReduce programming model shows promise in addressing scientific needs, but current implementations have gaps and challenges.

Public clouds can be more expensive than in-house large systems. Many of the cost benefits from clouds result from the increased consolidation and higher average utilization.

DOE supercomputing centers already achieve energy efficiency levels comparable to commercial cloud centers.

Cloud is a business model and can be applied at DOE supercomputing centers.

Click here to download the study.

Amazon Cuts Cloud Storage Prices

By Matthew Dublin

Amazon Web Services has announced its latest price reduction, this time for storage. Amazon S3 standard storage customers will see the most benefit from these price cuts. For example, if you're storing 50 terabytes of data you can expect a 12 percent reduction in cost, and if you have 500 terabytes of data you will now see a 13.5 percent savings.

The price reductions took effect on February 1, 2012.

These savings are a direct result of the continued growth of S3, which by the end of 2011 hosted roughly 762 billion objects. At peak times, S3 processes 500,000 requests per second. Since 2006, the total number of objects stored on S3 has grown by 192 percent, with last year experiencing the most significant growth.

Cleaning up Messy Data with Google Refine

By Matthew Dublin

Rod Page over at iPhylo has a post describing how useful Google Refine is for cleaning up taxonomic databases. Google Refine, formerly known as Freebase Gridworks, is a freely available web-based "power tool" that supports TSV, CSV, Excel, and XML file formats. Among other features, Google Refine allows users to pull together disparate data sets and work with the data in a collated, polished fashion.

Page, a professor of evolutionary biology at the University of Glasgow, is a big fan of Google Refine's "Reconciliation Services," which he uses for matching names to external identifiers.

So far, Page has used Google Refine with EOL, NCBI taxonomy, uBio, WORMS, and GBIF.

Fighting Disease with iPhones and Big Data

By Matthew Dublin

A startup iPhone app developer based in Bucharest, Romania, called Skin Scan has big plans to fight and track skin cancer. Skin Scan's app (also called Skin Scan) allows users to snap pictures of questionable moles or lesions, which are then sent to Skin Scan's servers, where a proprietary algorithm analyzes the picture. While the app will not yet provide an accurate diagnosis, the algorithm will identify abnormalities, assign each a rating from low-risk to high-risk, and then refer users to local dermatologists.

Skin Scan is building an analytic database based on photographs and results from users, including location data, in order to create a time-space map model based on the severity and frequency of lesions.

As skin cancer is best analyzed over time, this data may be useful to not only physicians, but government and academic researchers tracking cancer as well, assuming it can be sufficiently de-identified.

The app developer also has designs on connecting doctors and users to eliminate in-person office visits.

In discussions of personalized medicine, the concept that someday soon patients might walk around with their genomes in their pockets or on mobile devices is often batted around, but its viability or execution is rarely explored. Technology developments such as Skin Scan could prove to be a good test case for connecting patients and physicians through personalized medical data in a way that integrates instantaneous communication and real-time data analysis with consumer electronic devices.

Cray Now Offering $200,000 Supercomputer

By Matthew Dublin

In an effort to reach out to researchers with limited funding and a desire to own their own supercomputer, Cray is now offering a line of commodity supercomputers with a starting price tag of $200,000.

Cray's entry-level offering combines the software support previously only reserved for Cray CX1 and Cray CX1000 systems with the petascale capabilities of the Cray XE6m and Cray XK6m line. The $200,000 system also comes equipped with Cray's Gemini interconnect, the latest version of the Cray Linux Environment, powerful AMD Opteron 6200 Series processors, and GPUs.

"Cray's new entry-level configurations leverage its deep HPC technology portfolio to create purpose-built systems for the departmental technical computing market segment," said Earl Joseph, IDC program vice president for HPC. "This segment was worth around $3 billion in 2011 and IDC projects that it will grow at a healthy 7 percent to 8 percent CAGR through 2015."

The new "affordable" supercomputer is not really a full-fledged supercomputer per se but rather a blade server configuration that's essentially a baby XE6m configuration with six blades and 49 sockets using Opteron 6200s. The server rack is capable of 6.5 teraflops, which comes out to about $30,769 per teraflop.
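The per-teraflop figure follows from dividing the starting price by the quoted peak performance. A minimal Python sketch of that arithmetic, using only the two numbers given in the article (the variable names are illustrative):

```python
# Figures quoted in the article; the variable names are illustrative only.
system_price_usd = 200_000
peak_teraflops = 6.5

# Price divided by peak performance gives the cost of one teraflop.
cost_per_teraflop = system_price_usd / peak_teraflops
print(f"${cost_per_teraflop:,.0f} per teraflop")
```

This treats the quoted 6.5 teraflops as the performance delivered at the $200,000 starting price, which the article implies but does not spell out.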

These new entry-level supercomputers might be the perfect solution for researchers interested in developing code for larger-scale systems, such as Blue Waters at the National Center for Supercomputing Applications at the University of Illinois or the Titan supercomputer at Oak Ridge National Laboratory.

What it Takes to Get to Exascale

By Matthew Dublin

Science has an article discussing what it will take to make exascale computing a reality. These new systems, which at present remain only theoretically possible, would be capable of performing 10 to the 18th power floating-point operations per second, or an exaflop.

Exascale supercomputers would be roughly 100 times more powerful than today's fastest supercomputer, the K Computer at Japan's Riken institute, which is currently ranked at roughly 11.3 petaflops. All the major supercomputing powers are racing towards constructing a viable exascale system, including the US, China, Japan, Russia, India, and the EU.
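The "100 times" comparison is a round figure. A quick Python check, using the 11.3-petaflop number quoted above and the standard definitions of petaflop and exaflop, puts the gap slightly under two orders of magnitude:

```python
EXAFLOP = 10**18    # floating-point operations per second
PETAFLOP = 10**15

# K Computer performance as quoted in the article.
k_computer_flops = 11.3 * PETAFLOP

# How many K Computers' worth of performance one exaflop represents.
speedup = EXAFLOP / k_computer_flops
print(f"One exaflop is about {speedup:.0f}x the K Computer")
```

At the quoted figure the ratio comes to about 88, so "100 times" is best read as an order-of-magnitude statement.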

However, the challenges of energy efficiency and sustained performance are formidable, not to mention developing brand new programming models for these huge systems.

Even though computer hardware has seen a steady increase in performance over the last few decades, when it comes to actually achieving exascale performance, all those technological advances go out the window. Exascale won't simply be a matter of building a really, really large supercomputer center, crammed to the ceiling with the latest server blades, but rather, an entirely new processor and interconnect architecture.

Intel has released its 50-core Knights Corner and Xeon E5 server chips in an attempt to build up to exascale by the year 2018. These chips are designed for massive processor core counts as well as low energy consumption.

Sometimes the need for completely new hardware to accommodate the perpetual growth in research data gets lost: folks still think the cloud can save them when, for example, genomics datasets reach the exascale mark. Unfortunately, an exascale cloud can't exist until there is exascale hardware to make it float.

API for Statistical Phylogenetics with HPC

By Matthew Dublin

Researchers at the University of Maryland have developed BEAGLE, an application programming interface and specialized library for high-performance statistical phylogenetic inference that allows existing software packages to make more effective use of available computer hardware including GPUs, CPUs with Streaming SIMD Extensions, and multi-core CPUs via OpenMP.

The team described their research in Systematic Biology, writing that "a specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future."

BEAGLE is compatible with Mac, Windows, and Linux operating systems. It is freely available for download here.

GPUs for GWAS

By Matthew Dublin

Because the computational burden of searching for epistasis in genome-wide association study data is often prohibitive, a team from the Roslin Institute at the University of Edinburgh has attempted a powerful and cheap implementation of a search algorithm on GPUs using OpenCL. The team published a paper in Bioinformatics describing the GPU implementation, which achieved a 92-fold speedup of an exhaustive epistasis scan for a quantitative phenotype.
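To get a feel for why an exhaustive scan is so costly, consider a rough sketch that assumes the scan tests every pair of SNPs (a two-locus search). The number of tests then grows with the square of the panel size. The panel sizes below are hypothetical and are not taken from the paper:

```python
from math import comb

def pairwise_tests(num_snps: int) -> int:
    """Number of distinct SNP pairs an exhaustive two-locus scan must test."""
    return comb(num_snps, 2)

# Hypothetical panel sizes, chosen only for illustration.
for n in (1_000, 100_000, 500_000):
    print(f"{n:>7,} SNPs -> {pairwise_tests(n):,} pairwise tests")
```

Under this assumption a panel of 500,000 SNPs already requires on the order of 10^11 tests. Each test is independent of the others, which is the kind of workload that maps well onto GPUs.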

In their paper, the authors write that "to achieve a comparable computational improvement without a graphics card would require a large compute-cluster, an option that is often financially non-viable. The implementation presented uses OpenCL—an open-source library designed to run on any commercially available GPU and on any operating system."

Their software, called EpiGPU, is open-source and GPU-vendor independent, meaning that it will run on any GPU card.

It can be downloaded here.

Amazon Rolls out NoSQL Database Service

By Matthew Dublin

Amazon Web Services (AWS) has launched a fully managed NoSQL database service in the cloud called DynamoDB that aims to provide seamless scalability on the fly. AWS claims that its new service will offload administrative tasks such as hardware provisioning, setup, configuration, replication, software patching, and cluster scaling.

According to their announcement, "developers can create a database table that can store and retrieve any amount of data, and serve any level of request traffic. DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent, fast performance. All data items are stored on Solid State Disks and are automatically replicated across multiple Availability Zones in a Region to provide built-in high availability and data durability."

Amazon's CTO Werner Vogels has a post on his blog discussing the announcement, where he describes DynamoDB as the result of 15 years of "learning" in the areas of large-scale non-relational databases and cloud computing. "Several years ago we published a paper on the details of Amazon’s Dynamo technology, which was one of the first non-relational databases developed at Amazon. The original Dynamo design was based on a core set of strong distributed systems principles resulting in an ultra-scalable and highly reliable database system."

With a NoSQL database there is no strict schema, so data is collapsed into one very fat table where each row stores a huge amount of data. The NoSQL database contains a lot of data redundancy, which means more storage space and computational power is required compared to SQL databases.
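The "fat row" idea can be illustrated loosely with plain Python dictionaries. This is only a sketch: the sample data is made up, the `to_document` helper is hypothetical, and nothing here uses DynamoDB or any real database API.

```python
# Relational style: two normalized "tables" linked by sample_id.
# All values are invented for illustration.
samples = [{"sample_id": 1, "organism": "maize"}]
variants = [
    {"sample_id": 1, "gene": "geneA", "allele": "T"},
    {"sample_id": 1, "gene": "geneB", "allele": "G"},
]

def to_document(sample, variant_rows):
    """Fold a sample and its variant rows into one self-contained item."""
    item = dict(sample)  # copy so the original row is left untouched
    item["variants"] = [
        {"gene": v["gene"], "allele": v["allele"]}
        for v in variant_rows
        if v["sample_id"] == sample["sample_id"]
    ]
    return item

doc = to_document(samples[0], variants)
print(doc)
```

The resulting item carries everything needed to answer a query about the sample without a join, at the price of nesting (and, across many items, repeating) data that a relational design would keep in separate tables.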

AWS might attract customers in genomics with this offering as there have already been several use cases of NoSQL in the cloud for omics research. For example, last October, Monsanto deployed Cloudant's NoSQL database as the foundation of their genomics data analysis system.

DynamoDB users can get started with a free tier account that enables 40 million requests per month free of charge. Additional request capacity is priced at cost-efficient hourly rates as low as $0.01 per hour for 10 units of Write Capacity or 50 strongly consistent units of Read Capacity, with replicated solid state disk storage at $1 per GB per month.
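As a back-of-the-envelope aid, the launch prices quoted above can be turned into a rough monthly estimate. The sketch below makes several assumptions that are not from the announcement: a 730-hour month, strictly linear pricing, strongly consistent reads only, and no free-tier discount. The function name and the example workload are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

def estimate_monthly_cost(write_units: int, read_units: int, storage_gb: float) -> float:
    """Rough monthly bill in USD based on the launch prices quoted above."""
    write_cost = (write_units / 10) * 0.01 * HOURS_PER_MONTH  # $0.01/hour per 10 write units
    read_cost = (read_units / 50) * 0.01 * HOURS_PER_MONTH    # $0.01/hour per 50 read units
    storage_cost = storage_gb * 1.00                          # $1 per GB per month
    return write_cost + read_cost + storage_cost

# Hypothetical workload: 100 write units, 500 read units, 250 GB stored.
print(f"${estimate_monthly_cost(100, 500, 250):,.2f} per month")
```

For this made-up workload, storage accounts for most of the bill, which suggests that provisioned capacity is not necessarily the dominant cost for data-heavy uses.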

GPU-Based Cluster Aids Nanocarrier Simulations

By Matthew Dublin

A team at the University of Illinois at Chicago is using both traditional and GPU-based clusters at the National Center for Supercomputing Applications (NCSA) to study nanocarriers. Like an empty bullet casing, nanocarriers could provide a targeted delivery method for drugs needed to kill cancer cells.

The NCSA's clusters enabled the researchers to perform extensive atomistic molecular dynamics simulations of polyethylene glycol (PEG)-ylated phospholipid dendron-based micelles — aggregates of surfactant molecules dispersed in a liquid colloid — in which the micelles are characterized in pure water and ionic solutions.

"Our simulations are massive," says principal investigator Petr Kral. "They have up to 750,000 atoms and they need to be calculated for a relatively long time, up to 30 nanoseconds. That is why the supercomputer was very useful to us and very necessary."

While Kral and his collaborators developed their own GPU-based computer system in their lab, it lacked the power for the simulations they run. Their results were published last year in the Journal of the American Chemical Society and Chemical Communications.