ARCHIVED: Completed project: Multicore computing

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Primary UITS contact: Judy Qiu

Completed: August 9, 2011

Description: A multicore CPU combines two or more cores (independent processing units) on a single chip. Currently only dual-core and quad-core models are commonly available from vendors; Intel, however, has demonstrated an 80-core prototype. Moreover, according to a respected white paper (The Landscape of Parallel Computing Research: A View From Berkeley), future CPUs will have hundreds or thousands of cores. This will increase computing power for both research and commercial applications, but it will also undoubtedly present significant software challenges for parallel applications. In collaboration with the Community Grids Lab at IU, we are researching multicore technologies.

General goals:

  • Develop, demonstrate, and test a hybrid model of parallel computing, involving workflow and mash-ups linking high-performance parallel modules implemented on multicore clusters
  • Implement and evaluate the performance of clusters of multicore systems
  • Develop a suite of parallel data mining algorithms for applications including GIS, cheminformatics, bioinformatics, speech recognition, and image processing for polar science

Current status:

  • We have defined a general parallel algorithm covering clustering and mixture models with built-in annealing to improve convergence.
  • We have developed parallel methods for mapping high-dimensional spaces to a smaller number of dimensions for easier visualization and analytic processing. We are comparing principal component analysis (PCA), generative topographic mapping (GTM), and multidimensional scaling (MDS). We intend to add Bourgain's random projection method.
  • We have measured and understood basic performance of two-processor systems with a total of eight cores.
  • We have successfully tested methods on GIS and cheminformatics applications using a hybrid software model based on Microsoft CCR and DSS. We currently are looking at a bioinformatics clustering problem. We support several types of variables, including real-valued, binary, and profile representation used in bioinformatics.
  • We have extended our work to a total of 32 cores on multicore clusters. We expect to further expand our test platforms to a total of 512 cores, using a combination of paradigms ranging from low-level technologies like MPI and CCR to new workflow and Internet computing approaches, including DSS, Google's MapReduce, and Apache Hadoop.
  • We are investigating overheads coming from communication, the programming paradigm, and the use of virtual machines.
  • With this enhanced computing resource, we will tackle major new applications, including the search for gene families in a collection of a million sequences. Early results suggest that our new annealing algorithms perform better than existing clustering and dimensional reduction methods.
  • Results have been presented worldwide at three conferences in fall 2007 and three so far in 2008, including presentations at CYFRONET AGH in Krakow, Poland, and the Many-Core Workshop in Shanghai, China. A presentation is also planned in Bloomington for October 2008 as part of the Research Technologies Roundtable series. We also plan to have an exhibition at the Fourth IEEE International eScience 2008 conference, December 7-12, 2008.
  • High Energy Physics data analysis is both data intensive (petabytes) and computation intensive. We have developed a data analysis tool that uses DryadLINQ and its MapReduce support to analyze LHC particle physics experiment data from the Center for Advanced Computing Research at the California Institute of Technology. The tool uses DryadLINQ to distribute the data files across the available computing nodes and then executes a set of analysis scripts, written in CINT (the interpreted language of the ROOT physics analysis package), on all these files in parallel. After processing each data file, the analysis scripts produce histograms of identified features, which are then merged (the "reduce" step of MapReduce) to dynamically produce a final overall result.
  • We are working with the IU School of Medicine to relate patient records to environmental factors; the figure shows clusters in the patient records visualized after MDS dimension reduction. This involves clustering 160-dimensional vectors from more than 360,000 patient records. The results would help identify environmental conditions that act as intervening factors in the obesity epidemic, a well-documented public health problem in the United States, through their impact on physical activity and eating habits.
  • Since the end of 2008, we have published two papers at the eScience 2008 conference and one paper in the book Trends in High Performance and Large Scale Computing. We are also writing an invited book chapter for Data Intensive Distributed Computing, due by mid-May 2009. We presented our work at Microsoft Research TechFest on February 24, 2009, and at the Microsoft External Research Symposium on March 31, 2009. Our plan of activities for 2009 includes making our core parallel algorithms (e.g., vector-based deterministic annealing clustering and pairwise clustering) available as services for public access, and writing grant proposals for medical and biology applications.
  • Between June 1 and July 31, 2008, a team of STEM summer scholars from North Carolina A&T joined the Community Grids Lab, supervised by Qiu. These three students were involved in research activities with the SALSA project, which is funded by Microsoft Research to investigate new programming models for parallel multicore computing and Cloud/Grid computing; it aims to develop and apply parallel and distributed cyberinfrastructure to support large-scale data analysis. The students showed motivation and great interest in their selected research topics. They chose to work on the following:
    • Algorithm optimization and performance measurements of parallel pairwise clustering algorithm on multicore clusters, which includes parallel matrix multiplication and eigenvector calculation using MPI
    • How to visualize and select metadata in our 3D visualization tool Plotviz
    • Data mining of health data to find correlations between high-dimension environment and patient data, using the statistics tool R with the canonical correlation analysis method

    All these sub-projects worked synergistically as parts of our scalable data analysis research in Cloud/Grid computing, biomedicine, and particle physics. The students successfully completed their research activities and gave poster presentations at the closing ceremony in Indianapolis on July 31, 2009.

  • We developed a high-performance Windows visualization tool (Plotviz) using Microsoft's XNA platform to display sets of points obtained by dimension reduction of high-dimensional data sets to 3D. It is being used as a browser for cheminformatics and bioinformatics data sets.
  • We developed Twister, which extends MapReduce to perform data mining efficiently. MapReduce is pioneering new approaches to data analysis with a simple programming model and quality of service. Twister extends MapReduce to deliver high performance for a broad class of data mining and machine learning applications, especially those built on iterative MapReduce computations. Twister incorporates a set of MapReduce extensions that we found useful in the SALSA HPC group's work with commercial MapReduce runtimes such as Microsoft Dryad and the open-source Hadoop. The Twister architecture is based on pub/sub messaging, which enables faster data transfers and minimizes runtime overhead. Applications we have implemented using Twister include K-means clustering, Google's PageRank, breadth-first graph search, matrix multiplication, and multidimensional scaling.
  • We presented a demo/exhibition at the IU booth at SC09 (salsahpc.indiana.edu), showing our work on life sciences applications using cloud technologies (e.g., MapReduce, Dryad, and Hadoop). We also developed a dynamic virtual cluster that demonstrated the concept of Science on Clouds using a FutureGrid cluster.
  • Our research work has been included in technical reports and published in book chapters, journal articles, and conference papers. In particular, we have "High Performance Parallel Computing with Clouds and Cloud Technologies" as a book chapter in Cloud Computing and Software Services: Theory and Techniques, CRC Press (Taylor and Francis), ISBN-10: 1439803153; an invited paper, "Cloud Technologies for Bioinformatics Applications", for the journal IEEE Transactions on Parallel and Distributed Systems; and papers to appear in the proceedings of the 1st International Conference on Cloud Computing in Munich, Germany; the 5th IEEE e-Science conference in Oxford, UK; the 1st International Conference on Cloud Computing in Beijing; the ACM MTAGS workshop at SC09; the 10th IEEE/ACM CCGrid 2010; the Multicore workshop of the 10th IEEE/ACM CCGrid 2010; and the ICCS 2010 conference. We have also submitted two reports to Microsoft: "Performance of Windows Multicore systems on threading and MPI" and "Applicability of DryadLINQ to Scientific Applications".
  • Between March 31 and December 31, 2009, we gave 18 presentations at conferences and workshops. These include the IEEE e-Science 2009 conference; the doctoral showcase at SC09; the Using Clouds for Parallel Computations in Systems Biology workshop at SC09; the Many-Task Computing on Grids and Supercomputers workshop; the Microsoft eScience workshop; the Open Grid Forum Summit09; the Indiana State IT meeting; and the NSF Data Intensive Computing workshop.
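
The deterministic annealing idea behind the clustering work above (a temperature-controlled soft assignment that is gradually cooled toward a hard clustering, improving convergence) can be sketched in a few lines. The following is a minimal serial illustration in Python/NumPy, not the project's parallel implementation; the function name, parameters, and data are invented for illustration.

```python
import numpy as np

def da_cluster(points, init_centroids, T0=2.0, cooling=0.9, iters=40):
    """Deterministic-annealing clustering sketch: memberships follow a
    Gibbs distribution at temperature T; cooling T hardens them toward
    ordinary K-means assignments, which helps avoid poor local minima."""
    centroids, T = init_centroids.astype(float), T0
    for _ in range(iters):
        # Squared distances from every point to every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Soft memberships (row-shifted before exp to avoid underflow).
        p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
        p /= p.sum(axis=1, keepdims=True)
        # Centroids become membership-weighted means.
        centroids = (p.T @ points) / p.sum(axis=0)[:, None]
        T *= cooling                      # cooling schedule
    return centroids

# Usage: two well-separated 2-D blobs, seeded with one point from each blob.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
c = da_cluster(pts, pts[[0, 2]])
print(np.round(c, 2))  # ≈ [[0.05, 0.1], [5.05, 4.95]], the blob centers
```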
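The MDS method used above for visualization maps high-dimensional points into 3D while approximately preserving pairwise distances. As a minimal illustration of the underlying idea, here is classical (Torgerson) MDS in Python/NumPy; this is a textbook sketch, not the project's parallel implementation.

```python
import numpy as np

def classical_mds(D, k=3):
    """Classical (Torgerson) MDS: find a k-dimensional embedding whose
    pairwise Euclidean distances approximate the distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]           # keep the top-k eigenpairs
    # Scale eigenvectors by sqrt of (clipped, non-negative) eigenvalues.
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Usage: four points on a line embed exactly in one dimension.
pts = np.array([[0.0], [1.0], [2.0], [4.0]])
D = np.abs(pts - pts.T)                        # exact pairwise distances
X = classical_mds(D, k=1)
print(np.allclose(np.abs(X - X.T), D))         # True: distances reproduced
```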
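The DryadLINQ physics analysis described above has a simple map/reduce shape: each data file is processed independently into a histogram of identified features, and the per-file histograms are merged into one overall result. A toy Python sketch of that shape (the file contents and bin width here are invented for illustration; the real tool runs CINT/ROOT scripts under DryadLINQ):

```python
from collections import Counter
from functools import reduce

def analyze_file(values, bin_width=10.0):
    """'Map' step: histogram one file's feature values into fixed-width
    bins. Stands in for the per-file CINT/ROOT analysis script."""
    return Counter(int(v // bin_width) for v in values)

def merge(h1, h2):
    """'Reduce' step: merge two partial histograms by adding bin counts."""
    return h1 + h2

# Usage: three "files" of values, processed independently then merged.
files = [[3.0, 12.5, 18.0], [4.1, 44.0], [11.0, 47.2]]
partials = [analyze_file(f) for f in files]   # parallelizable map phase
total = reduce(merge, partials)               # associative merge phase
print(dict(total))  # {0: 2, 1: 3, 4: 2}
```

Because the merge is associative, the partial histograms can be combined in any order across nodes, which is what lets the reduce phase run as a tree rather than a single sequential pass.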
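Twister's iterative MapReduce, described above, re-runs map and reduce steps over cached static data until convergence; K-means is the canonical example. Here is a serial Python sketch of K-means written in that map/reduce structure (plain NumPy with invented toy data, not the Twister API):

```python
import numpy as np

def kmeans_mapreduce(points, centroids, iters=20):
    """Toy K-means in map/reduce form. In an iterative-MapReduce runtime
    the points stay cached on workers across iterations; only the small
    centroid array is rebroadcast each round."""
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Map: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Reduce: per centroid, combine assigned points into a new mean
        # (keep the old centroid if no points were assigned to it).
        new = np.array([points[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break                              # converged: stop iterating
        centroids = new
    return centroids, labels

# Usage: two well-separated 2-D blobs recover their centers.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
c, labels = kmeans_mapreduce(pts, pts[[0, 2]].copy())
print(c)  # ≈ [[0.05, 0.1], [5.05, 4.95]]
```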

This is document avlq in the Knowledge Base.
Last modified on 2018-01-18 15:44:53.