ARCHIVED: Completed project: Data Capacitor

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Primary UITS contact: Stephen Simms

Completed: September 2, 2011

Description: This project, creating a Data Capacitor and a Metadata/Web Services server, addresses two clear and widespread challenges: the need to store and manipulate large amounts of data for short periods of time (hours to several days) and the need for reliable and unambiguous publication, discovery, and utilization of data via the web. For more, see About the Data Capacitor.

The Data Capacitor is both a system and a project, funded by the National Science Foundation with significant matching funds from Indiana University.

The Data Capacitor WAN project and system follow the success of the original Data Capacitor. The WAN file system uses the same technology, modified by Indiana University to provide automatic user ID (UID) mapping, delivering high-speed, short-term storage across geographically diverse supercomputing resources on the TeraGrid network.

Progress and research possibilities in many disciplines have been fundamentally changed by the abundance of data now produced so rapidly by advanced digital instruments. Scientists now face the challenge of extracting the information and meaning these data contain. IU has established a significant cyberinfrastructure composed of high-performance computing systems, archival storage systems, and advanced visualization systems spanning its two main campuses in Indianapolis and Bloomington and connected to national and international networks. IU continues to enhance this infrastructure in ways that qualitatively change the research capabilities and discovery opportunities of the broad array of scientists who work with large data sets.

The Data Capacitor is a large-capacity, short-term data store with very fast I/O. The Data Capacitor has become a development platform and testbed for new cyberinfrastructure, as well as a proof of concept for large-capacity, short-term storage devices.

At SC07, the Data Capacitor was used to demonstrate fast data transfers across great distances using the Lustre file system. An Indiana University-led team used this wide-area capability to win the 2007 international Bandwidth Challenge competition.

In 2008, the Data Capacitor WAN (DC-WAN) system was made available to TeraGrid users nationwide. DC-WAN enables collaborative research by providing high-speed storage across the TeraGrid and other wide-area networks. The system uses open source Lustre file system code, modified by Indiana University to map user IDs (UIDs) across distributed clients, removing file ownership mismatches as an impediment to working across resources.
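
To illustrate the idea (this is not IU's actual kernel modification, and all site names and UID values below are invented), the following hypothetical Python sketch shows the kind of bookkeeping UID mapping performs: each (remote site, remote UID) pair is translated to a single canonical UID, so that files written from any client end up with consistent ownership on the shared file system.

    # Hypothetical illustration of UID mapping across Lustre WAN clients.
    # This is NOT the IU kernel patch; it only sketches the idea that the
    # same person may hold different numeric UIDs at different sites, and
    # that the file system maps each (site, remote UID) pair to one
    # canonical UID so file ownership is consistent everywhere.

    # Example mapping table (all sites and UID values are made up).
    UID_MAP = {
        ("iu.teragrid.org", 5021): 5021,     # IU user, identity mapping
        ("psc.teragrid.org", 31007): 5021,   # same person, PSC-assigned UID
        ("ncsa.teragrid.org", 8804): 5021,   # same person, NCSA-assigned UID
    }

    NOBODY = 99  # unmapped users are squashed to an unprivileged UID

    def canonical_uid(site: str, remote_uid: int) -> int:
        """Return the canonical UID for a request arriving from `site`
        whose file owner is `remote_uid` at that site."""
        return UID_MAP.get((site, remote_uid), NOBODY)

    # Files written from PSC and from NCSA end up owned by the same
    # canonical UID on the shared file system.
    print(canonical_uid("psc.teragrid.org", 31007))   # -> 5021
    print(canonical_uid("ncsa.teragrid.org", 8804))   # -> 5021
    print(canonical_uid("unknown.example.org", 1234)) # -> 99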

At the 2009 Lustre Users Group meeting, Stephen Simms gave a talk highlighting some of the advantages of Lustre WAN. For example, researchers could use supercomputing resources in geographically distributed locations to create simulation data that could then be visualized locally. By the end of 2009, DC-WAN had been mounted at seven TeraGrid sites and was running production jobs at five of them.

Data Capacitor

  • Primary use: Short-term storage
  • Operating system: Red Hat Enterprise Linux 5.4
  • File system: Lustre 1.8.1.1
  • Peak theoretical processing capability: 1.54 teraFLOPS
  • Achieved maximum Linpack performance: Not applicable
  • Total system RAM: 208 GB
  • Total disk storage: 535 TB
  • Total archival storage: Not applicable
  • Processor: Dual-core 3.0 GHz Xeon; four floating-point operations per clock cycle per core
  • Types of nodes:
    • Object-based Storage Server (OSS): dual-core 3.0 GHz Xeon processors
    • Metadata Server (MDS): dual-core 3.0 GHz Xeon processors
  • Numbers of nodes:
    • Twenty-four Object-based Storage Servers (OSS)
    • Six Metadata Servers (MDS)
  • Internal network: One gigabit private
  • External network: One gigabit to MDS, ten gigabit to OSS
  • Connections to external network: 242 Gbit capability connected to Cisco Nexus 7018
  • Date of acquisition: Delivered May 2006; accepted October 2006
  • Accessible to: Local IU users
  • Cooling: Enclosed, self-contained water-cooled racks (from Rittal, Inc.)
  • Further information: Peak 14.5 GBps I/O rate

Data Capacitor WAN

  • Primary use: Short-term storage
  • Operating system: Red Hat Enterprise Linux 5.4
  • File system: Lustre 1.8.1.1, patched with IU's UID/GID mapping code
  • Peak theoretical processing capability: 1.54 teraFLOPS
  • Achieved maximum Linpack performance: Not applicable
  • Total system RAM: 144 GB
  • Total disk storage: 361 TB
  • Total archival storage: Not applicable
  • Processor: Dual-core 3.0 GHz Xeon; four floating-point operations per clock cycle per core
  • Types of nodes:
    • Object-based Storage Server (OSS): dual-core 3.0 GHz Xeon processors
    • Metadata Server (MDS): dual-core 3.0 GHz Xeon processors
  • Numbers of nodes:
    • Four Object-based Storage Servers (OSS)
    • Two Metadata Servers (MDS)
  • Internal network: One gigabit private
  • External network: One gigabit to MDS, ten gigabit to OSS
  • Connections to external network: 42 Gbit capability connected to Cisco Nexus 7018
  • Date of acquisition: Spring 2008
  • Accessible to: Local IU users and TeraGrid users nationwide

Outcome and benefits: Progress and research possibilities in many disciplines have been fundamentally changed by the abundance of data now so rapidly produced by advanced digital instruments. A critical challenge facing scientists is to draw out from these data the information and meaning that they contain. The Data Capacitor provides researchers with a 535 TB file system to temporarily store and manipulate large data sets. Because the file system can be mounted in multiple places, it is possible for the Data Capacitor to play a role in every step of the data lifecycle, from acquisition or creation, through computation and visualization, to archive storage. Because of its size, the Data Capacitor will help even out mismatches between the rate of data production and the rate of data analysis, much the way a capacitor evens the flow of electrons in a circuit. Because of its aggregate 14.5 GBps write rate, the Data Capacitor can keep up with even the most tenacious data firehose.
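
The capacitor analogy can be made concrete with a toy sketch. The short Python example below is purely illustrative (it is not Data Capacitor software): a bounded buffer sits between a bursty producer and a slower, steady consumer, absorbing the bursts without losing data, which is the role the short-term store plays between fast instruments and slower analysis pipelines.

    # Toy illustration of the "capacitor" analogy: a bounded buffer absorbs
    # bursts from a fast producer (an instrument) and drains at the slower,
    # steady rate of a consumer (an analysis pipeline). Purely illustrative;
    # this is not Data Capacitor software.
    import queue
    import threading
    import time

    buffer = queue.Queue(maxsize=100)  # stand-in for the short-term store

    def instrument():
        """Produce data in bursts, faster than it can be analyzed."""
        for burst in range(5):
            for item in range(20):
                buffer.put(f"dataset-{burst}-{item}")  # blocks if the buffer is full
            time.sleep(0.1)                            # idle between bursts
        buffer.put(None)                               # end-of-stream marker

    def analysis():
        """Consume data at a steady, slower rate."""
        while (item := buffer.get()) is not None:
            time.sleep(0.01)                           # simulated processing
        print("analysis finished without losing data")

    producer = threading.Thread(target=instrument)
    consumer = threading.Thread(target=analysis)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()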

The success of the original Data Capacitor project and systems was followed in 2008 by the Data Capacitor WAN project, which further enables research across geographically diverse computing resources by providing a consistent high-speed file system with automatic user ID (UID) mapping.

Client impact: In addition to the Co-PIs and SIs listed below, this project benefits faculty members at IU involved in the following projects:

  • Center for Genomics and Bioinformatics

    The Center for Genomics and Bioinformatics is a multidisciplinary research center serving the IUB campus. The CGB carries out independent research in genomics and bioinformatics, collaborates with and/or assists projects developed by IUB faculty, and promotes interdepartmental and interdisciplinary interactions to enhance genomics and bioinformatics at IUB.

  • Computational Chemistry

    James P. Reilly's laboratory focuses on research in efficient biomolecular ion production, proteomics, photochemistry of peptide ions, protein structure and cellular fingerprinting, and novel time-of-flight instrumentation.

  • Computational Fluid Dynamics Laboratory

    The Computational Fluid Dynamics Laboratory was established in 1986 within the Department of Mechanical Engineering to conduct research and develop software in the areas of computational fluid dynamics and heat transfer. Current research projects include the finite element and finite volume solution of three-dimensional flow problems; high speed compressible flow calculations for internal and external flows; unsteady flow computations; moving body flows with unstructured meshes; parallel computing; load balancing for parallel computing on parallel processors and network of workstations; and high-performance grid computing.

  • Internet Traffic Analysis

    This project studies the infrastructure scalability and vulnerabilities of expanding communication networks, by means of analyzing the statistical behavioral patterns that emerge and are observable in Internet traffic data. The idea is that such analysis may lead to robust design/planning/management tools as well as methods for mitigating and/or immunizing against attacks by early detection of anomalous patterns correlated with malicious behavior. The networks considered span a very broad range of scale, from individual interactions (e.g., social engineering, phishing, covert communication) to application-specific flows (e.g., spam, email, and web-based DDoS) to global-scale Internet traffic networks (e.g., Internet2 peer networks and worms).

  • Linked Environments for Atmospheric Discovery

    Linked Environments for Atmospheric Discovery (LEAD) makes meteorological data, forecast models, and analysis and visualization tools available to anyone who wants to interactively explore the weather as it evolves. The LEAD Portal brings together all the necessary resources at one convenient access point, supported by high-performance computing systems. With LEAD, meteorologists, researchers, educators, and students are no longer passive bystanders or limited to static data or pre-generated images, but rather they are active participants who can acquire and process their own data.

  • Platform for Computational Comparative Genomics on the Web

    PLATCOM is an integrated system for the comparative analysis of multiple genomes. It is designed in a modular way, so that multiple tools and databases can be integrated freely and the whole system can grow easily. PLATCOM is built on internal databases consisting of GenBank, Swiss-Prot, COG, KEGG, and the Pairwise Comparison Database (PCDB). PCDB is derived from GenBank by performing pairwise protein-to-protein and whole-genome-to-whole-genome comparisons with FASTA and BLASTZ, respectively; it currently contains 48,205 entries of unduplicated pairwise comparison matches. PCDB is designed to incorporate new genomes automatically, so PLATCOM evolves as new genomes become available. A suite of genome analysis applications is provided on top of these databases.

  • Polar Grid

    Polar Grid is an NSF MRI-funded partnership of Indiana University and Elizabeth City State University to acquire and deploy the computing infrastructure needed to investigate the urgent problems in glacial melting.

  • Proteomics at IU

    The Proteomics Core Facility at the IU School of Medicine opened in the fall of 2001 in the Department of Biochemistry and Molecular Biology. It is a component of the INGEN cores supported by Indiana Genomics Initiative (INGEN). The Proteomics Core Facility became the academic component of the Indiana Centers for Applied Protein Sciences (INCAPS) in May 2004 and was renamed the Protein Analysis and Research Center. It is a service and collaborative research resource that balances applied proteomics research with the development of new and improved methods for protein identification, characterization, and quantification. The Center encourages collaborations that apply the tools of proteomics to cutting-edge biomedical research. For more, see the article Honing the Proteome in Research & Creative Activity.

  • WIYN Observatory

    The WIYN Telescope, a 3.5-meter instrument employing many technological breakthroughs, is the newest and second largest telescope on Kitt Peak. The WIYN Observatory (pronounced "win") is owned and operated by the WIYN Consortium, which consists of the University of Wisconsin, IU, Yale University, and the National Optical Astronomy Observatories (NOAO). Most of the capital costs of the observatory, which amounted to $14 million, were provided by these universities, while NOAO, which operates the other telescopes of the Kitt Peak National Observatory, provides most of the operating services. This partnership between public and private universities and NOAO is the first of its kind. The universities benefit from access to a well-run observatory at an excellent site, and the larger astronomical community served by NOAO benefits from the addition of this large, state-of-the-art telescope to Kitt Peak's array of telescopes.

  • X-ray Crystallography

    The Indiana University Molecular Structure Center (IUMSC) is a service and research facility in the Department of Chemistry at IUB. The laboratory has a full complement of single crystal and powder diffraction equipment used to characterize crystalline materials using the techniques of X-ray crystallography. Researchers in the laboratory can determine the three-dimensional structure of nearly any material that can be crystallized. A crystallographic study produces a set of atomic coordinates that locate the atoms of a molecule in the "unit cell" of the crystal. This information can then be used to generate images of the molecule and to determine distances and angles in the molecule. In addition, the data allow one to examine the packing of the molecules in the crystal, information that can often lead to understanding the properties of the material. The IUMSC server allows rapid access to the data generated in the IUMSC. Nearly all of the materials studied have been synthesized or isolated by researchers from other laboratories, usually within the IU system, but often from laboratories throughout the world.

Project sponsor: Craig Stewart, Associate Dean for Research Technologies

Project team:

  • Stephen Simms
  • Joshua Walgenbach
  • Justin Miller
  • Nathan Heald
  • Eric Isaacson

Additional information

  • PI: Craig Stewart
  • Co-PIs: Randall Bramley, Catherine A. Pilachowski, Beth Plale, Stephen Simms
  • SIs: P. Cherbas, S. Chien, D. Clemmer, M. Davy, A. Dzierba, G.C. Fox, K. Kallback-Rose, M. Gupta, D. Hart, K. Honeycutt, J. Huang, J. Huffman, S. Kim, A. Lumsdaine, F. Menczer, S. Mooney, M. Palakal, J. Paolillo, P. Radivojac, J. Reilly, H. Tang, E. Wernert, B. Wheeler, D. Durisen, H. Cohn, R. Payli
  • Funding agency and grant number: NSF CNS0521433
  • Grant dates: October 1, 2005-October 1, 2008
  • Funding to UITS: $1,720,000
  • Total funding to IU related to this project: $1,720,000

This is document avku in the Knowledge Base.
Last modified on 2018-01-18 15:44:51.