Remix Theory » Archivio » Project : symbiose

Project : symbiose

in Science, Software | Thursday, March 22nd, 2007 | 0 Commenti » Trackback

Text source: Inria

Section: New Results

Parallelism and optimization

Participants : Rumen Andonov, Dominique Lavenier, Mathieu Giraud, Hugues Leroy, StÃ©phane Rubini, Pierre PeterLongo, Gilles Georges, Nicolas Yanev, Guillaume Collet, Mai Fei.

The parallelism axis mainly focuses on two activities:

the design of specialized parallel machines for scanning genomic banks in relation with axis 6.1;
the modelling and parallelization of optimization problems.

Specialized architectures for scanning and processing genomic banks

Participants : Mathieu Giraud, Dominique Lavenier, StÃ©phane Rubini, Philippe Veber, Gilles Georges.

Blast [41], [42] has steadily become the reference software for exploring genomic banks. Large databases can be quickly and easily screened to detect similarity with a query sequence. This type of algorithm, and many other algorithms such as patternhunter [95] or chaos [54], proceed in two steps: first they seek for anchors, then they extend them into alignments. The load balancing between this two tasks depends on the quality of the anchors. Since the alignment extension can be time consuming, the goal is to limit the number of hits by providing anchors of good quality.

More generally, the problem of mining genomic banks is either bounded by the data access (the time for scanning all the bank) or the computation time (the time to detect good anchors). We address this problem following two complementary ways: (1) speeding-up the anchor detection using reconfigurable hardware; (2) speeding-up the data access using parallel disk architectures and indexing techniques. We have developing two hardware prototypes: the RDISK system and the ReMiX systems. Both are parallel and reconfigurable systems. RDisk is developed since 2001, and ReMiX since September 2003.

With the increasing amount of available complete genomes, the need for inter or intra genome comparison is now a reality. However processing such a volume of data is extremely time consumming, and supercomputer manufacturers now propose to include accelerator boards in their machines. Through the ANR PARA project, in cooperation with the BULL R&D team (Les Clayes sous bois), we are currently designing a reconfigurable accelerator tightly interconnected to their system.

RDISK project: filtering genomic banks with reconfigurable disks

The central idea of the RDISK project is to directly filter genomic data at the disk output, in order to provide the host computer with only relevant data. The challenge is to process data at the output rate of the disk and to forward only a low percentage of the database together with anchoring informations. Previous attempts are motivated by a major trend: hard disk controllers are designed with an increasing amount of general purpose processing power and on-chip memory and filtering the data by pushing computation closer to the storage system is becoming an attractive solution for providing reduction in data movement through the I/O system.

Instead of an embedded processor we propose to connect a reconfigurable system based on a low cost FPGA component to the hard disk. The main advantage is that the anchoring-search algorithm can be highly parallelized on simple hardware structures [77], allowing on-the-fly filtering of the genomic data.

Another point to consider is the time for accessing the genomic data. The quantity of data transmitted to the processor is expected to be low and it is likely to have no data to process. The complete system is thus made of a front-end computer connected to a bunch of hard disks ( a 48-node system) coupled to reconfigurable processing and interconnected through an Ethernet local network. Depending of the type of query, an adequate hardware filter is first downloaded to the FPGA component before scanning the banks. The filtering occurs locally and results are send back to the front-end computer for further post-processing. As an example, when performing complex motif extraction, the RDISK system has shown performances equivalent to a 192 PC cluster [73], [79].

Since 2005, the prototype is fully operational. An effort has been made to make it available to the scientific community. More precisely, we have implemented a complex motif search service (WAPAM) based on weighted automaton. This service is now available through the Ouest-Genopole bioinformatics platform [105], [74].

ReMiX project: Reconfigurable memory for indexing huge volume of data

Compared to the previous project, the ReMiX project goes one step further by addressing the data access problem. The idea, here, is not to duplicate disk accesses, but to propose a hardware mechanism allowing fast random accesses to Gbytes of data. In that way, indexing techniques accessing only a fraction of the bank become highly efficient.

In the ReMiX architecture, hard drives are replaced by FLASH memories whose access time are 2 or 3 orders of magnitude shorter. In the same way, data bandwidth is increased by accessing simultaneously a large number of FLASH memories. As in the RDISK project, data are processed on-the-fly by reconfigurable hardware directly connected to the memory.

Note that the reconfigurable index memory does not fit in the addressing space of the processor but it is indirectly accessed by specific queries. The reconfigurable index memory does not hold any cache hierarchy, and therefore memory accesses do not have to worry about the data locality. In 2005, we have assembled and tested the ReMIX prototype. It is composed of a small cluster of five PCs. One acts as a front-end machine and the four others are the processing nodes , each one housing two PCI boards of 64 Gbytes of memory. The whole system hold 512 Gbytes of FLASH memory.

In 2006, two genomic applications have been implemented, illustrating the potential of the ReMIX concept. The first one deals with the search into large DNA banks. In that case, the FLASH memory contains the full bank index, allowing fast retreival compared to traditionnal approaches (speed up ranging from 20 to 50 has been measured). The second application is related to intensive comparison of a large set of proteins against the human genome. The two data set are fully indexed and compared using the FLASH memory as a temporary storage support. Again, high performances have been exhibited: plugging only one ReMIX board on a standard PC decreases the computation time by about 50.

High Performance Reconfigurable Supercomputing

For the last two or three years, processor performance growth have been limited due to the difficulty of steadily increasing the clock frequency. One response to continue to provide computer power has been the launching of dual core chips, and probably, in a very next future, quad and octo core processors. Another alternative is to enhance the traditionnal processing units with reconfigurable resources able to customize specific treatments. FPGA component offer today consequent processing power, but one of the challenge is to have these resources fed from the main memory at a very high speed. The PCI express interface, and especially the second generation, can achieved this goal by providing an agregated bandwidth of 10 Gbytes/sec.

In cooperation with the BULL R&D team (Les Clayes sous bois) we are currently designing a reconfigurable accelerator focussing on I/O performances. A genomic application, well suited for FPGA implementation, will serve as a benchmark for validating the concept.

Combinatorial optimization approach for solving protein threading problem

Participants : Rumen Andonov, Dominique Lavenier, Hugues Leroy, Nicola Yanev.

Protein folding is one of the most extensively studied problems in computational biology. The problem can be simply stated as follows: given a protein sequence, which is a string over the 20-letter amino acid alphabet, determine the positions of each amino acid atom when the protein assumes its 3D folded shape. In case of remote homologues, one of the most promising approaches is protein threading, i.e., one tries to align a query protein sequence with a set of 3D structures to check whether the sequence might be compatible with one of the structures.

We can summarize our contributions in this theme as follows. PTP has been shown to be equivalent to finding the augmented minimal path in a graph with a particular topology associated to any 3D protein structure. Several mathematical formulations for this problem have been proposed in terms of mixed integer programming models (MIP) [43]. These models were solved by the package CPLEX of ILOG and very interesting properties have been observed. The most amazing observation is that for almost all instances (more than 95% and even for polytopes with more than 10⁴⁶ vertices), the LP relaxation of the MIP models is integer-valued, thus providing optimal threading. Moreover, when the LP relaxation is not integer, its value is a relatively good approximation of the integer solution. Our approach was proven to be significantly faster than the popular in the literature B&B approach for solving PTP [92]. Moreover, the integer programming model has been proven faster than the MIP model used in the package RAPTOR [124] which was well ranked among all non-meta servers in CAFASP3 and in CASP6 (Critical Assessment of Structure Prediction).

A first direction of improvement was oriented on parallelizing the software FROST (Fold Recognition-Oriented Search Tool) which was developed few years ago by our partners from MIG, Jouy en Josas. FROST uses a database of about 1200 known 3D structures. Computing the associated distributions of scores used to take about 40 days on a 2.4Â GHz computer. On a cluster of 12 PCs, computing the score distributions takes now about three days which represents a parallelization efficiency of about 1 [103].

A second direction of improvement focused on accelerating the resolution of the PTP underlying optimization solver. The advantage of MIP models is that their LP relaxations give the optimal solution for most of the real-life instances. Their drawback is their huge size (both number of variables and number of constraints) which makes even solving the LP relaxation slow. Instead of solving them by general-purpose B&B algorithms using LP relaxation, one can design more efficient special-purpose algorithms. based on the specific properties of the PTP problem: we have proposed two Lagrangian approaches, Lagrangian relaxation and cost splitting. These approaches are more powerful than the general integer programming and allow to solve huge instances(Solution space size of 10⁴⁰ corresponds to a MIP model with 4Ã—10⁴ constraints and 2Ã—10⁶ variables [126].), with solution space of size up to 10⁷⁷ , within a few minutes. ork [25].

The above presented results confirm that integer programming approach is well suited to solve the protein threading problem. These results concern the global alignment of protein sequence and structure template. But the methods that have been developed can be adapted to other classes of matching problems arising in computational biology. Examples of such classes are semi-global alignment, where the structure is aligned to a part of the sequence (the case of multi-domain proteins), or local alignment, where a part of the structure is aligned to a part of the sequence. Problems of structure-structure comparison, for example contact map overlap, are also matching problems that can be treated with similar techniques. Solving these problems by Lagrangian approaches is work in progress.

Comparative genomics of bacteria using LR-PCR

Participants : Rumen Andonov, Dominique Lavenier, Philippe Veber, Nicola Yanev.

Comparative genomics aims to study genome variations between different species or different versions of the same organism. Here, we consider various strains of the pathogenic Gram positive bacteria Staphylococcus aureus.

A practical way to carry out genome plasticity analysis of bacteria â€“ without a systematic sequencing â€“ is to exploit the LR-PCR (Long Range Polymerase Chain Reaction) technique. The idea is to split the genomes of different strains into a large number of short segments, then to perform a LR-PCR on each segment. Depending on the reorganization, the deletion or the insertion of certain genomic zones, it is expected that a few segments will not be amplified by the LR-PCR. Thus a profile corresponding to the amplified segments will be assigned to each bacterium strain.

The goal is to cover the genome of a reference strain with overlapping segments of nearly identical size, constrained by starting and ending-primers. Primers are short synthetic oligonucleotides that have to respect certain constraints (no short palindromes, good balance between AT and CG nucleotides , etc. Practically, the bacterium genome is split into a few number of linear segments, called domains [49]. The problem of segmenting a complete bacterial genome is reduced to split each domain into segments of nearly identical size. Along a domain, there are specific positions (i.e. small 25 DNA character string) corresponding to all possible primer sites. The overlapping segments can only start and end at these positions. If we assume that a solution is made of a list of N segments, and that each segment can take only P different positions, then the number of possibilities equals P^N (N>100 in practice). We have explored various approaches for solving this problem by dedicated graph algorithms (see [44] for details), allowing a short computation time (1-2 minutes). Implementation of two algorithms have been performed and packaged into the GenoFrag software (see Section 5.2).

In 2006, the GenoFrag package has been enhanced with a new software (IThOS) for better selecting the primers. IThOS combines sequence indexing and thermodynamic evaluation to optimise oligonucleotide specificity. The whole genome sequence is first indexed using 6-base word. Then secondary hybridisation sites are searched for each primer candidate. For each position where a primer might hybridise, the thermodynamic stability of the duplex is calculated using the Nearest Neighbour Model and cut-off G^o values are then applied to select specific oligonucleotides. IThOS was found quicker and more sensitive than a BLASTn filter when tested for primers design dedicated to WGPS on a 0.5 Mb chromosomal portion of Staphylococcus aureus, a low G+C bacterium.

Lascia un commento

You must be logged in to post a comment.

Project : symbiose

Specialized architectures for scanning and processing genomic banks

RDISK project: filtering genomic banks with reconfigurable disks

ReMiX project: Reconfigurable memory for indexing huge volume of data

High Performance Reconfigurable Supercomputing

Combinatorial optimization approach for solving protein threading problem

Comparative genomics of bacteria using LR-PCR

Lascia un commento

Current Projects

Areas of Interest

_Meta-Searchers

Relevant Sites

RSS

Books

Archives

Meta