08 July 2010

Pairwise Alignments in C++ (for fun). My notebook.

With Next Generation Sequencing technologies, I have more and more opportunities to program in C++. C/C++ was my first programming language and I now enjoy my past experience with java for designing some classes, using the design patterns, etc... . Thus, I've recently played with the pairwise alignments in C++ and, in this post I'll describe the classes I've used (hey, I wrote this for fun, I didn't intend to write a formal and optimized code about this subject (I'm not a specialist of this anyway). Furthermore, I will not describe what is Dynamic Programming).

The Classes


Sequence

Sequence is an abstract sequence of characters (= string). It has a 'size' and should return the index-th character.
class Sequence
{
public:
Sequence() { }
virtual ~Sequence() { }
virtual size_type size() const=0;
virtual char at(size_type index) const=0;
}*SequencePtr;

CCSequence

CCSequence is a concrete implementation of Sequence using a std::string as the string holder.
class CCSequence:public Sequence
{
private:
std::string str;
public:
CCSequence(const std::string& str):Sequence(),str(str) { }
virtual ~CCSequence() { }
virtual size_type size() const
{
return (size_type)str.size();
}
virtual char at(size_type index) const
{
return str[index];
}
};

PtrSequence

PtrSequence is another implementation of Sequence but here, it uses a C pointer char* as the string holder.
class PtrSequence:public Sequence
{
private:
const char* str;
std::size_t length;
public:
PtrSequence(const char* str,size_t length):Sequence(),str(str),length(length) { }
PtrSequence(const char* str):str(str),length(0UL) { length=std::strlen(str); }
virtual ~PtrSequence() {}
virtual size_type size() const
{
return (size_type)this->length;
}
virtual char at(size_type index) const
{
return str[index];
}
};

PathItem

This class is a component of the pairwise alignment. It stores the positions on both sequences. A position equals to '-1' would be a gap.
typedef struct PathItem
{
Sequence::size_type x;
Sequence::size_type y;
PathItem(Sequence::size_type x,Sequence::size_type y):x(x),y(y)
{
}
Sequence::size_type get(int side) const { return (side==0?x:y);}
}*PathItemPtr;

Path

the class Path is a solution of the matrix traversal by the dynamic algorithm. It is a vector of PathItem and it also implements Sequence as the consensus of the alignment:
typedef class Path: public Sequence
{
private:
const SequencePtr seqX;
const SequencePtr seqY;
std::vector<PathItem> path;

std::auto_ptr<Sequence> _asSeq(int side) const
{
const SequencePtr seq=(side==0?seqX:seqY);
std::vector<PathItem>::const_iterator r=path.begin();
std::vector<PathItem>::const_iterator r_end=path.end();
std::ostringstream os;
while(r!=r_end)
{
Sequence::size_type n=r->get(side);
os << (n==-1?'-':seq->at(n));
++r;
}
return std::auto_ptr<Sequence>(new CCSequence(os.str()));
}
public:
Path(const SequencePtr seqX,const SequencePtr seqY,std::vector<PathItem> path):Sequence(),seqX(seqX),seqY(seqY),path(path)
{
}

virtual ~Path()
{
}

const SequencePtr X() const
{
return seqX;
}
const SequencePtr Y() const
{
return seqY;
}
std::auto_ptr<Sequence> consensusX() const
{
return _asSeq(0);
}
std::auto_ptr<Sequence> consensusY() const
{
return _asSeq(1);
}
std::auto_ptr<std::string> mid() const
{
std::vector<PathItem>::const_iterator r=path.begin();
std::vector<PathItem>::const_iterator r_end=path.end();
std::ostringstream os;
while(r!=r_end)
{
if(r->x==-1 || r->y==-1)
{
os << ' ';
}
else if(std::toupper(X()->at(r->x))==std::toupper(Y()->at(r->y)))
{
os << '|';
}
else
{
os << ' ';
}
++r;
}
return std::auto_ptr<std::string>(new std::string(os.str()));
}
virtual size_type size() const
{
return path.size();
}
virtual char at(size_type index) const
{
const PathItem& i=path.at(index);
if(i.x==-1 && i.y==-1) return '-';
if(i.x==-1) return Y()->at(i.y);
if(i.y==-1) return X()->at(i.x);
char cx= X()->at(i.x);
char cy= Y()->at(i.y);
return cx==cy?cx:'X';
}

std::ostream& print(std::ostream& out) const
{
out << *(consensusX()) << "\n"
<< *(mid()) << "\n"
<< *(consensusY()) << "\n"
;
return out;
}
}*PathPtr;

SubstitutionMatrix

SubstitutionMatrix is an abstract class. Its function compare returns the cost of the substitution of the symbol 'c1' by the symbol 'c2':
template<typename SCORE>
class SubstitutionMatrix
{
public:
virtual SCORE compare(char c1,char c2) const=0;
};

Identity

Identity is a simple implementation of SubstitutionMatrix. It returns 1 if the symbols are identical, else it returns -1.
template<typename SCORE>
class Identity:public SubstitutionMatrix<SCORE>
{
public:
virtual SCORE compare(char c1,char c2) const
{
return (SCORE)(std::toupper(c1)==std::toupper(c2)?1:-1);
}
};

Blosum

Blosum is an abstract implementation of SubstitutionMatrix for a BLOSUM matrix:
/** cf. ftp://ftp.ncbi.nih.gov/blast/matrices */
template<typename SCORE>
class Blosum:public SubstitutionMatrix<SCORE>
{
protected:
virtual int aa2index(char c) const
{
switch(std::toupper(c))
{
case 'A': return 0;
(...)
}
}
public:
virtual SCORE compare(char c1,char c2) const=0;
};

Blosum62

It is a concrete implementation of Blosum
static const signed char _blosum62[]={
4,-1,-2,-2,0,-1,-1,0,-2,-1,-1,-1,-1,-2,-1,1,0,-3,-2,0,-2,-1,0,-4,
(...)
};
/** ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 */
template<typename SCORE>
class Blosum62:public Blosum<SCORE>
{
public:
virtual SCORE compare(char c1,char c2) const
{
return (SCORE)::_blosum62[this->aa2index(c1)*24+ this->aa2index(c2)];
}
};

Array2D

This abstract class holds the matrix of type 'T' for the dynamic algorithm. For a given matrix we need to know its width, its height , getthe value at position(x,y) and we want to set the value at position(x,y).
emplate<typename T>
class Array2D
{
public:
Array2D() {}
virtual ~Array2D() {}
virtual size_type width() const=0;
virtual size_type height() const=0;
virtual T get(size_type x,size_type y) const=0;
virtual void set(size_type x,size_type y,T val)=0;
protected:
std::size_t offset(size_type x,size_type y) const
{
return (this->width())*y + x;
}
};

DefaultArray2D

DefaultArray2D is a first concrete implementation of Array2D. It holds everything in memory:
template<typename T>
class DefaultArray2D:public Array2D<T>
{
protected:
size_type _width;
size_type _height;
T* matrix;

public:
DefaultArray2D( size_type width, size_type height):_width(width),_height(height)
{
matrix=new T[width*height];
}

virtual ~DefaultArray2D()
{
delete [] matrix;
}

virtual size_type width() const
{
return this->_width;
}

virtual size_type height() const
{
return this->_height;
}

virtual T get(size_type x,size_type y) const
{
return matrix[this->offset(x,y)];
}
virtual void set(size_type x,size_type y,T val)
{
matrix[this->offset(x,y)]=val;
}
};

StoredArray2D

StoredArray2D is second lousy concrete implementation of Array2D. It stores the matrix in a temporary file:
template<typename T>
class StoredArray2D:public Array2D<T>
{
private:
size_type _width;
size_type _height;
/** FILE pointer to the binary file */
mutable FILE* io_ptr;
void move(size_type x, size_type y) const
{
if(std::fseek (io_ptr,this->offset(x,y)*sizeof(T), SEEK_SET)!=0)
{
throw std::runtime_error("Cannot fseek");
}
}
public:
StoredArray2D( size_type width, size_type height):_width(width),_height(height),io_ptr(NULL)
{
io_ptr= std::tmpfile();
if(io_ptr==NULL) throw std::runtime_error("Cannot open tmp file");
T data;
std::memset(&data,0,sizeof(T));
for(std::size_t i=0;i< (size_t)(width*height);++i)
{
if(std::fwrite((void*)&data,sizeof(T),1,io_ptr)!=1)
{
std::fclose(io_ptr);
io_ptr=NULL;
throw std::runtime_error("write matrix");
}
}
}

virtual ~StoredArray2D()
{
if(io_ptr!=NULL) std::fclose(io_ptr);
}

virtual size_type width() const
{
return this->_width;
}

virtual size_type height() const
{
return this->_height;
}

virtual T get(size_type x,size_type y) const
{
T data;
move(x,y);
if(std::fread((void*)&data,sizeof(T),1,io_ptr)!=1)
{
throw std::runtime_error("cannot read");
}
return data;
}
virtual void set(size_type x,size_type y,T data)
{
move(x,y);
if(std::fwrite((void*)&data,sizeof(T),1,io_ptr)!=1)
{
throw std::runtime_error("cannot write");
}
}
};

Array2DFactory

Array2DFactory is a factory for an Array2D. It will allow our program to smoothly switch between any kind of Array2D:
template<typename T>
class Array2DFactory
{
private:
bool stored;
public:
//typedef typename Array2D<T>::size_type size_type;
Array2DFactory():stored(false) {}
virtual ~Array2DFactory() {}

void setStored(bool stored)
{
this->stored=stored;
}
bool isStored() const
{
return this->stored;
}

/** creates a new Array2D */
Array2D<T>* newArray(
size_type width,
size_type height
) const
{
if(isStored())
{
return new StoredArray2D<T>(width,height);
}
return new DefaultArray2D<T>(width,height);
}
};

Penaly

Penalty is an abstract class that will used to fill the edge of the matrix. It returns the score for inserting a symbol in a given sequence at a given position!
template<typename SCORE>
class Penalty
{
public:
Penalty() {}
virtual ~Penalty() {}
virtual SCORE get(const SequencePtr seq,int position) const=0;
};

DefaultPenalty

DefaultPenalty is a concrete implementation of Penalty. It uses a constant value for any position of the sequence.
template<typename SCORE>
class DefaultPenalty:public Penalty<SCORE>
{
private:
SCORE value;
public:
DefaultPenalty(SCORE value):value(value) {}
virtual ~DefaultPenalty() {}
virtual SCORE get(const SequencePtr seq,int position) const
{
return value;
}
};

Aligner

Aligner is an abstract pairwise aligner. It contains two Sequences, an Array2DFactory, a SubstitutionMatrix, etc... Its function align aligns the two sequences and path return the Path for the two aligned sequences:
template<typename T,typename SCORE>
class Aligner
{
private:
/** horizontal sequence */
SequencePtr seqX;
/** vertical sequence */
SequencePtr seqY;
/** array2D factory */
Array2DFactory<T>* array2dfactory;
/** substitution matrix */
SubstitutionMatrix<SCORE>* subsitutionMatrix;
/** penalty X */
Penalty<SCORE>* penaltyX;
/** penalty Y */
Penalty<SCORE>* penaltyY;
protected:
Aligner():seqX(NULL),seqY(NULL),
array2dfactory(NULL),
subsitutionMatrix(NULL)
{
}
public:
(...)
virtual SCORE align()=0;
virtual std::auto_ptr<Path> path()=0;
};

Needleman

Needleman is an implementation of Aligner for the Needleman & Wunsch algorithm.
template<typename T,typename SCORE>
class Needleman:public Aligner<T,SCORE>
{
private:
/** current matrix */
Array2D<T>* matrix;
public:
Needleman( ):matrix(NULL)
{

}

virtual ~Needleman()
{
}

/** clear the internal data if it exists */
virtual void clear()
{
Aligner<T,SCORE>::clear();
if(matrix!=NULL) delete matrix;
matrix=NULL;
}

virtual SCORE align()
{
/* clear the previous Array2D */
clear();
/* ask the factory for a new Array2D */
this->matrix = this->getArray2DFactory()->newArray(this->X()->size()+1, this->Y()->size()+1);

for(Sequence::size_type x=0;x<= this->X()->size();++x)
{
this->matrix->set(x,0,x*this->getPenaltyX()->get(this->X(),x) );
}
for(Sequence::size_type y=0;y<= this->Y()->size();++y)
{
this->matrix->set(0,y,y*this->getPenaltyY()->get(this->Y(),y));
}
for(Sequence::size_type x=1;x<= this->X()->size();++x)
{
for(Sequence::size_type y=1;y<= this->Y()->size();++y)
{
SCORE diag = matrix->get(x-1,y-1)+
this->compare(x-1,y-1);
SCORE delet= matrix->get(x-1,y) + this->getPenaltyX()->get(this->X(),x);
SCORE inser= matrix->get(x,y-1) + this->getPenaltyY()->get(this->Y(),y);
matrix->set(x,y,std::max(diag,std::max(delet,inser)));
}
}
return matrix->get(this->X()->size(),this->Y()->size());
}

public:

virtual std::auto_ptr<Path> path()
{
std::vector<PathItem> items;
Sequence::size_type x= this->X()->size();
Sequence::size_type y= this->Y()->size();
while (x>0 && y>0)
{
SCORE diag= matrix->get(x-1,y-1);
SCORE up = matrix->get(x, y- 1);
SCORE left = matrix->get(x-1,y);

if (diag >= up && diag >= left)
{
items.push_back(PathItem(x-1,y-1));
--x;
--y;
}
else if (left> up)
{
items.push_back(PathItem(x-1,-1));
--x;
}
else
{
items.push_back(PathItem(-1,y-1));
--y;
}
}

while (x > 0)
{
items.push_back(PathItem(x-1,-1));
--x;
}
while (y > 0)
{
items.push_back(PathItem(-1,y-1));
--y;
}
std::reverse(items.begin(),items.end());
return std::auto_ptr<Path>(new Path(this->X(),this->Y(),items));
}


};

SWaterman

SWaterman is an implementation of Aligner for the Smith & Waterman.
template<typename T,typename SCORE>
class SWaterman:public Aligner<T,SCORE>
{
private:
/** current matrix */
Array2D<T>* matrix;
SCORE best_score;
Sequence::size_type best_x;
Sequence::size_type best_y;
public:
SWaterman( ):matrix(NULL),best_score(0),best_x(0),best_y(0)
{

}

virtual ~SWaterman()
{
}

/** clear the internal data if it exists */
virtual void clear()
{
Aligner<T,SCORE>::clear();
if(matrix!=NULL) delete matrix;
matrix=NULL;
best_x=0;
best_y=0;
best_score=0;
}

virtual SCORE align()
{
/* clear the previous Array2D */
clear();
/* ask the factory for a new Array2D */
this->matrix = this->getArray2DFactory()->newArray(this->X()->size()+1, this->Y()->size()+1);

for(Sequence::size_type x=0;x<= this->X()->size();++x)
{
this->matrix->set(x,0,0);
}
for(Sequence::size_type y=0;y<= this->Y()->size();++y)
{
this->matrix->set(0,y,0);
}
for(Sequence::size_type x=1;x<= this->X()->size();++x)
{
for(Sequence::size_type y=1;y<= this->Y()->size();++y)
{
SCORE diag = matrix->get(x-1,y-1)+
this->compare(x-1,y-1);
SCORE delet= matrix->get(x-1,y) + this->getPenaltyX()->get(this->X(),x);
SCORE inser= matrix->get(x,y-1) + this->getPenaltyY()->get(this->Y(),y);

SCORE here= std::max((SCORE)0,std::max(diag,std::max(delet,inser)));
matrix->set(x,y,here);
if(best_score<here)
{
best_score=here;
best_x=x;
best_y=y;
}
}
}
return matrix->get(this->X()->size(),this->Y()->size());
}

public:

virtual std::auto_ptr<Path> path()
{
std::vector<PathItem> items;
Sequence::size_type x= best_x;
Sequence::size_type y= best_y;
while (x>0 && y>0 && matrix->get(x,y)!=0)
{
SCORE diag= matrix->get(x-1,y-1);
SCORE up = matrix->get(x, y- 1);
SCORE left = matrix->get(x-1,y);

if (diag >= up && diag >= left)
{
items.push_back(PathItem(x-1,y-1));
--x;
--y;
}
else if (left> up)
{
items.push_back(PathItem(x-1,-1));
--x;
}
else
{
items.push_back(PathItem(-1,y-1));
--y;
}
}

std::reverse(items.begin(),items.end());
return std::auto_ptr<Path>(new Path(this->X(),this->Y(),items));
}
};

AlignerFactory

AlignerFactory is a factory for an Aligner. It will allow our program to smoothly switch between any kind of Aligner:
template<typename T,typename SCORE>
class AlignerFactory
{
private:
bool local;
public:
AlignerFactory():local(false)
{
}

void setLocal(bool local) { this->local=local;}
bool isLocal() const { return local;}
std::auto_ptr<Aligner<T,SCORE> > newAligner()
{
if(isLocal())
{
return std::auto_ptr<Aligner<T,SCORE> >(new SWaterman<T,SCORE>());
}
return std::auto_ptr<Aligner<T,SCORE> >(new Needleman<T,SCORE>());
}
};

All in one, the full source code

#include <cstring>
#include <cstdlib>
#include <cstdio>
#include <cctype>
#include <algorithm>
#include <cassert>
#include <iostream>
#include <iomanip>
#include <stdexcept>
#include <vector>
#include <memory>
#include <sstream>

typedef signed long size_type;


typedef class Sequence
{
public:
typedef signed long size_type;
Sequence()
{
}
virtual ~Sequence()
{
}

virtual size_type size() const=0;
virtual char at(size_type index) const=0;
char operator[](size_type index) const
{
return this->at(index);
}
virtual std::ostream& print(std::ostream& out) const
{
size_type L=size();
for(size_type i=0;i< L;++i) out << at(i);
return out;
}
}*SequencePtr;

std::ostream& operator << (std::ostream& out, const Sequence& seq)
{
return seq.print(out);
}

class CCSequence:public Sequence
{
private:
std::string str;
public:
CCSequence(const std::string& str):Sequence(),str(str)
{
}
virtual ~CCSequence()
{
}
virtual size_type size() const
{
return (size_type)str.size();
}
virtual char at(size_type index) const
{
return str[index];
}
};

class PtrSequence:public Sequence
{
private:
const char* str;
std::size_t length;
public:
PtrSequence(const char* str,size_t length):Sequence(),str(str),length(length)
{
assert(str!=NULL);
assert(length>=0);
}
PtrSequence(const char* str):str(str),length(0UL)
{
assert(str!=NULL);
length=std::strlen(str);
}
virtual ~PtrSequence()
{
}
virtual size_type size() const
{
return (size_type)this->length;
}
virtual char at(size_type index) const
{
assert(index>=0);
assert(index<size());
return str[index];
}
};

typedef struct PathItem
{
Sequence::size_type x;
Sequence::size_type y;
PathItem(Sequence::size_type x,Sequence::size_type y):x(x),y(y)
{
}
Sequence::size_type get(int side) const { return (side==0?x:y);}
}*PathItemPtr;

typedef class Path: public Sequence
{
private:
const SequencePtr seqX;
const SequencePtr seqY;
std::vector<PathItem> path;

std::auto_ptr<Sequence> _asSeq(int side) const
{
const SequencePtr seq=(side==0?seqX:seqY);
std::vector<PathItem>::const_iterator r=path.begin();
std::vector<PathItem>::const_iterator r_end=path.end();
std::ostringstream os;
while(r!=r_end)
{
Sequence::size_type n=r->get(side);
os << (n==-1?'-':seq->at(n));
++r;
}
return std::auto_ptr<Sequence>(new CCSequence(os.str()));
}
public:
Path(const SequencePtr seqX,const SequencePtr seqY,std::vector<PathItem> path):Sequence(),seqX(seqX),seqY(seqY),path(path)
{
}

virtual ~Path()
{
}

const SequencePtr X() const
{
return seqX;
}
const SequencePtr Y() const
{
return seqY;
}
std::auto_ptr<Sequence> consensusX() const
{
return _asSeq(0);
}
std::auto_ptr<Sequence> consensusY() const
{
return _asSeq(1);
}
std::auto_ptr<std::string> mid() const
{
std::vector<PathItem>::const_iterator r=path.begin();
std::vector<PathItem>::const_iterator r_end=path.end();
std::ostringstream os;
while(r!=r_end)
{
if(r->x==-1 || r->y==-1)
{
os << ' ';
}
else if(std::toupper(X()->at(r->x))==std::toupper(Y()->at(r->y)))
{
os << '|';
}
else
{
os << ' ';
}
++r;
}
return std::auto_ptr<std::string>(new std::string(os.str()));
}
virtual size_type size() const
{
return path.size();
}
virtual char at(size_type index) const
{
const PathItem& i=path.at(index);
if(i.x==-1 && i.y==-1) return '-';
if(i.x==-1) return Y()->at(i.y);
if(i.y==-1) return X()->at(i.x);
char cx= X()->at(i.x);
char cy= Y()->at(i.y);
return cx==cy?cx:'X';
}

std::ostream& print(std::ostream& out) const
{
out << *(consensusX()) << "\n"
<< *(mid()) << "\n"
<< *(consensusY()) << "\n"
;
return out;
}
}*PathPtr;



std::ostream& operator << (std::ostream& out, const Path& path)
{
return path.print(out);
}


template<typename SCORE>
class SubstitutionMatrix
{
public:
virtual SCORE compare(char c1,char c2) const=0;
};



template<typename SCORE>
class Identity:public SubstitutionMatrix<SCORE>
{
public:
virtual SCORE compare(char c1,char c2) const
{
return (SCORE)(std::toupper(c1)==std::toupper(c2)?1:-1);
}
};

/** cf. ftp://ftp.ncbi.nih.gov/blast/matrices */
template<typename SCORE>
class Blosum:public SubstitutionMatrix<SCORE>
{
protected:
virtual int aa2index(char c) const
{
switch(std::toupper(c))
{
case 'A': return 0;
case 'R': return 1;
case 'N': return 2;
case 'D': return 3;
case 'C': return 4;
case 'Q': return 5;
case 'E': return 6;
case 'G': return 7;
case 'H': return 8;
case 'I': return 9;
case 'L': return 10;
case 'K': return 11;
case 'M': return 12;
case 'F': return 13;
case 'P': return 14;
case 'S': return 15;
case 'T': return 16;
case 'W': return 17;
case 'Y': return 18;
case 'V': return 19;
case 'B': return 20;
case 'Z': return 21;
case 'X': return 22;
default: return 23;
}
}
public:
virtual SCORE compare(char c1,char c2) const=0;
};


static const signed char _blosum62[]={
4,-1,-2,-2,0,-1,-1,0,-2,-1,-1,-1,-1,-2,-1,1,0,-3,-2,0,-2,-1,0,-4,
-1,5,0,-2,-3,1,0,-2,0,-3,-2,2,-1,-3,-2,-1,-1,-3,-2,-3,-1,0,-1,-4,
-2,0,6,1,-3,0,0,0,1,-3,-3,0,-2,-3,-2,1,0,-4,-2,-3,3,0,-1,-4,
-2,-2,1,6,-3,0,2,-1,-1,-3,-4,-1,-3,-3,-1,0,-1,-4,-3,-3,4,1,-1,-4,
0,-3,-3,-3,9,-3,-4,-3,-3,-1,-1,-3,-1,-2,-3,-1,-1,-2,-2,-1,-3,-3,-2,-4,
-1,1,0,0,-3,5,2,-2,0,-3,-2,1,0,-3,-1,0,-1,-2,-1,-2,0,3,-1,-4,
-1,0,0,2,-4,2,5,-2,0,-3,-3,1,-2,-3,-1,0,-1,-3,-2,-2,1,4,-1,-4,
0,-2,0,-1,-3,-2,-2,6,-2,-4,-4,-2,-3,-3,-2,0,-2,-2,-3,-3,-1,-2,-1,-4,
-2,0,1,-1,-3,0,0,-2,8,-3,-3,-1,-2,-1,-2,-1,-2,-2,2,-3,0,0,-1,-4,
-1,-3,-3,-3,-1,-3,-3,-4,-3,4,2,-3,1,0,-3,-2,-1,-3,-1,3,-3,-3,-1,-4,
-1,-2,-3,-4,-1,-2,-3,-4,-3,2,4,-2,2,0,-3,-2,-1,-2,-1,1,-4,-3,-1,-4,
-1,2,0,-1,-3,1,1,-2,-1,-3,-2,5,-1,-3,-1,0,-1,-3,-2,-2,0,1,-1,-4,
-1,-1,-2,-3,-1,0,-2,-3,-2,1,2,-1,5,0,-2,-1,-1,-1,-1,1,-3,-1,-1,-4,
-2,-3,-3,-3,-2,-3,-3,-3,-1,0,0,-3,0,6,-4,-2,-2,1,3,-1,-3,-3,-1,-4,
-1,-2,-2,-1,-3,-1,-1,-2,-2,-3,-3,-1,-2,-4,7,-1,-1,-4,-3,-2,-2,-1,-2,-4,
1,-1,1,0,-1,0,0,0,-1,-2,-2,0,-1,-2,-1,4,1,-3,-2,-2,0,0,0,-4,
0,-1,0,-1,-1,-1,-1,-2,-2,-1,-1,-1,-1,-2,-1,1,5,-2,-2,0,-1,-1,0,-4,
-3,-3,-4,-4,-2,-2,-3,-2,-2,-3,-2,-3,-1,1,-4,-3,-2,11,2,-3,-4,-3,-2,-4,
-2,-2,-2,-3,-2,-1,-2,-3,2,-1,-1,-2,-1,3,-3,-2,-2,2,7,-1,-3,-2,-1,-4,
0,-3,-3,-3,-1,-2,-2,-3,-3,3,1,-2,1,-1,-2,-2,0,-3,-1,4,-3,-2,-1,-4,
-2,-1,3,4,-3,0,1,-1,0,-3,-4,0,-3,-3,-2,0,-1,-4,-3,-3,4,1,-1,-4,
-1,0,0,1,-3,3,4,-2,0,-3,-3,1,-1,-3,-1,0,-1,-3,-2,-2,1,4,-1,-4,
0,-1,-1,-1,-2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-2,0,0,-2,-1,-1,-1,-1,-1,-4,
-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,-4,1
};

/** ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 */
template<typename SCORE>
class Blosum62:public Blosum<SCORE>
{
private:

const static signed char blosum62[];
public:
virtual SCORE compare(char c1,char c2) const
{
return (SCORE)::_blosum62[this->aa2index(c1)*24+ this->aa2index(c2)];
}
};




template<typename T>
class Array2D
{
public:
//typedef unsigned int size_type;
Array2D() {}
virtual ~Array2D() {}

virtual size_type width() const=0;
virtual size_type height() const=0;
virtual T get(size_type x,size_type y) const=0;
virtual void set(size_type x,size_type y,T val)=0;
virtual std::ostream& print(std::ostream& out) const
{
for(size_type j=0;j< height();++j)
{
for(size_type i=0;i< width();++i)
{
if(i>0) out << " ";
out << std::setw(4) << get(i,j);
}
out << std::endl;
}
return out;
}
protected:
std::size_t offset(size_type x,size_type y) const
{
return (this->width())*y + x;
}
};


template<typename T>
class DefaultArray2D:public Array2D<T>
{
protected:
size_type _width;
size_type _height;
T* matrix;

public:
DefaultArray2D( size_type width, size_type height):_width(width),_height(height)
{
matrix=new T[width*height];
}

virtual ~DefaultArray2D()
{
delete [] matrix;
}

virtual size_type width() const
{
return this->_width;
}

virtual size_type height() const
{
return this->_height;
}

virtual T get(size_type x,size_type y) const
{
return matrix[this->offset(x,y)];
}
virtual void set(size_type x,size_type y,T val)
{
matrix[this->offset(x,y)]=val;
}
};

template<typename T>
class StoredArray2D:public Array2D<T>
{
private:
size_type _width;
size_type _height;
/** FILE pointer to the binary file */
mutable FILE* io_ptr;
void move(size_type x, size_type y) const
{
if(std::fseek (io_ptr,this->offset(x,y)*sizeof(T), SEEK_SET)!=0)
{
throw std::runtime_error("Cannot fseek");
}
}
public:
StoredArray2D( size_type width, size_type height):_width(width),_height(height),io_ptr(NULL)
{
io_ptr= std::tmpfile();
if(io_ptr==NULL) throw std::runtime_error("Cannot open tmp file");
T data;
std::memset(&data,0,sizeof(T));
for(std::size_t i=0;i< (size_t)(width*height);++i)
{
if(std::fwrite((void*)&data,sizeof(T),1,io_ptr)!=1)
{
std::fclose(io_ptr);
io_ptr=NULL;
throw std::runtime_error("write matrix");
}
}
}

virtual ~StoredArray2D()
{
if(io_ptr!=NULL) std::fclose(io_ptr);
}

virtual size_type width() const
{
return this->_width;
}

virtual size_type height() const
{
return this->_height;
}

virtual T get(size_type x,size_type y) const
{
T data;
move(x,y);
if(std::fread((void*)&data,sizeof(T),1,io_ptr)!=1)
{
throw std::runtime_error("cannot read");
}
return data;
}
virtual void set(size_type x,size_type y,T data)
{
move(x,y);
if(std::fwrite((void*)&data,sizeof(T),1,io_ptr)!=1)
{
throw std::runtime_error("cannot write");
}
}
};


template<typename T>
class Array2DFactory
{
private:
bool stored;
public:
//typedef typename Array2D<T>::size_type size_type;
Array2DFactory():stored(false) {}
virtual ~Array2DFactory() {}

void setStored(bool stored)
{
this->stored=stored;
}
bool isStored() const
{
return this->stored;
}

/** creates a new Array2D */
Array2D<T>* newArray(
size_type width,
size_type height
) const
{
if(isStored())
{
return new StoredArray2D<T>(width,height);
}
return new DefaultArray2D<T>(width,height);
}
};


template<typename SCORE>
class Penalty
{
public:
Penalty() {}
virtual ~Penalty() {}
virtual SCORE get(const SequencePtr seq,int position) const=0;
};

template<typename SCORE>
class DefaultPenalty:public Penalty<SCORE>
{
private:
SCORE value;
public:
DefaultPenalty(SCORE value):value(value) {}
virtual ~DefaultPenalty() {}
virtual SCORE get(const SequencePtr seq,int position) const
{
return value;
}
};

template<typename T,typename SCORE>
class Aligner
{
private:
/** horizontal sequence */
SequencePtr seqX;
/** vertical sequence */
SequencePtr seqY;
/** array2D factory */
Array2DFactory<T>* array2dfactory;
/** substitution matrix */
SubstitutionMatrix<SCORE>* subsitutionMatrix;
/** penalty X */
Penalty<SCORE>* penaltyX;
/** penalty Y */
Penalty<SCORE>* penaltyY;
protected:


Aligner():seqX(NULL),seqY(NULL),
array2dfactory(NULL),
subsitutionMatrix(NULL)
{
}
public:
/** clear the internal data if it exists */
virtual void clear()
{
}

virtual ~Aligner()
{
clear();
}
/** returns the horizontal sequence */
virtual const SequencePtr X() const
{
return this->seqX;
}
/** returns the vertical sequence */
virtual const SequencePtr Y() const
{
return this->seqY;
}
virtual void setX(SequencePtr seqX) { this->seqX=seqX; clear();}
virtual void setY(SequencePtr seqY) { this->seqY=seqY; clear();}
virtual void setArray2DFactory(Array2DFactory<T>* array2dfactory) { this->array2dfactory=array2dfactory; clear();}
virtual const Array2DFactory<T>* getArray2DFactory() const { return this->array2dfactory;}
virtual void setSubstitutionMatrix(SubstitutionMatrix<SCORE>* subsitutionMatrix) { this->subsitutionMatrix=subsitutionMatrix; clear();}
virtual const SubstitutionMatrix<SCORE>* getSubstitutionMatrix() const { return this->subsitutionMatrix;}
virtual void setPenaltyX(Penalty<SCORE>* penaltyX) { this->penaltyX=penaltyX; clear();}
virtual Penalty<SCORE>* getPenaltyX() const { return this->penaltyX;}
virtual void setPenaltyY(Penalty<SCORE>* penaltyY) { this->penaltyY=penaltyY; clear();}
virtual Penalty<SCORE>* getPenaltyY() const { return this->penaltyY;}

virtual SCORE align()=0;
virtual std::auto_ptr<Path> path()=0;
protected:
/** shortcut to compare X[x] and Y[y] */
int compare(Sequence::size_type posx,Sequence::size_type posy) const
{
return getSubstitutionMatrix()->compare(X()->at(posx),Y()->at(posy));
}
};

template<typename T,typename SCORE>
class Needleman:public Aligner<T,SCORE>
{
private:
/** current matrix */
Array2D<T>* matrix;
public:
Needleman( ):matrix(NULL)
{

}

virtual ~Needleman()
{
}

/** clear the internal data if it exists */
virtual void clear()
{
Aligner<T,SCORE>::clear();
if(matrix!=NULL) delete matrix;
matrix=NULL;
}

virtual SCORE align()
{
/* clear the previous Array2D */
clear();
/* ask the factory for a new Array2D */
this->matrix = this->getArray2DFactory()->newArray(this->X()->size()+1, this->Y()->size()+1);

for(Sequence::size_type x=0;x<= this->X()->size();++x)
{
this->matrix->set(x,0,x*this->getPenaltyX()->get(this->X(),x) );
}
for(Sequence::size_type y=0;y<= this->Y()->size();++y)
{
this->matrix->set(0,y,y*this->getPenaltyY()->get(this->Y(),y));
}
for(Sequence::size_type x=1;x<= this->X()->size();++x)
{
for(Sequence::size_type y=1;y<= this->Y()->size();++y)
{
SCORE diag = matrix->get(x-1,y-1)+
this->compare(x-1,y-1);
SCORE delet= matrix->get(x-1,y) + this->getPenaltyX()->get(this->X(),x);
SCORE inser= matrix->get(x,y-1) + this->getPenaltyY()->get(this->Y(),y);
matrix->set(x,y,std::max(diag,std::max(delet,inser)));
}
}
return matrix->get(this->X()->size(),this->Y()->size());
}

public:

virtual std::auto_ptr<Path> path()
{
std::vector<PathItem> items;
Sequence::size_type x= this->X()->size();
Sequence::size_type y= this->Y()->size();
while (x>0 && y>0)
{
SCORE diag= matrix->get(x-1,y-1);
SCORE up = matrix->get(x, y- 1);
SCORE left = matrix->get(x-1,y);

if (diag >= up && diag >= left)
{
items.push_back(PathItem(x-1,y-1));
--x;
--y;
}
else if (left> up)
{
items.push_back(PathItem(x-1,-1));
--x;
}
else
{
items.push_back(PathItem(-1,y-1));
--y;
}
}

while (x > 0)
{
items.push_back(PathItem(x-1,-1));
--x;
}
while (y > 0)
{
items.push_back(PathItem(-1,y-1));
--y;
}
std::reverse(items.begin(),items.end());
return std::auto_ptr<Path>(new Path(this->X(),this->Y(),items));
}


};


template<typename T,typename SCORE>
class SWaterman:public Aligner<T,SCORE>
{
private:
/** current matrix */
Array2D<T>* matrix;
SCORE best_score;
Sequence::size_type best_x;
Sequence::size_type best_y;
public:
SWaterman( ):matrix(NULL),best_score(0),best_x(0),best_y(0)
{

}

virtual ~SWaterman()
{
}

/** clear the internal data if it exists */
virtual void clear()
{
Aligner<T,SCORE>::clear();
if(matrix!=NULL) delete matrix;
matrix=NULL;
best_x=0;
best_y=0;
best_score=0;
}

virtual SCORE align()
{
/* clear the previous Array2D */
clear();
/* ask the factory for a new Array2D */
this->matrix = this->getArray2DFactory()->newArray(this->X()->size()+1, this->Y()->size()+1);

for(Sequence::size_type x=0;x<= this->X()->size();++x)
{
this->matrix->set(x,0,0);
}
for(Sequence::size_type y=0;y<= this->Y()->size();++y)
{
this->matrix->set(0,y,0);
}
for(Sequence::size_type x=1;x<= this->X()->size();++x)
{
for(Sequence::size_type y=1;y<= this->Y()->size();++y)
{
SCORE diag = matrix->get(x-1,y-1)+
this->compare(x-1,y-1);
SCORE delet= matrix->get(x-1,y) + this->getPenaltyX()->get(this->X(),x);
SCORE inser= matrix->get(x,y-1) + this->getPenaltyY()->get(this->Y(),y);

SCORE here= std::max((SCORE)0,std::max(diag,std::max(delet,inser)));
matrix->set(x,y,here);
if(best_score<here)
{
best_score=here;
best_x=x;
best_y=y;
}
}
}
return matrix->get(this->X()->size(),this->Y()->size());
}

public:

virtual std::auto_ptr<Path> path()
{
std::vector<PathItem> items;
Sequence::size_type x= best_x;
Sequence::size_type y= best_y;
while (x>0 && y>0 && matrix->get(x,y)!=0)
{
SCORE diag= matrix->get(x-1,y-1);
SCORE up = matrix->get(x, y- 1);
SCORE left = matrix->get(x-1,y);

if (diag >= up && diag >= left)
{
items.push_back(PathItem(x-1,y-1));
--x;
--y;
}
else if (left> up)
{
items.push_back(PathItem(x-1,-1));
--x;
}
else
{
items.push_back(PathItem(-1,y-1));
--y;
}
}

std::reverse(items.begin(),items.end());
//matrix->print(std::cout);
//std::cout << best_score << "=" << best_x<<"," << best_y << std::endl;
return std::auto_ptr<Path>(new Path(this->X(),this->Y(),items));
}


};

template<typename T,typename SCORE>
class AlignerFactory
{
private:
bool local;
public:
AlignerFactory():local(false)
{
}

void setLocal(bool local) { this->local=local;}
bool isLocal() const { return local;}
std::auto_ptr<Aligner<T,SCORE> > newAligner()
{
if(isLocal())
{
return std::auto_ptr<Aligner<T,SCORE> >(new SWaterman<T,SCORE>());
}
return std::auto_ptr<Aligner<T,SCORE> >(new Needleman<T,SCORE>());
}
};




typedef signed long penalty_t;

/**
*
* main
*
*/
int main(int argc,char **argv)
{
try
{
int optind=1;
bool align_local=false;
bool stored_matrix=false;
bool use_blosum62=true;
penalty_t gX=-5;
penalty_t gY=-5;
while(optind < argc)
{
if(std::strcmp(argv[optind],"-h")==0)
{
std::cout << argv[0] << ". Pierre Lindenbaum PhD. Compiled on " << __DATE__ << " at " << __TIME__ << "\n" <<
"options\n" <<
" -h Help (this screen)\n" <<
" -L local-alignement\n" <<
" -s stored matrix (slower)\n" <<
" -i identity matrix (instead of blosum62)\n" <<
" -x penalty X (" <<gX <<")\n" <<
" -y penalty Y (" <<gY <<")\n" <<
"seq1 seq2\n" <<
std::endl;
std::exit(EXIT_FAILURE);
}
else if(std::strcmp(argv[optind],"-L")==0)
{
align_local=true;
}
else if(std::strcmp(argv[optind],"-s")==0)
{
stored_matrix=true;
}
else if(std::strcmp(argv[optind],"-i")==0)
{
use_blosum62=false;
}
else if(std::strcmp(argv[optind],"-x")==0)
{
gX=(penalty_t)atol(argv[++optind]);
}
else if(std::strcmp(argv[optind],"-y")==0)
{
gY=(penalty_t)atol(argv[++optind]);
}
else if(std::strcmp(argv[optind],"--")==0)
{
++optind;
break;
}
else if(argv[optind][0]=='-')
{
std::cerr << "unknown option " << argv[optind] << std::endl;
std::exit(EXIT_FAILURE);
}
else
{
break;
}
++optind;
}
if(optind+2!=argc)
{
std::cerr << "expected only two args"<< std::endl;
std::exit(EXIT_FAILURE);
}
Blosum62<penalty_t> blosum;
Identity<penalty_t> identity;


Array2DFactory<int> factory;
factory.setStored(stored_matrix);

DefaultPenalty<penalty_t> penalty_x(gX);
DefaultPenalty<penalty_t> penalty_y(gY);
AlignerFactory<int,penalty_t> algofactory;
algofactory.setLocal(align_local);

std::auto_ptr<Aligner<int,penalty_t> > algo = algofactory.newAligner();

algo->setArray2DFactory(&factory);
algo->setSubstitutionMatrix(use_blosum62?(SubstitutionMatrix<penalty_t>*) &blosum :(SubstitutionMatrix<penalty_t>*) &identity);
algo->setPenaltyX(&penalty_x);
algo->setPenaltyY(&penalty_y);


PtrSequence seq1(argv[optind++]);
PtrSequence seq2(argv[optind++]);

algo->setX(&seq1);
algo->setY(&seq2);
algo->align();

std::cout << (*(algo->path())) << std::endl;

}
catch(std::exception& err)
{
std::cerr << err.what() << std::endl;
}
catch(...)
{
std::cerr << "BOUM" << std::endl;
}
return EXIT_SUCCESS;
}

Compilation


g++ -Wall -O3 needle.cpp

Examples


./a.out -h
./a.out. Pierre Lindenbaum PhD. Compiled on Jul 8 2010 at 23:03:05
options
-h Help (this screen)
-L local-alignement
-s stored matrix (slower)
-i identity matrix (instead of blosum62)
-x penalty X (-5)
-y penalty Y (-5)
seq1 seq2


time ./a.out MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGN MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR
MERQK-----RKADIEKGLQFIQSTLP--LKQEEYEAFLLKLVQNLFAEGN
| | | ||||||| | || | | ||
MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR-------


real 0m0.006s
user 0m0.004s
sys 0m0.004s


time ./a.out -s MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGN MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR

MERQK-----RKADIEKGLQFIQSTLP--LKQEEYEAFLLKLVQNLFAEGN
| | | ||||||| | || | | ||
MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR-------


real 0m0.037s
user 0m0.004s
sys 0m0.032s


./a.out -i MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGN MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR

MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGN
| | | |
MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR


./a.out -L MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGN MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR

ME--RQKRKADIEKGLQFIQSTLP--LKQEEYEAFLLKLVQ
| || | ||||||| | || | | ||
VSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR


./a.out -x -5 -y -20 -L MERQKRKADIEKGLQFIQSTLPLKQEEYEAFLLKLVQNLFAEGN MSSVSEERRKRQQNIKEGLQFIQSPLSYPGTQEQYAVYLHALVR

ME--RQKRKADIEKGLQFIQSTL
| || | ||||||| |
VSEERRKRQQNIKEGLQFIQSPL


That's it

Pierre

3 comments:

Bilouweb said...

Hi Pierre.

I just have a question because I also play with C++.

You say these classes are not optimized and it's just for fun.

Why do you rewrite an interface "Sequence" with implementation classes "CCsequence" and "PtrSequence" instead of using a simple class like this :

Class Sequence : public string {}

String possess all the feature you need for sequence, doesn't it ?

Pierre Lindenbaum said...

because I wanted to use an interface rather than a class and there is no need to copy everything in memory. A sequence could be just a substring of another (non-mutable) string: i just need its start offset and its length: here there would be no copy of the string.

One could also imagine that this string would be stored in a database, in a flat file , in a mmap, etc... etc....

rch said...

Nice Pierre - thanks for sharing some fun and interesting code. It makes for a succinct example that manages to illustrate a wide variety of cross-discipline concepts.

Cheers!