03 March 2009

String Challenge: My (brute-force) solution.

In this post, I present my (brute-force/quick n'dirty) solution to the recent 'String Challenge' submited by Thomas Mailund on his blog: http://www.mailund.dk/index.php/2009/03/02/a-string-algorithms-challenge/. Briefly, here is the problem:

Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.



I wrote a simple java code to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested about how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.

The source code is available here:http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Opps, please remove this extra (c=fin.read();), I cannot change this at this time)


The code



First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to the counts.

File envHome=new File(System.getProperty("java.io.tmpdir"));
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);

//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();


The sequence is then scanned and an array of bytes of length k is filled with the characters.

FileReader fin= new FileReader(file);
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oopps, I should have removed this....


while((c=fin.read())!=-1)
{
if(Character.isWhitespace(c)) continue;
c=Character.toUpperCase(c);
array[array_size++]=(byte)c;

}



This array is filled, searched in the database and the count is incremented back in the BerkeleyDB. The the content of the array is shifted to the left.

key.setData(array);

int count=0;
//is this data already exists ?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
{
count =IntegerBinding.entryToInt(value);
}

IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);

//switch to left
for(int i=1;i< kmer;++i)
{
array[i-1]=array[i];
}
array_size--;


At the end, in order to have a set of ordered results, a reverse database is created . It maps the counts to the k-mers.



//create a second database mapping count to k-mers
cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
cfg.setSortedDuplicates(true);
Database db2= env.openDatabase(null, "occ",cfg);
key = new DatabaseEntry();
value = new DatabaseEntry();

Cursor cursor= db.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(value);
if((count/(float)total)< freq) continue;
db2.put(null, value, key);
}
cursor.close();


This second database is then scanned to print the ordered results.

while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(key);
String seq= new String(value.getData(),value.getOffset(),value.getSize());
System.out.println(seq+"\t"+count+"/"+total);
}


Compilation



javac -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge.java

Excecution


java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge

Test with the Human chr22 (~49E6 bp)


time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 8 -f 0.001 chr22.fa
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430

real 5m44.240s
user 5m56.186s
sys 0m3.178s

time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428

real 8m15.152s
user 8m41.429s
sys 0m3.748s



That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)

No comments: