04 September 2009

My Notes about Bowtie


Image via wikipedia

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie is open source http://bowtie.cbcb.umd.edu (Langmead B, & al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biol. 2009;10(3):R25.)

Building an indexed Database


The fasta sequences retrieved from the query
"ROTAVIRUS[organism] VP1 Human"
were saved to the file ./genomes/rotavirus.fna.
The file was then indexed with ./bowtie-build
./bowtie-build genomes/rotavirus.fna indexes/rotavirus

Settings:
Output files: "rotavirus.*.ebwt"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 5 (one in 32)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Reference base cutoff: none
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:4, int:4, long:4, size_t:4
Input files DNA, FASTA:
genomes/rotavirus.fna
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
= 316434 (0 characters of padding)
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 79108
Using parameters --bmax 59331 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 59331 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 6; iterating...
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 0; iterating...
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 39553.4 (target: 59330)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 38593
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 38594
Getting block 2 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 53650
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 53651
Getting block 3 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 45711
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 45712
Getting block 4 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 18201
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 18202
Getting block 5 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 56112
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 56113
Getting block 6 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 24508
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 24509
Getting block 7 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 52958
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 52959
Getting block 8 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 26694
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 26695
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 117471
fchr[G]: 165360
fchr[T]: 220740
fchr[$]: 316434
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4302635 bytes to primary EBWT file: rotavirus.1.ebwt
Wrote 39560 bytes to secondary EBWT file: rotavirus.2.ebwt
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 316434
bwtLen: 316435
sz: 79109
bwtSz: 79109
lineRate: 6
linesPerSide: 1
offRate: 5
offMask: 0xffffffe0
isaRate: -1
isaMask: 0xffffffff
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 9889
offsSz: 39556
isaLen: 0
isaSz: 0
lineSz: 64
sideSz: 64
sideBwtSz: 56
sideBwtLen: 224
numSidePairs: 707
numSides: 1414
numLines: 1414
ebwtTotLen: 90496
ebwtTotSz: 90496
Total time for call to driver() for forward index: 00:00:01
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
= 316434 (0 characters of padding)
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 79108
Using parameters --bmax 59331 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 59331 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 7; iterating...
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 45204 (target: 59330)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 55484
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 55485
Getting block 2 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 58214
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 58215
Getting block 3 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 58224
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 58225
Getting block 4 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 33666
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 33667
Getting block 5 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 48159
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 48160
Getting block 6 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 45260
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 45261
Getting block 7 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 17421
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 17422
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 117471
fchr[G]: 165360
fchr[T]: 220740
fchr[$]: 316434
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4302635 bytes to primary EBWT file: rotavirus.rev.1.ebwt
Wrote 39560 bytes to secondary EBWT file: rotavirus.rev.2.ebwt
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 316434
bwtLen: 316435
sz: 79109
bwtSz: 79109
lineRate: 6
linesPerSide: 1
offRate: 5
offMask: 0xffffffe0
isaRate: -1
isaMask: 0xffffffff
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 9889
offsSz: 39556
isaLen: 0
isaSz: 0
lineSz: 64
sideSz: 64
sideBwtSz: 56
sideBwtLen: 224
numSidePairs: 707
numSides: 1414
numLines: 1414
ebwtTotLen: 90496
ebwtTotSz: 90496
Total time for backward call to driver() for mirror index: 00:00:00


Searching


The short sequences found with the query "NCBI: ROTAVIRUS[organism] 10:150[SLEN] bovine" were downloaded to ./reads/short-rota.fasta.
Those sequences were then mapped on the indexed sequences:
./bowtie -f rotavirus reads/short-rota.fasta | java -jar ~/lindenb/build/verticalize.jar -n
Reported 14 alignments to 1 output stream(s)
>>1
$1 ? : gi|62781718|gb|AR643436.1| Sequence 6 from patent US 6867021
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7504
$5 ? : GTGGCTTCCATTAGAAGCATG
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 1
<<1
>>2
$1 ? : gi|62781717|gb|AR643435.1| Sequence 5 from patent US 6867021
$2 ? : +
$3 ? : gi|66774335|gb|AH014892.1|SEG_DQ005104S Human rotavirus A strain DRC88 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7230
$5 ? : ACCACCAAATATGACACCAGC
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 1
<<2
>>3
$1 ? : gi|56638523|gb|AR590758.1| Sequence 32 from patent US 6805867
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 5633
$5 ? : GGATGGCCAACAGGATC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
$8 ? : 2:T>A,5:T>A
<<3
>>4
$1 ? : gi|56638522|gb|AR590757.1| Sequence 31 from patent US 6805867
$2 ? : +
$3 ? : gi|5706619|gb|AF106283.1|AF106283 Human rotavirus G2 strain TA25 VP7 protein (VP7) gene, complete cds
$4 ? : 22
$5 ? : GTATGGTATTGAATATACCAC
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 16
<<4
>>5
$1 ? : gi|56638509|gb|AR590744.1| Sequence 18 from patent US 6805867
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7658
$5 ? : TAGTGAGAGGATGTGACC
$6 ? : IIIIIIIIIIIIIIIIII
$7 ? : 1
<<5
>>6
$1 ? : gi|56638508|gb|AR590743.1| Sequence 17 from patent US 6805867
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 6320
$5 ? : GGCTTTTAAACGAAGTC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
<<6
>>7
$1 ? : gi|56638505|gb|AR590740.1| Sequence 14 from patent US 6805867
$2 ? : -
$3 ? : gi|66774335|gb|AH014892.1|SEG_DQ005104S Human rotavirus A strain DRC88 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 15298
$5 ? : TGTGGAGATATGACC
$6 ? : IIIIIIIIIIIIIII
$7 ? : 1
<<7
>>8
$1 ? : gi|56638504|gb|AR590739.1| Sequence 13 from patent US 6805867
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 12626
$5 ? : GGCTATTAAAGGT
$6 ? : IIIIIIIIIIIII
$7 ? : 1
$8 ? : 12:C>T
<<8
>>9
$1 ? : gi|10003339|gb|AR076593.1|AR076593 Sequence 32 from patent US 5959093
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 5633
$5 ? : GGATGGCCAACAGGATC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
$8 ? : 2:T>A,5:T>A
<<9
>>10
$1 ? : gi|10003338|gb|AR076592.1|AR076592 Sequence 31 from patent US 5959093
$2 ? : +
$3 ? : gi|5706619|gb|AF106283.1|AF106283 Human rotavirus G2 strain TA25 VP7 protein (VP7) gene, complete cds
$4 ? : 22
$5 ? : GTATGGTATTGAATATACCAC
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 16
<<10
>>11
$1 ? : gi|10003325|gb|AR076579.1|AR076579 Sequence 18 from patent US 5959093
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7658
$5 ? : TAGTGAGAGGATGTGACC
$6 ? : IIIIIIIIIIIIIIIIII
$7 ? : 1
<<11
>>12
$1 ? : gi|10003324|gb|AR076578.1|AR076578 Sequence 17 from patent US 5959093
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 6320
$5 ? : GGCTTTTAAACGAAGTC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
<<12
>>13
$1 ? : gi|10003321|gb|AR076575.1|AR076575 Sequence 14 from patent US 5959093
$2 ? : -
$3 ? : gi|66774335|gb|AH014892.1|SEG_DQ005104S Human rotavirus A strain DRC88 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 15298
$5 ? : TGTGGAGATATGACC
$6 ? : IIIIIIIIIIIIIII
$7 ? : 1
<<13
>>14
$1 ? : gi|10003320|gb|AR076574.1|AR076574 Sequence 13 from patent US 5959093
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 12626
$5 ? : GGCTATTAAAGGT
$6 ? : IIIIIIIIIIIII
$7 ? : 1
$8 ? : 12:C>T
<<14
(the verticalize is a small utility I wrote, it transforms a horizontal output to a vertical one).


In the 9th result says : the reverse sequence of "gi|10003339" is GGATGGCCAACAGGATC . This sequence matches "gi|78064499" at position '5633' there are two mismatches:2:T>A,5:T>A. Control: running a blas2seq on those two sequences gives the following result:


>gb|AR076593.1|AR076593 Sequence 32 from patent US 5959093
Length=17

Score = 22.9 bits (24), Expect = 0.014
Identities = 15/17 (88%), Gaps = 0/17 (0%)
Strand=Plus/Minus


Query 5634 GGATGGCCAACTGGTTC 5650
||||||||||| || ||
Sbjct 17 GGATGGCCAACAGGATC 1


That's it
Pierre

No comments: