30 August 2012

Next Generation Sequencing, GNU-Make and .INTERMEDIATE

I gave a crash course about NGS to a few colleagues today. For my demonstration I wrote a simple Makefile. Basically, it downloads a subset of the human chromosome 22, indexes it with bwa, generates a set of fastqs with wgsim, align the fastqs, generates the *.sai, the *.sam, the *.bam, sorts the bam and calls the SNPs with mpileup.

SAMDIR=/usr/local/package/samtools-0.1.18
SAMTOOLS=$(SAMDIR)/samtools
BCFTOOLS=$(SAMDIR)/bcftools/bcftools
BWA=/usr/local/package/bwa-0.6.1/bwa

%.bam : %.sam
        $(SAMTOOLS) view -o $@ -b -S -T chr22.fa $<
%.bam.bai : %.bam
        $(SAMTOOLS) index $<

variations.vcf: variations.bcf
        $(BCFTOOLS) view $< > $@

variations.bcf : align.sorted.bam align.sorted.bam.bai
            $(SAMTOOLS) mpileup -uf chr22.fa $< | $(BCFTOOLS) view -bvcg - > $@


align.sam : random_1.sai random_2.sai  
        $(BWA) sampe chr22.fa $^ random_1.fq.gz random_2.fq.gz > $@

chr22.fa:
        curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" |\
        gunzip -c | grep -v NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN > $@


random_1.sai :  random_1.fq.gz chr22.fa.bwt
        $(BWA) aln -f $@ chr22.fa $<

random_2.sai :  random_2.fq.gz chr22.fa.bwt
        $(BWA) aln -f $@ chr22.fa $<

random_1.fq.gz random_2.fq.gz : chr22.fa
        $(SAMDIR)/misc/wgsim -N 1000000 $< random_1.fq random_2.fq > wgsim.output
        gzip --best random_1.fq random_2.fq

chr22.fa.bwt : chr22.fa
        $(BWA) index -a bwtsw $<

chr22.fa.fai :  chr22.fa
        $(SAMTOOLS) faidx $< 

align.sorted.bam : align.bam
        $(SAMTOOLS) sort $< align.sorted


clean:
        rm -f chr22.* *.bam *.vcf *.bcf *.sai *.gz *.fq *.bai  wgsim.output *.sam
I was asked about the weakness of "Make":
"The problem here, is that you need to generate a bunch of (possibly huge) files that won't be of use later: the *.sam, the unsorted *.bam, the *.sai, etc... if one of those files is removed everything will be re-compiled even if the final sorted.bam and the VCF exist."

I asked StackOverflow ( http://stackoverflow.com/questions/12199237 ) if a solution to fix this problem would exist and I received a nice solution from Eldar Abusalimov:

Use "intermediate files" feature of GNU Make:


Intermediate files are remade using their rules just like all other files. But intermediate files are treated differently in two ways.


The first difference is what happens if the intermediate file does not exist. If an ordinary file b does not exist, and make considers a target that depends on b, it invariably creates b and then updates the target from b. But if b is an intermediate file, then make can leave well enough alone. It won't bother updating b, or the ultimate target, unless some prerequisite of b is newer than that target or there is some other reason to update that target.


The second difference is that if make does create b in order to update something else, it deletes b later on after it is no longer needed. Therefore, an intermediate file which did not exist before make also does not exist after make. make reports the deletion to you by printing a rm -f command showing which file it is deleting.


Ordinarily, a file cannot be intermediate if it is mentioned in the makefile as a target or prerequisite. However, you can explicitly mark a file as intermediate by listing it as a prerequisite of the special target .INTERMEDIATE. This takes effect even if the file is mentioned explicitly in some other way.


You can prevent automatic deletion of an intermediate file by marking it as a secondary file. To do this, list it as a prerequisite of the special target .SECONDARY. When a file is secondary, make will not create the file merely because it does not already exist, but make does not automatically delete the file. Marking a file as secondary also marks it as intermediate.


So, adding the following line to the Makefile should be enough:


I've added the simple line below to my Makefile:
.INTERMEDIATE : align.sam random_1.sai random_2.sai align.bam variations.bcf

and I've invoked make:
[lindenb@srv-clc-02 20120830.demongs]$ make
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" |\
    gunzip -c | grep -v N | head -n 100000 > chr22.fa
/usr/local/package/samtools-0.1.18/misc/wgsim -N 1000000 chr22.fa random_1.fq random_2.fq > wgsim.output
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 4999950
gzip -f --best random_1.fq random_2.fq
/usr/local/package/bwa-0.6.1/bwa index -a bwtsw chr22.fa
[bwa_index] Pack FASTA... 0.12 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=9999900, availableWord=12703528
[bwt_gen] Finished constructing BWT in 5 iterations.
[bwa_index] 4.02 seconds elapse.
[bwa_index] Update BWT... 0.05 sec
[bwa_index] Pack forward-only FASTA... 0.08 sec
[bwa_index] Construct SA from BWT and Occ... 1.46 sec
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa index -a bwtsw chr22.fa
[main] Real time: 5.740 sec; CPU: 5.740 sec
/usr/local/package/bwa-0.6.1/bwa aln -f random_1.sai chr22.fa random_1.fq.gz
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 26.07 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 26.51 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 26.80 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 21.77 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 1000000 sequences have been processed.
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa aln -f random_1.sai chr22.fa random_1.fq.gz
[main] Real time: 104.702 sec; CPU: 104.675 sec
/usr/local/package/bwa-0.6.1/bwa aln -f random_2.sai chr22.fa random_2.fq.gz
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 28.40 sec
[bwa_aln_core] write to the disk... 0.09 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 28.94 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 29.18 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 23.07 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 1000000 sequences have been processed.
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa aln -f random_2.sai chr22.fa random_2.fq.gz
[main] Real time: 113.270 sec; CPU: 113.233 sec
/usr/local/package/bwa-0.6.1/bwa sampe chr22.fa random_1.sai random_2.sai random_1.fq.gz random_2.fq.gz > align.sam
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 220114 pairs: 499.899 +/- 49.771
[infer_isize] skewness: -0.006; kurtosis: -0.083; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 795 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 7.34 sec
[bwa_sai2sam_pe_core] changing coordinates of 6560 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 16796 out of 16796 Q17 singletons are mated.
[bwa_paired_sw] 27 out of 27 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 4.83 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.97 sec
[bwa_sai2sam_pe_core] print alignments... 2.46 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 219840 pairs: 499.874 +/- 49.751
[infer_isize] skewness: 0.001; kurtosis: -0.072; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 795 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 7.36 sec
[bwa_sai2sam_pe_core] changing coordinates of 6668 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 16833 out of 16834 Q17 singletons are mated.
[bwa_paired_sw] 38 out of 38 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 4.78 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.98 sec
[bwa_sai2sam_pe_core] print alignments... 2.55 sec
[bwa_sai2sam_pe_core] 524288 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 220140 pairs: 500.062 +/- 49.780
[infer_isize] skewness: -0.000; kurtosis: -0.075; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 795 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 7.76 sec
[bwa_sai2sam_pe_core] changing coordinates of 6522 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 16804 out of 16806 Q17 singletons are mated.
[bwa_paired_sw] 33 out of 33 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 4.79 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 1.07 sec
[bwa_sai2sam_pe_core] print alignments... 2.57 sec
[bwa_sai2sam_pe_core] 786432 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 179161 pairs: 499.910 +/- 49.860
[infer_isize] skewness: -0.001; kurtosis: -0.075; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 796 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 6.22 sec
[bwa_sai2sam_pe_core] changing coordinates of 5351 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 13904 out of 13905 Q17 singletons are mated.
[bwa_paired_sw] 27 out of 27 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 3.97 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.86 sec
[bwa_sai2sam_pe_core] print alignments... 2.10 sec
[bwa_sai2sam_pe_core] 1000000 sequences have been processed.
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa sampe chr22.fa random_1.sai random_2.sai random_1.fq.gz random_2.fq.gz
[main] Real time: 70.067 sec; CPU: 69.984 sec
/usr/local/package/samtools-0.1.18/samtools view -o align.bam -b -S -T chr22.fa align.sam
[samopen] SAM header is present: 1 sequences.
/usr/local/package/samtools-0.1.18/samtools sort align.bam align.sorted
/usr/local/package/samtools-0.1.18/samtools index align.sorted.bam
/usr/local/package/samtools-0.1.18/samtools mpileup -uf chr22.fa align.sorted.bam | /usr/local/package/samtools-0.1.18/bcftools/bcftools view -bvcg - > variations.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[bcfview] 100000 sites processed.
[afs] 0:99762.176 1:154.256 2:83.568
[bcfview] 200000 sites processed.
[afs] 0:99790.512 1:146.516 2:62.972
[bcfview] 300000 sites processed.
[afs] 0:99750.990 1:143.270 2:105.740
[bcfview] 400000 sites processed.
[afs] 0:99764.146 1:156.783 2:79.071
[bcfview] 500000 sites processed.
[afs] 0:99749.220 1:177.647 2:73.133
[bcfview] 600000 sites processed.
[afs] 0:99758.419 1:160.146 2:81.435
[bcfview] 700000 sites processed.
[afs] 0:99769.451 1:155.968 2:74.581
[bcfview] 800000 sites processed.
[afs] 0:99761.914 1:149.704 2:88.382
[bcfview] 900000 sites processed.
[afs] 0:99746.200 1:158.466 2:95.334
[bcfview] 1000000 sites processed.
[afs] 0:99732.973 1:177.167 2:89.860
[bcfview] 1100000 sites processed.
[afs] 0:99815.394 1:109.322 2:75.284
[bcfview] 1200000 sites processed.
[afs] 0:99762.584 1:160.194 2:77.222
[bcfview] 1300000 sites processed.
[afs] 0:99748.201 1:155.610 2:96.189
[bcfview] 1400000 sites processed.
[afs] 0:99739.710 1:170.437 2:89.853
[bcfview] 1500000 sites processed.
[afs] 0:99729.049 1:176.956 2:93.995
[bcfview] 1600000 sites processed.
[afs] 0:99747.410 1:163.200 2:89.390
[bcfview] 1700000 sites processed.
[afs] 0:99751.841 1:171.932 2:76.228
[bcfview] 1800000 sites processed.
[afs] 0:99800.303 1:145.575 2:54.122
[bcfview] 1900000 sites processed.
[afs] 0:99828.697 1:126.069 2:45.234
[bcfview] 2000000 sites processed.
[afs] 0:99730.074 1:158.242 2:111.684
[afs] 0:87928.796 1:106.271 2:97.933
/usr/local/package/samtools-0.1.18/bcftools/bcftools view variations.bcf > variations.vcf
rm random_1.sai variations.bcf random_2.sai align.bam align.sam

here is the last line of the output:
rm random_1.sai variations.bcf random_2.sai align.bam align.sam
Makefile has removed the intermediate files
The working directory only contain the final files:
$ ls -la
total 168020
drwxr-xr-x  2 lindenb users    4096 Aug 30 17:46 .
drwxr-xr-x 12 lindenb users    4096 Aug 30 14:18 ..
-rw-r--r--  1 lindenb users 81537078 Aug 30 17:43 align.sorted.bam
-rw-r--r--  1 lindenb users    15096 Aug 30 17:43 align.sorted.bam.bai
-rw-r--r--  1 lindenb users  5099956 Aug 30 17:36 chr22.fa
-rw-r--r--  1 lindenb users      181 Aug 30 17:37 chr22.fa.amb
-rw-r--r--  1 lindenb users      41 Aug 30 17:37 chr22.fa.ann
-rw-r--r--  1 lindenb users  5000048 Aug 30 17:37 chr22.fa.bwt
-rw-r--r--  1 lindenb users      22 Aug 30 17:42 chr22.fa.fai
-rw-r--r--  1 lindenb users  1249989 Aug 30 17:37 chr22.fa.pac
-rw-r--r--  1 lindenb users  2500024 Aug 30 17:37 chr22.fa.sa
-rw-r--r--  1 lindenb users    1344 Aug 30 17:35 Makefile
-rw-r--r--  1 lindenb users 37831888 Aug 30 17:36 random_1.fq.gz
-rw-r--r--  1 lindenb users 37836209 Aug 30 17:36 random_2.fq.gz
-rw-r--r--  1 lindenb users  611760 Aug 30 17:46 variations.vcf
-rw-r--r--  1 lindenb users  101626 Aug 30 17:36 wgsim.output

That's it,

Pierre



1 comment:

Anonymous said...

I wrote an assembly evaluation pipeline as a Makefile. Make gives you a lot of stuff for free, but can be somewhat byzantine when you go into the various "special" targets like this.

You can turn your pipeline into an executable command by putting a hashbang at the top. I use "#!/usr/bin/make -rRf" which also turns off builtin rules and variables to speed things up, and give more sensible debugging output (with -d). Also, you can use -j to parallelize things.