27 February 2013

4 Tools I wrote today to annotate VCF files.

Here are four java-bases tools I wrote today for @SolenaLS to annotate VCF files.

VCFTabix

VCFTabix : we want to use the VCF of the 1Kgenomes project (ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) as a source of annotation for our VCF.

Options

-f (vcf indexed with tabix) REQUIRED.
 -T (tag String) VCF-INFO-ID optional can be used several times.
 -R doesn't use REF allele
 -A doesn't use ALT allele
 -I don't replace ID if it exists.
 -F don't replace INFO field if it exists.
 -C (TAG) use this tag in case of conflict with the ALT allele.

Example:

java -jar dist/vcftabix.jar -f ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz \
   -T AC -T SNPSOURCE  input.vcf 

##fileformat=VCFv4.1
(....)
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabix.VCFTabix:ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CD10977
5       2999819 rs21234 A       G       1620    .       AC=2156;AF=1.00;AN=2;DB;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.44;MQ0=0;QD=39.51;SB=-6.519e-03;SNPSOURCE=LOWCOV;       GT:AD:DP:GQ:PL  1/1:0,41:41:99:1653,123,0
(...)


EVS2BED

EVS2BED: bulk download of the XML data of EVS into a BED file CHROM/START/END/XML (see my related post: http://plindenbaum.blogspot.fr/2012/05/downloading-xml-data-from-exome-variant.html )

Example

Download 10 records and index the data with tabix:
java -jar dist/evs2bed.jar -L 10 | LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
  /path/totabix-0.2.6/bgzip -c > test.bed.gz
/path/totabix-0.2.6/tabix -p bed -f test.bed.gz


$ gunzip -c  test.bed.gz  | cut -c 1-500 | fold -w 80
1 69427 69428 <snpList><positionString>1:69428</positionString><chrPos
ition>69428</chrPosition><alleles>G/T</alleles><uaAlleleCounts>G=313/T=6535</uaA
lleleCounts><aaAlleleCounts>G=14/T=3808</aaAlleleCounts><totalAlleleCounts>G=327
/T=10343</totalAlleleCounts><uaMAF>4.5707</uaMAF><aaMAF>0.3663</aaMAF><totalMAF>
3.0647</totalMAF><avgSampleReadDepth>110</avgSampleReadDepth><geneList>OR4F5</ge
neList><snpFunction><chromosome>1</chromosome><position>69428</position><conserv
ationScore>1.0</conservationSc
1 69475 69476 <snpList><positionString>1:69476</positionString><chrPos
ition>69476</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=7020</uaAll
eleCounts><aaAlleleCounts>C=0/T=3908</aaAlleleCounts><totalAlleleCounts>C=2/T=10
928</totalAlleleCounts><uaMAF>0.0285</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0183</
totalMAF><avgSampleReadDepth>123</avgSampleReadDepth><geneList>OR4F5</geneList><
snpFunction><chromosome>1</chromosome><position>69476</position><conservationSco
re>0.6</conservationScore><con
1 69495 69496 <snpList><positionString>1:69496</positionString><chrPos
ition>69496</chrPosition><alleles>A/G</alleles><uaAlleleCounts>A=2/G=6764</uaAll
eleCounts><aaAlleleCounts>A=23/G=3785</aaAlleleCounts><totalAlleleCounts>A=25/G=
10549</totalAlleleCounts><uaMAF>0.0296</uaMAF><aaMAF>0.604</aaMAF><totalMAF>0.23
64</totalMAF><avgSampleReadDepth>91</avgSampleReadDepth><geneList>OR4F5</geneLis
t><snpFunction><chromosome>1</chromosome><position>69496</position><conservation
Score>0.5</conservationScore><
1 69510 69511 <snpList><positionString>1:69511</positionString><chrPos
ition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=5337/A=677</uaA
lleleCounts><aaAlleleCounts>G=1937/A=1623</aaAlleleCounts><totalAlleleCounts>G=7
274/A=2300</totalAlleleCounts><uaMAF>11.2571</uaMAF><aaMAF>45.5899</aaMAF><total
MAF>24.0234</totalMAF><avgSampleReadDepth>69</avgSampleReadDepth><geneList>OR4F5
</geneList><snpFunction><chromosome>1</chromosome><position>69511</position><con
servationScore>1.0</conservati
1 69589 69590 <snpList><positionString>1:69590</positionString><chrPos
ition>69590</chrPosition><alleles>A/T</alleles><uaAlleleCounts>A=0/T=6214</uaAll
eleCounts><aaAlleleCounts>A=1/T=3555</aaAlleleCounts><totalAlleleCounts>A=1/T=97
69</totalAlleleCounts><uaMAF>0.0</uaMAF><aaMAF>0.0281</aaMAF><totalMAF>0.0102</t
otalMAF><avgSampleReadDepth>119</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69590</position><conservationScor
e>0.8</conservationScore><cons
1 69593 69594 <snpList><positionString>1:69594</positionString><chrPos
ition>69594</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=6190</uaAll
eleCounts><aaAlleleCounts>C=0/T=3548</aaAlleleCounts><totalAlleleCounts>C=2/T=97
38</totalAlleleCounts><uaMAF>0.0323</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0205</t
otalMAF><avgSampleReadDepth>109</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69594</position><conservationScor
e>0.8</conservationScore><cons
1 69619 69620 <snpList><positionString>1:69620</positionString><chrPos
ition>69620</chrPosition><alleles>T/TA</alleles><uaAlleleCounts>A1=2/R=4694</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2954</aaAlleleCounts><totalAlleleCounts>A1=2/
R=7648</totalAlleleCounts><uaMAF>0.0426</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.026
1</totalMAF><avgSampleReadDepth>56</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69620</position><conservationS
core>0.7</conservationScore><c
1 69744 69745 <snpList><positionString>1:69745</positionString><chrPos
ition>69745</chrPosition><alleles>CA/C</alleles><uaAlleleCounts>A1=4/R=4254</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2716</aaAlleleCounts><totalAlleleCounts>A1=4/
R=6970</totalAlleleCounts><uaMAF>0.0939</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.057
4</totalMAF><avgSampleReadDepth>10</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69745</position><conservationS
core>0.9</conservationScore><c
1 69760 69761 <snpList><positionString>1:69761</positionString><chrPos
ition>69761</chrPosition><alleles>T/A</alleles><uaAlleleCounts>T=645/A=4093</uaA
lleleCounts><aaAlleleCounts>T=62/A=2840</aaAlleleCounts><totalAlleleCounts>T=707
/A=6933</totalAlleleCounts><uaMAF>13.6133</uaMAF><aaMAF>2.1365</aaMAF><totalMAF>
9.2539</totalMAF><avgSampleReadDepth>8</avgSampleReadDepth><geneList>OR4F5</gene
List><snpFunction><chromosome>1</chromosome><position>69761</position><conservat
ionScore>0.1</conservationScor

VCFTabixml

VCFTabixml retrieves the annotations from a BED (CHROM/START/END/XML) and, using XLST, annotates a VCF.

Example

In the following example, a few variations are downloaded from EVS using EVS2BED. Then a small VCF is created and annotated using the bed and the following XSLT stylesheet: evs2vcf.xsl.
java -jar dist/evs2bed.jar -L 10  |\
       LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
       tabix-0.2.6/bgzip -c > test.bed.gz

/tabix-0.2.6/tabix -p bed -f test.bed.gz
java -jar  dist/vcftabixml.jar \
   -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf



echo "#CHROM POS ID REF ALT QUAL FILTER INFO" > test.vcf
echo "1 1 . C T . . T=test1" >> test.vcf
echo "1 69476 . T C . . T=test2" >> test.vcf
echo "1 69511 . A C . . T=test3" >> test.vcf 
java -jar  dist/vcftabixml.jar \
    -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf

Result:
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabixml.VCFTabixml:test.bed.gz
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T . . T=test1;
1 69476 . T C . . T=test2;EVS_UAMAF=0.0285;EVS_AAMAF=0.0;EVS_TOTALMAF=0.0183;EVS_AVGSAMPLEREADDEPTH=123;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=0.6;EVS_CONSERVATIONSCOREGERP=2.3;EVS_RSIDS=rs148502021;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;
1 69511 . A C . . T=test3;EVS_UAMAF=11.2571;EVS_AAMAF=45.5899;EVS_TOTALMAF=24.0234;EVS_AVGSAMPLEREADDEPTH=69;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=1.0;EVS_CONSERVATIONSCOREGERP=1.1;EVS_RSIDS=rs75062661;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;EVS_CONFLICTALT=G;

VCFBigWig

VCFBigWig use a bigwig file to annotate a VCF file.

Example:

add the GERP score to a VCF:
$ java -jar dist/vcfbigwig.jar -f /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw  test.vep.vcf
##fileformat=VCFv4.1
##Annotated with class fr.inserm.umr1087.jvarkit.tools.vcfbigwig.VCFBigWig:/commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw
##INFO=<ID=All_hg19_RS_noprefix,Number=1,Type=Float,Description="Annotations from /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
22      12345   .       G       T       100.0   .       All_hg19_RS_noprefix=4.829999923706055

4 comments:

István Albert said...

Only four? The whole day? :-)

Pierre Lindenbaum said...

well... I started yesterday. The sources are small, that was not Rocket Science :-)

Knaus said...

why you no try www.gene-talk.de ?

there you can annotate, communicate, filter, and so on...

Knaus said...

Dear Pierre Lindenbaum, I admire your work, therefore I would like to know what you think about GeneTalk.
It would be nice if you would visit www.gene-talk.de and give some critics.

thanks

Alex