21 May 2009

XML Pipelines/ XProc for bioinformatics: my notebook

In this post I describe how I used XProc, the XML "pipeline language" to create a workflow of XML data calling the NCBI for some SNP and building a HTML table describing those markers.
W3C:XProc: (the) XML Pipeline Language, (is) a language for describing operations to be performed on XML documents.

An XML Pipeline specifies a sequence of operations to be performed on zero or more XML documents. Pipelines generally accept zero or more XML documents as input and produce zero or more XML documents as output. Pipelines are made up of simple steps which perform atomic operations on XML documents and constructs similar to conditionals, iteration, and exception handlers which control which steps are executed.

The implementation I've choosen is Norman Walsh's XMLCalabash. It seemed to be the de-facto standard implementation. However I found it a little bit slow and I didn't like the fact that it sent 'log' messages to http://xproc.org/. XMLCalabash requires, here, the SAXON and the apache-httpclient libraries.
The XProc language itself was not easy to learn: it is missing some good examples for each feature.

A first workflow


Say , the file "rslist.xml" list of SNP packed in a HTML list:
<ul>
<li>rs25</li>
<li>rs26</li>
</ul>

The folling XProc script reads a XML file and returns the original input.
<p:declare-step name="snp">
<p:documentation>Reads a list of SNP and echoes it</p:documentation>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:identity/>
</p:declare-step>

Here XMLCalabash was called by assigning the port called listOfSnp to our file "rslist.xml"
java -cp ${CLASSPATH} com.xmlcalabash.drivers.Main --input listOfSnp=rslist.xml xproc01.xpl

It returns the original file:
<ul>
<li>rs25</li>
<li>rs26</li>
</ul>

Workflow 2


In this second workflow, we loop over the SNPs and we echo each node. The attribute in <input> @sequence="true" tells xmlcalabash that the result will be a sequence of XML documents.
<p:declare-step name="snp">
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:identity/>
</p:for-each>
</p:declare-step>

And here is the result
<li>rs25</li><li>rs26</li>


Workflow 3


This third workflow loops over each SNP, builds a URI for this SNP pointing to its XML definition at the NCBI
<p:declare-step name="snp">
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
</p:load>
</p:for-each>
</p:declare-step>

Here is the result, two concatened <ExchangeSet> documents :
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www
.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum htt
p://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDN
A" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
<RsLinkout resourceId="1" linkValue="25"/>
<hgvs>NM_015204.1:c.1454-1398A&gt;G</hgvs>
<hgvs>NT_007819.16:g.11073100T&gt;C</hgvs>
</Rs>
(...)
</ExchangeSet>



<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xm
lns="http://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.ni
h.gov/SNP/docsum http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="26" snpClass="mixed" snpType="notwithdrawn" molType="c
DNA" bitField="030100080001000000000700">
<Validation byCluster="true"/>
<Create build="36" date="2000-09-19 17:02"/>
(...)
<hgvs>NM_015204.1:c.1454-727A&gt;G</hgvs>
<hgvs>NT_007819.16:g.11072429T&gt;C</hgvs>
</Rs>
</ExchangeSet>

Workflow 4


This fourth workflow is the same than the previous one, but it uses <p:unwrap> to remove all the children from <ExchangeSet> for each call at the NCBI, and at the end of the workflow, <wrap-sequence wrapper="ExchangeSet"p> is called to merge all those children in a new <ExchangeSet>
<p:declare-step name="snp">
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>

<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>

<p:wrap-sequence wrapper="ncbi:ExchangeSet">
</p:wrap-sequence>
</p:declare-step>


So, at the end, a one and only well defined document is returned:
<ncbi:ExchangeSet xmlns:ncbi="http://www.ncbi.nlm.nih.gov/SNP/docsum">
<Rs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm
.nih.gov/SNP/docsum" rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDNA" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>

(...)

</PrimarySequence>
<RsLinkout resourceId="1" linkValue="26"/>
<hgvs>NM_015204.1:c.1454-727A&gt;G</hgvs>
<hgvs>NT_007819.16:g.11072429T&gt;C</hgvs>
</Rs>

</ncbi:ExchangeSet>

Workflow 5


This time, we well use a list of ENTREZ queries grouped in a HTML list:
<ul>
<li>"snp_gene_clin"[Filter] AND "snp_pubmed_cited"[Filter] AND 2[CHR]</li>
<li>(1000[CHRPOS] : 5000[CHRPOS]) AND 2[CHR] AND "homo sapiens"[Organism]</li>
</ul>

The next workflow calls NCBI ESearch for each query for the SNP database. It then calls NCBI EFetch and retrieve the information about the SNP, using the parameters ( WebEnv and QueryKey) found in the previous EFetch call.
At the end, the ExchangeSet document is transformed into HTML using an XSLT stylesheet defined inline in the workflow:
<p:declare-step name="snp">
<p:input port="queries" primary="true">
</p:input>
<p:output port="result" primary="true"/>

<p:for-each name="loopOverQueries">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="term" select="encode-for-uri(.)"/>
<p:load>
<p:with-option name="href" select="concat('http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&usehistory=y&term=',$term)"/>
</p:load>

<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&WebEnv=', encode-for-uri(/eSearchResult/WebEnv), '&query_key=', encode-for-uri(/eSearchResult/QueryKey) )"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet"/>
<p:xslt name="tr2html">
<p:input port="parameters">
<p:empty/>
</p:input>
<p:input port="stylesheet">
<p:inline>
<xsl:stylesheet version="1.0">
<xsl:output method="html"/>
<xsl:template match="/">
<html><body><table>
<thead>
<tr>
<th>rs#</th>
<th>dbSnpBuild</th>
<th>group</th>
<th>build</th>
<th>Chrom</th>
<th>Position</th>
</tr>
</thead>
<tbody>
<xsl:for-each select="/ncbi:ExchangeSet/ncbi:Rs">
<xsl:variable name="rsId"><xsl:value-of select="@rsId"/></xsl:variable>
<xsl:for-each select="ncbi:Assembly">
<xsl:variable name="dbSnpBuild"><xsl:value-of select="@dbSnpBuild"/></xsl:variable>
<xsl:variable name="groupLabel"><xsl:value-of select="@groupLabel"/></xsl:variable>
<xsl:variable name="genomeBuild"><xsl:value-of select="@genomeBuild"/></xsl:variable>
<xsl:for-each select="ncbi:Component">
<xsl:variable name="chromosome"><xsl:value-of select="@chromosome"/></xsl:variable>
<xsl:for-each select="ncbi:MapLoc">
<xsl:element name="tr">

<td>
<xsl:element name="a">
<xsl:attribute name="href">http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=<xsl:value-of select="$rsId"/></xsl:attribute>rs<xsl:value-of select="$rsId"/></xsl:element>
</td>
<td><xsl:value-of select="$dbSnpBuild"/></td>
<td><xsl:value-of select="$groupLabel"/></td>
<td><xsl:value-of select="$genomeBuild"/></td>
<td>chr<xsl:value-of select="$chromosome"/></td>
<td><xsl:value-of select="@physMapInt"/></td>
</xsl:element>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</tbody>
</table></body></html>
</xsl:template>
</xsl:stylesheet>
</p:inline>
</p:input>
</p:xslt>


</p:declare-step>

And here is the HTML returned:


rs#dbSnpBuildgroupbuildChromPosition
rs28928870129Celera36_3chr249031258
rs28928870129HuRef36_3chr248924538
rs28928870129reference36_3chr249044117
(...)
rs10168026129Celera36_3chr271241
rs10168026129HuRef36_3chr24967
rs10168026129reference36_3chr24484


I've uploaded this workflow in myExperiment:

to my knowledge, this is the first XProc script available there.

That's it !
Pierre

3 comments:

dalloliogm said...

I prefer Makefiles over xml workflows, because they are a lot easier to write (at least for the day-to-day pipelines).

In our laboratory, we have also had some discussion over this. Have a look at these slides I have prepared to introduce make to the others:
- http://bioinfoblog.it/2009/03/seminar-on-makefiles-in-bioinformatics/

cheers :)

Norman Walsh said...

The "phone home" part is optional, you can completely disable it with -Dcom.xmlcalabash.phonehome=false

Pierre Lindenbaum said...

good to know. Thank you Norman