22 February 2011

A flex scanner extracting the metadata from a PDF file.

4 years ago, I played with the adobe XMP library to extract the XMP metadata from a set of PDF files.

Today, I was suprised to simply display the XMP data contained in a PDF from Nature by using the following command line:

curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
strings |\
grep -A 100 "<x:xmpmeta"


I've generalized this process by implementing a GNU-flex scanner that prints the XML content between two <xmp:xmpmeta/> tags. The source code is available on github at: https://github.com/lindenb/ccsandbox/blob/master/src/xmpextractor.l.

Compilation

flex -f -B --read xmpextractor.l
gcc -o xmpextractor lex.yy.c

Testing with "PLOS"

Harper MA, Chen Z, Toy T, Machado IMP, Nelson SF, et al. (2011) Phenotype Sequencing: Identifying the Genes That Cause a Phenotype Directly from Pooled Sequencing of Independent Mutants. PLoS ONE 6(2): e16517. doi:10.1371/journal.pone.0016517:

curl -s "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0016517&representation=PDF" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP toolkit 2.9.1-13, framework 1.6">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:iX="http://ns.adobe.com/iX/1.0/">
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" pdf:Producer="Acrobat Distiller 7.0 (Windows)"/>
<rdf:Description xmlns:xap="http://ns.adobe.com/xap/1.0/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xap:CreateDate="2011-02-14T08:21:58+08:00" xap:CreatorTool="3B2 Total Publishing System 7.51n/W" xap:ModifyDate="2011-02-17T14:32:08+08:00" xap:MetadataDate="2011-02-17T14:32:08+08:00"/>
<rdf:Description xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xapMM:DocumentID="uuid:f98295b5-0980-42f2-884e-6cecd2d75c90" xapMM:InstanceID="uuid:d226393a-bfaf-4a2b-8c28-08215008694e"/>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" dc:format="application/pdf">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">pone.0016517 1..16</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

Hum, nothing really interesting here.

Testing with "Nature Reviews Cardiology"

A test with Percutaneous coronary intervention in the elderly. Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184
curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
./xmpextractor


<?xml version="1.0" encoding="UTF-8"?>
<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:identifier>doi:10.1038/nrcardio.2010.184</dc:identifier>
<dc:creator>
<rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Alt>
</dc:description>
<dc:publisher>
<rdf:Bag>
<rdf:li>Nature Publishing Group</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">&#xA; © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
<prism:copyright>© 2010 Nature Publishing Group</prism:copyright>
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
<prism:eIssn>1759-5010</prism:eIssn>
<prism:endingPage>90</prism:endingPage>
<prism:issn>1759-5002</prism:issn>
<prism:number>2</prism:number>
<prism:publicationName>Nature Publishing Group</prism:publicationName>
<prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
<prism:startingPage>79</prism:startingPage>
<prism:volume>8</prism:volume>
<prism:publicationDate>
<rdf:Bag>
<rdf:li>2010-12-07</rdf:li>
</rdf:Bag>
</prism:publicationDate>
<prism:url>
<rdf:Bag>
<rdf:li>http://dx.doi.org/10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</prism:url>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreateDate>2011-01-10T10:09:23+05:30</xmp:CreateDate>
<xmp:CreatorTool/>
<xmp:Label>Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</xmp:Label>
<xmp:MetadataDate>2011-01-14T18:25:45+05:30</xmp:MetadataDate>
<xmp:ModifyDate>2011-01-14T18:25:45+05:30</xmp:ModifyDate>
<xmp:Identifier>
<rdf:Bag>
<rdf:li>doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</xmp:Identifier>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
<xmpRights:Marked>True</xmpRights:Marked>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:51421664-c9c6-4657-9fbe-318ac969ca26</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

That's more interesting isn't it ?

That's it,

Pierre

5 comments:

Fergus Gallagher said...

"pdfinfo -meta" does the job too.

Pierre Lindenbaum said...

@Fergus ah yes! Thanks, I didn't know pdfinfo :-)

malcook said...

or try this!

sudo cpan Image::ExifTool
...
Image::ExifTool is up to date (8.40).
> curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" | exiftool -scanForXMP -
ExifTool Version Number : 8.40
File Type : PDF
MIME Type : application/pdf
PDF Version : 1.6
Tagged PDF : Yes
XMP Toolkit : Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00
About : doi:10.1038/nrcardio.2010.184
Format : application/pdf
Description : Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184
Publisher : Nature Publishing Group
Rights : . © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.
Copyright : © 2010 Nature Publishing Group
Digital Object Identifier : 10.1038/nrcardio.2010.184
E Issn : 1759-5010
Ending Page : 90
ISSN : 1759-5002
Number : 2
Publication Name : Nature Publishing Group
Rights Agent : permissions@nature.com
Starting Page : 79
Volume : 8
Publication Date : 2010:12:07
URL : http://dx.doi.org/10.1038/nrcardio.2010.184
Creator Tool :
Label : Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184
Metadata Date : 2011:01:14 18:25:45+05:30
Identifier : doi:10.1038/nrcardio.2010.184
Marked : True
Document ID : uuid:51421664-c9c6-4657-9fbe-318ac969ca26
Instance ID : uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e
Page Count : 12
Author : Tracy Y. Wang
Create Date : 2011:01:10 10:09:23+05:30
Creator :
Modify Date : 2011:01:14 18:25:45+05:30
Producer : Adobe PDF Library 8.0
Subject : Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184
Title : Percutaneous coronary intervention in the elderly
mp0717l8ya:~ mec$

Egon Willighagen said...

Strigi can do it too:

xmlindexer nrcardio.2010.184.pdf | xpath -e "string(//text[contains(., \"RDF\")])"

but I have not figured out yet how xmlindexer can index input from STDIN...

Egon Willighagen said...

Strigi can do it too:

xmlindexer nrcardio.2010.184.pdf | xpath -e "string(//text[contains(., \"RDF\")])"

but I have not figured out yet how xmlindexer can index input from STDIN...