In previous posts, we explored embedding documents within PDF/A-3 files using 4D Write Pro, including generating electronic invoices. As e-invoicing becomes mandatory in various European countries—such as France and Germany—the ability to extract embedded XML files from these PDFs is becoming crucial. But the utility of PDF/A-3 extends beyond invoices; these files can embed various document types that may require extraction.
While tools like Adobe Acrobat Reader offer manual extraction, this post introduces an efficient, automated method using 4D 20 R6 to easily handle the process.
A dedicated component
A dedicated component has been developed to extract all enclosed files into memory, so that you can then handle each one as you need. For example, an XML file can be parsed directly into a DOM tree, while a picture file can be saved to disk.
Component manager
This component is easy to install with the component manager, which is part of 4D 20 R6. Just create a dependencies.json file inside your project's Sources folder and add the following lines to it. The component is installed when your project is launched. The provided HDI is based on this mechanism.
{
    "dependencies": {
        "4D-QPDF": {
            "github": "4d/4D-QPDF",
            "version": "*"
        }
    }
}
Extraction
A single method from the component returns both the list and the content of the documents embedded in a PDF file.
$colAttachments:=PDF Get attachments($PDFfile)
This method returns a collection of objects, each describing one attachment and holding its content.
Each object has attributes such as name, extension, fullName, mimeType, and content, among others. They are fully described in the HDI and in the component documentation.
$path:="/DATA/TestPDFs/ManyEnclosures.pdf"
$PDFfile:=File($path; fk posix path)
Form.attachments:=PDF Get attachments($PDFfile)
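Once the collection is in memory, saving every enclosure to disk takes only a short loop. The sketch below is an illustration, not code from the HDI: it assumes that content holds the raw bytes of the attachment as a blob and that fullName is a file name including its extension. The destination folder ("Extracted" on the desktop) is an arbitrary choice for this example.

```4d
var $PDFfile; $target : 4D.File
var $folder : 4D.Folder
var $attachments : Collection
var $attachment : Object

$PDFfile:=File("/DATA/TestPDFs/ManyEnclosures.pdf"; fk posix path)
$attachments:=PDF Get attachments($PDFfile)

// Destination folder: hypothetical, chosen for this example only
$folder:=Folder(fk desktop folder).folder("Extracted")
If (Not($folder.exists))
	$folder.create()
End if

For each ($attachment; $attachments)
	// Assumption: fullName is "name.ext" and content is a blob
	$target:=$folder.file($attachment.fullName)
	$target.setContent($attachment.content)
End for each
```

Attachments that share the same fullName would overwrite each other here, so a real project may want to check $target.exists before writing.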
Of course, in the case of electronic invoices, you can go straight to the embedded XML file!
// Keep only the XML attachments
$XMLattachments:=PDF Get attachments($PDFfile).query("mimeType = :1"; "text/xml")
If ($XMLattachments.length#0)
	// Decode the first match and parse it into a DOM tree
	$xml:=BLOB to text($XMLattachments[0].content; UTF8 text without length)
	$dom:=DOM Parse XML variable($xml)
End if
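Not every PDF producer labels its invoice XML as text/xml, and application/xml is also in common use. A slightly more defensive variant, sketched below, also matches on the file extension and checks the OK system variable before using the DOM reference. It assumes that the extension attribute ends with "xml" (with or without a leading dot), which is why the @ wildcard is used. The exact values returned by the component should be checked against its documentation.

```4d
var $PDFfile : 4D.File
var $XMLattachments : Collection
var $xml; $dom; $rootName : Text

$PDFfile:=File("/DATA/TestPDFs/ManyEnclosures.pdf"; fk posix path)

// Match on MIME type or, failing that, on the extension (assumed format)
$XMLattachments:=PDF Get attachments($PDFfile).query("mimeType = :1 or mimeType = :2 or extension = :3"; "text/xml"; "application/xml"; "@xml")

If ($XMLattachments.length#0)
	$xml:=BLOB to text($XMLattachments[0].content; UTF8 text without length)
	$dom:=DOM Parse XML variable($xml)
	If (OK=1)  // parsing succeeded
		// Read the root element name, for instance to tell invoice formats apart
		DOM GET XML ELEMENT NAME($dom; $rootName)
		// ... process the invoice here ...
		DOM CLOSE XML($dom)  // release the tree once it is no longer needed
	End if
End if
```

Closing the DOM reference with DOM CLOSE XML as soon as processing is finished avoids keeping the parsed tree in memory longer than necessary.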
The component documentation on GitHub describes more functions (check, update, etc.) in detail.
An example of an extraction dialog
The provided HDI demonstrates how to display the enclosures in a list box based on the returned collection.
Conclusion
This component, with its source code available on GitHub, leverages the QPDF library (a freely available open-source solution) to provide a reliable method for extracting attachments from PDF/A-3 files within your 4D applications. It's ready to use out of the box via the component manager, and it can be customized as your needs grow or as the QPDF library evolves.