PDF/A-3 Attachment Extraction

In previous posts, we explored embedding documents within PDF/A-3 files using 4D Write Pro, including generating electronic invoices. As e-invoicing becomes mandatory in various European countries—such as France and Germany—the ability to extract embedded XML files from these PDFs is becoming crucial. But the utility of PDF/A-3 extends beyond invoices; these files can embed various document types that may require extraction.

While tools like Adobe Acrobat Reader offer manual extraction, this post introduces an efficient, automated method using 4D 20 R6 to easily handle the process.

HDI: PDF/A3 File Extraction

A dedicated component

A dedicated component lets you extract all enclosed files into memory and then handle them however you need. For example, an XML file can be parsed directly into a DOM tree, while a picture file can be saved to disk.

COMPONENT MANAGER

This component is easy to install with the component manager, which is part of 4D 20 R6. Just create a dependencies.json file inside your Sources folder and add the following lines. The component will be installed as soon as your project is launched. The provided HDI is based on this mechanism.

{
  "dependencies": {
    "4D-QPDF": {
      "github": "4d/4D-QPDF",
      "version": "*"
    }
  }
}

 

Extraction

A single method from the component returns both the list and the content of the documents embedded in a PDF file.

$colAttachments:=PDF Get attachments ($PDFfile)

This method returns a collection containing objects, each describing and containing an attachment.

Each object has attributes such as name, extension, fullName, mimeType, and content, all fully described in the HDI and in the component documentation.

    $path:="/DATA/TestPDFs/ManyEnclosures.pdf"
    $PDFfile:=File($path; fk posix path)
    Form.attachments:=PDF Get attachments($PDFfile)
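Once the collection is available, each attachment can be handled individually. As a minimal sketch, the loop below writes every attachment to a folder on the desktop. It assumes that fullName holds the file name including its extension and that content is a BLOB accepted by File.setContent(); the "Extracted" folder name is only an illustration.

```4d
var $PDFfile; $target : 4D.File
var $outFolder : 4D.Folder
var $attachments : Collection
var $attachment : Object

$PDFfile:=File("/DATA/TestPDFs/ManyEnclosures.pdf"; fk posix path)
$attachments:=PDF Get attachments($PDFfile)

// Hypothetical destination folder, chosen for this example only
$outFolder:=Folder(fk desktop folder).folder("Extracted")
$outFolder.create()  // returns False without error if the folder already exists

For each ($attachment; $attachments)
	// Assumption: fullName is name + extension, content is a BLOB
	$target:=$outFolder.file($attachment.fullName)
	$target.setContent($attachment.content)
End for each
```

Because the attachments are held in memory, nothing is written to disk unless you ask for it, so the same loop could just as well filter on mimeType or extension before saving.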

Of course, in the case of electronic invoices, you can go straight to the embedded XML file!

    $XMLattachments:=PDF Get attachments($PDFfile).query("mimeType = :1"; "text/xml")
    If ($XMLattachments.length#0)
    	$xml:=BLOB to text($XMLattachments[0].content; UTF8 text without length)
    	$dom:=DOM Parse XML variable($xml)
    End if 
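A DOM tree created with DOM Parse XML variable stays in memory until it is closed. The sketch below extends the snippet above under the same assumptions (a BLOB content attribute and a "text/xml" MIME type): it checks that parsing succeeded, reads the name of the root element as a simple illustration, and then releases the tree.

```4d
var $PDFfile : 4D.File
var $XMLattachments : Collection
var $xml; $dom; $rootName : Text

$PDFfile:=File("/DATA/TestPDFs/ManyEnclosures.pdf"; fk posix path)
$XMLattachments:=PDF Get attachments($PDFfile).query("mimeType = :1"; "text/xml")

If ($XMLattachments.length>0)
	$xml:=BLOB to text($XMLattachments[0].content; UTF8 text without length)
	$dom:=DOM Parse XML variable($xml)
	If (OK=1)  // OK is set to 1 when parsing succeeded
		DOM GET XML ELEMENT NAME($dom; $rootName)  // name of the root element
		DOM CLOSE XML($dom)  // free the memory used by the DOM tree
	End if
End if
```

Which elements to read after that depends on the invoice format in use, so the root element name is only a starting point.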

The component documentation on GitHub describes more functions (check, update, etc.) in detail.

An example of an extraction dialog

The provided HDI demonstrates how to display the enclosures in a list box based on the returned collection.

     

Conclusion

This component, with its source code available on GitHub, leverages the QPDF library (a freely available open-source solution) to provide a reliable method for extracting attachments from PDF/A-3 files within your 4D applications. It’s ready to use out of the box via the component manager, with flexibility for customization as your needs grow or the QPDF library evolves.

Roland Lannuzel
• Product Owner & 4D Expert • After studying electronics, Roland went into industrial IT as a developer and consultant, building solutions for customers with a variety of databases and technologies. In the late 80’s he fell in love with 4D and has used it in writing business applications that include accounting, billing and email systems. Eventually joining the company in 1997, Roland’s valuable contributions include designing specifications, testing tools, demos as well as training and speaking to the 4D community at many conferences. He continues to actively shape the future of 4D by defining new features and database development tools.