What you can solve with it
FileGazer is modular: choose the right inspectors per file and get a structured analysis result.
ECM/DMS & Archiving
Ingest documents, extract OCR/text & metadata, find index fields (e.g., invoice no., date, IBAN). Validate PDF/A and convert when needed.
Automation (NiFi / ETL)
Use FileGazer as a building block in flows: classify, extract, and route. Ideal for scan ingest, mailroom, and batch processing.
Compliance & Quality
PDF/A validation, hashes for integrity, magic number / file type detection. Reproducible checks for audits and pipelines.
Developer Workflow
REST API returns analysis as XML. Transform via XSLT to JSON/CSV or a custom format. Extensible via the inspector concept.
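For consumers that prefer not to use XSLT, the XML-to-JSON step can also be sketched in a few lines of Python. This is only an illustration: the element and attribute names in the sample (`analysis`, `file`, `mimetype`, `page`) are hypothetical and do not reflect FileGazer's actual result schema.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an XML element into plain dicts, lists and strings."""
    node = dict(elem.attrib)  # attributes become plain keys
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text  # leaf element: just its text
    if text:
        node["#text"] = text
    return node


def xml_to_json(xml_string):
    """Parse an XML string and return it as a JSON string."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


# Hypothetical sample, NOT the real FileGazer schema.
SAMPLE = """<analysis file="/data/test.pdf">
  <mimetype>application/pdf</mimetype>
  <page>1</page>
  <page>2</page>
</analysis>"""
```

Calling `xml_to_json(SAMPLE)` yields a JSON object keyed by the root tag. An XSLT stylesheet remains the better fit when the target format has to follow a fixed layout.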
Get your first result in 5 minutes
Start FileGazer (and optionally Gotenberg for conversions), mount your folder and run an inspection.
docker compose up -d
Place your files in ./data and mount it to /data.
curl -X POST "http://localhost:8080/file/inspect/local?inspector=TIKA" --data-raw "/data/test.pdf"
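The same call can be made from a script. The following is a minimal sketch using only the Python standard library. The URL, the `inspector` query parameter and the request body come from the curl command above, while the function names, the 60-second timeout and the UTF-8 decoding of the response are assumptions made for illustration.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port taken from the curl example


def build_inspect_request(path, inspector="TIKA", base_url=BASE_URL):
    """Build the same POST request as the curl call, without sending it."""
    query = urllib.parse.urlencode({"inspector": inspector})
    url = f"{base_url}/file/inspect/local?{query}"
    # Like curl --data-raw, the body is sent as-is: the file path inside the container.
    return urllib.request.Request(url, data=path.encode("utf-8"), method="POST")


def inspect_local(path, inspector="TIKA"):
    """Send the request and return the response body as text (assumed UTF-8)."""
    request = build_inspect_request(path, inspector)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With FileGazer running, `inspect_local("/data/test.pdf")` should return the analysis for the mounted test file.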
Typical outputs
- Text & metadata (Tika + optional OCR)
- Barcodes & QR Codes
- PDF/A status & validation
- Hashes / Checksums
- Classification & extraction by rules
- AI-based structured data extraction (AI Inspector)
Inspector Overview
Choose the appropriate inspectors for your use case: OCR/Tika, barcodes, checksums, PDF/A, conversion, classification and AI extraction.
Inspector: Barcode
The inspector reads barcodes from documents, which helps identify and categorize them quickly. It supports various barcode formats.
Inspector: OCR and Tika
With the help of Tika and Tesseract, all metadata and the "content" of the file/document are extracted and analyzed. This enables full-text search, metadata inspection, and advanced document processing.
Inspector: Classification and extraction
Based on the document content, the inspector can classify the document and extract specific information. This is useful for automating workflows and processing documents based on their content.
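To make the idea concrete, here is a minimal sketch of rule-based classification and extraction over recognized text. The rule structure, labels, field names and regular expressions are all hypothetical and are not FileGazer's rule format. They only show the general pattern: match a document class first, then pull fields for that class.

```python
import re

# Hypothetical rules: a document class plus regexes for the fields to extract.
RULES = [
    {
        "label": "invoice",
        "match": re.compile(r"\binvoice\b", re.IGNORECASE),
        "fields": {
            "invoice_no": re.compile(
                r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
            ),
            "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
        },
    },
    {
        "label": "delivery_note",
        "match": re.compile(r"\bdelivery note\b", re.IGNORECASE),
        "fields": {},
    },
]


def classify_and_extract(text, rules=RULES):
    """Return the first matching label and the fields its regexes find."""
    for rule in rules:
        if rule["match"].search(text):
            fields = {}
            for name, pattern in rule["fields"].items():
                found = pattern.search(text)
                if found:
                    fields[name] = found.group(1)
            return {"label": rule["label"], "fields": fields}
    return {"label": "unknown", "fields": {}}
```

The returned label is what a flow would typically route on, for example sending invoices to one queue and everything labeled `unknown` to manual review.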
Inspector: AI Analysis
The inspector uses the text recognized via OCR to send one or more queries to various AI providers (e.g., OpenAI, Claude, Gemini, Ollama, or LM Studio). This allows targeted information to be extracted from the document, such as invoice data, amounts, IBANs, summaries, or classifying characteristics, and provided as structured results for downstream workflows.
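Values extracted by an AI provider are worth a sanity check before they enter a downstream workflow. As one example, the sketch below validates an extracted IBAN with the standard mod-97 check. It is not part of FileGazer, the function name is hypothetical, and it checks only the general structure and checksum, not country-specific lengths.

```python
import re

# Country code, two check digits, then 11 to 30 alphanumeric characters.
IBAN_PATTERN = re.compile(r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$")


def is_valid_iban(value):
    """Check the general IBAN structure and the mod-97 checksum."""
    iban = value.replace(" ", "").upper()
    if not IBAN_PATTERN.match(iban):
        return False
    # Move the first four characters to the end.
    rearranged = iban[4:] + iban[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A field that fails this check can be flagged for manual review instead of being passed on as trusted data.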
Inspector: PDF Analysis
The inspector analyzes a PDF document and reads all available metadata. This includes PDF/A compliance, form fields, "Fast View", viewer settings, encryption, and other elements of the PDF, giving insight into the document's structure and content.
Inspector: Mimetype
The inspector identifies the mimetype of the document, which is essential for understanding the type of content and how it should be processed. This helps in categorizing and managing documents effectively.
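Detection of this kind usually relies on magic numbers, the fixed byte sequences at the start of a file. The following sketch shows the principle with a handful of well-known signatures. It does not show how FileGazer implements detection, and the function names are hypothetical.

```python
# A few well-known file signatures (magic numbers); real detectors know many more.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),  # also the container for DOCX, XLSX, ODT
]


def sniff_mimetype(data):
    """Guess the MIME type from the leading bytes of a file."""
    for magic, mimetype in SIGNATURES:
        if data.startswith(magic):
            return mimetype
    return "application/octet-stream"  # generic fallback for unknown content


def sniff_file(path):
    """Read only the first bytes of a file and guess its MIME type."""
    with open(path, "rb") as handle:
        return sniff_mimetype(handle.read(16))
```

Because the check looks at content rather than the file name, a PDF renamed to `.txt` is still reported as `application/pdf`.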
Inspector: Checksum
The inspector calculates a checksum for the document, which is important for verifying the integrity of the file. This ensures that the document has not been altered or corrupted during processing or storage.
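An integrity check on the consuming side can be sketched as follows: compute the digest of the stored file and compare it with the checksum recorded earlier. SHA-256 is an assumption here, since the algorithms FileGazer offers are not listed above, and the function names are hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True if the file still matches the previously recorded checksum."""
    return file_checksum(path, algorithm) == expected.lower()
```

A mismatch means the file changed after the checksum was recorded, whether through corruption or modification.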
Inspector: PDF Conversion
The inspector converts documents to PDF, a widely used format for sharing and archiving, so that they stay accessible and keep their formatting across platforms. Conversion is handled by Gotenberg (www.gotenberg.org), which provides an API for document conversion.
Inspector: Base information
The inspector collects basic information about the document, such as its size, timestamps, content and creation date. This information is essential for managing and organizing documents effectively.
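For comparison, this is roughly the information the file system itself exposes. The sketch uses Python's `os.stat`; the dictionary keys are hypothetical and do not mirror FileGazer's output. Note that a true creation time is only available on some platforms, so it may be `None`.

```python
import os
from datetime import datetime, timezone


def _iso(timestamp):
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def base_info(path):
    """Collect name, size and timestamps of a file."""
    stat = os.stat(path)
    # st_birthtime only exists on some platforms (e.g. macOS, BSD).
    created = getattr(stat, "st_birthtime", None)
    return {
        "name": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified": _iso(stat.st_mtime),
        "created": _iso(created) if created is not None else None,
    }
```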
Inspector: EU Core Invoice (XRechnung)
This inspector checks whether a PDF document complies with the standard for machine-readable electronic invoices and extracts all invoice data.
