FileGazer

Deep File Inspection Service. Modular REST service for OCR/text and metadata (Tika + OCR), barcodes, hashes, PDF/A checks, and conversion, with rule-based classification and extraction as well as AI-powered data extraction from documents (AI Inspector).

Open Source, REST API, Docker, NiFi, AI Extraction

What you can solve with it

FileGazer is modular: choose the right inspectors per file and get a structured analysis result.

ECM/DMS & Archiving

Ingest documents, extract OCR/text & metadata, find index fields (e.g., invoice no., date, IBAN). Validate PDF/A and convert when needed.

Automation (NiFi / ETL)

Use FileGazer as a building block in flows: classify, extract, and route. Ideal for scan ingest, mailroom, and batch processing.

Compliance & Quality

PDF/A validation, hashes for integrity, magic number / file type detection. Reproducible checks for audits and pipelines.

Developer Workflow

The REST API returns the analysis as XML. Transform it via XSLT to JSON, CSV, or a custom format. The service is extensible through the inspector concept.
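As an alternative to XSLT, the XML result can be reshaped in a few lines of script. The following is a minimal sketch only: the sample XML, its element names, and the helper names are hypothetical, since the actual FileGazer result schema is not shown here. Attributes are prefixed with "@" and element text is stored under "#text" so they cannot collide with child tag names.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical analysis result; the real FileGazer XML schema may differ.
SAMPLE_XML = """<analysis>
  <file name="test.pdf" size="1024"/>
  <metadata>
    <entry key="Content-Type">application/pdf</entry>
    <entry key="pdf:PDFVersion">1.7</entry>
  </metadata>
</analysis>"""


def element_to_dict(element):
    """Convert one XML element into a plain dict (attributes, text, children)."""
    node = {f"@{name}": value for name, value in element.attrib.items()}
    text = (element.text or "").strip()
    if text:
        node["#text"] = text
    for child in element:
        # Children are always collected in a list under their tag name,
        # so repeated tags and single tags have the same shape.
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


def xml_to_json(xml_text):
    """Parse an XML string and return it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)
```

Calling xml_to_json(SAMPLE_XML) yields a nested JSON object keyed by the root tag. A real integration would replace SAMPLE_XML with the response body of an inspection call.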

Get your first result in 5 minutes

Start FileGazer (and optionally Gotenberg for conversions), mount your folder and run an inspection.

1. Start Docker Compose
   docker compose up -d
2. Mount your files
   Create a folder such as ./data and mount it to /data.
3. Run an inspection
   curl -X POST "http://localhost:8080/file/inspect/local?inspector=TIKA" --data-raw "/data/test.pdf"
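The same inspection call can be issued from a script. This sketch mirrors the curl command using only the Python standard library. The endpoint, query parameter, and request body come from the command above; the function names, the timeout, and the assumption that the response is UTF-8 text are illustrative.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port taken from the curl example


def build_inspect_request(path, inspector="TIKA", base_url=BASE_URL):
    """Build the POST request for /file/inspect/local without sending it."""
    query = urllib.parse.urlencode({"inspector": inspector})
    url = f"{base_url}/file/inspect/local?{query}"
    # Like curl --data-raw, the file path is sent as the raw request body.
    return urllib.request.Request(url, data=path.encode("utf-8"), method="POST")


def inspect_local(path, inspector="TIKA", timeout=60):
    """Send the request and return the response body as text.

    Assumes the service answers with UTF-8 encoded text (for example XML).
    """
    request = build_inspect_request(path, inspector)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, inspect_local("/data/test.pdf") would return the analysis for the mounted file. Splitting request construction from sending keeps the URL logic testable without a running service.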

Typical outputs

  • Text & metadata (Tika + optional OCR)
  • Barcodes & QR Codes
  • PDF/A Status & PDF/A validation
  • Hashes / Checksums
  • Rule-based classification & extraction
  • AI-based structured data extraction (AI Inspector)

Inspector Overview

Choose the appropriate inspectors for your use case: OCR/Tika, barcodes, checksums, PDF/A, conversion, classification and AI extraction.

Inspector: Barcode

The inspector can read barcodes from documents, which is useful for identifying and categorizing documents quickly. This feature supports various barcode formats, enhancing document management efficiency.

Inspector: OCR and Tika

Using Tika and Tesseract, this inspector extracts and analyzes the metadata and the text content of a file or document. This enables full-text search, metadata inspection, and advanced document processing.

Inspector: Classification and extraction

Based on the document content, the inspector can classify the document and extract specific information. This is useful for automating workflows and processing documents based on their content.
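To illustrate the idea of rule-based classification and extraction, here is a small standalone sketch. It does not use FileGazer's own rule format, which is not shown here: the keyword table, the regular expressions, and all function names are hypothetical. IBAN candidates are confirmed with the standard mod-97 check so that look-alike strings are dropped.

```python
import re

# Hypothetical rules for illustration; not FileGazer's rule syntax.
CLASS_KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "contract": ("agreement", "contract"),
}
INVOICE_NO_PATTERN = re.compile(
    r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
)
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")  # ISO dates only
IBAN_PATTERN = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def classify(text):
    """Return the first class whose keywords occur in the text."""
    lowered = text.lower()
    for label, keywords in CLASS_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "unknown"


def is_valid_iban(candidate):
    """Check an IBAN with the mod-97 rule (country-specific lengths ignored)."""
    iban = candidate.replace(" ", "").upper()
    if not 15 <= len(iban) <= 34 or not re.fullmatch(r"[A-Z0-9]+", iban):
        return False
    rearranged = iban[4:] + iban[:4]
    # Letters map to numbers: A=10 ... Z=35, which is what base 36 gives.
    digits = "".join(str(int(char, 36)) for char in rearranged)
    return int(digits) % 97 == 1


def extract_fields(text):
    """Extract invoice number, date, and the first valid IBAN, if present."""
    fields = {}
    number = INVOICE_NO_PATTERN.search(text)
    if number:
        fields["invoice_no"] = number.group(1)
    date = DATE_PATTERN.search(text)
    if date:
        fields["date"] = date.group(1)
    candidates = (m.group(0).replace(" ", "") for m in IBAN_PATTERN.finditer(text))
    valid = [iban for iban in candidates if is_valid_iban(iban)]
    if valid:
        fields["iban"] = valid[0]
    return fields
```

In a workflow, the class returned by classify could decide where a document is routed, and the dictionary from extract_fields could supply the index fields.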

Inspector: AI Analysis

The inspector sends the text recognized via OCR, together with one or more queries, to a configurable AI provider (e.g., OpenAI, Claude, Gemini, Ollama, or LM Studio). This allows targeted information to be extracted from the document, such as invoice data, amounts, IBANs, summaries, or classifying characteristics, and provided as structured results for downstream workflows.

Inspector: PDF Analysis

Analyzes a PDF document and reads all available metadata. This includes checking compliance with PDF standards (PDF/A) and reporting form fields, "Fast View", viewer settings, encryption, and other elements of the PDF, giving insight into the document's structure and content.

Inspector: Mimetype

The inspector identifies the mimetype of the document, which is essential for understanding the type of content and how it should be processed. This helps in categorizing and managing documents effectively.
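File type detection of this kind usually relies on magic numbers, the fixed byte sequences at the start of a file. The sketch below shows the principle with a handful of well-known signatures. It is an illustration only, not FileGazer's detection logic, and the table and function names are made up for this example.

```python
# A few well-known file signatures (magic numbers) and their MIME types.
MAGIC_NUMBERS = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),  # also the container for DOCX, XLSX, ODT
)


def sniff_mimetype(header, default="application/octet-stream"):
    """Guess a MIME type from the first bytes of a file."""
    for signature, mimetype in MAGIC_NUMBERS:
        if header.startswith(signature):
            return mimetype
    return default


def sniff_file(path):
    """Read the first bytes of a file and guess its MIME type."""
    with open(path, "rb") as handle:
        return sniff_mimetype(handle.read(16))
```

Because the check looks at content rather than the file extension, a PDF renamed to .txt is still reported as application/pdf. ZIP-based office formats need a closer look inside the archive, which this sketch does not attempt.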

Inspector: Checksum

The inspector calculates a checksum for the document, which is important for verifying the integrity of the file. This ensures that the document has not been altered or corrupted during processing or storage.
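The integrity check described here can be reproduced independently of the service. The following sketch computes a checksum with Python's hashlib and compares it with a previously stored value. SHA-256 is an assumption chosen for the example, since the algorithms the inspector supports are not listed here, and the function names are illustrative.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True if the file still matches a previously recorded checksum."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

Storing the digest at ingest time and calling verify_checksum later shows whether the file changed in between: any modification, even a single byte, produces a different digest.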

Inspector: PDF Conversion

The inspector can convert documents to PDF, a widely used format for sharing and archiving. The converted document keeps its formatting across platforms. Conversion is handled by Gotenberg (www.gotenberg.org), which provides an API for document conversion.

Inspector: Base information

The inspector collects basic information about the document, such as its size, content, and timestamps (for example the creation date). This information is essential for managing and organizing documents effectively.

Inspector: EU Core Invoice (XRechnung)

This inspector checks whether a PDF document complies with the standard for machine-readable electronic invoices (EU core invoice, XRechnung) and extracts all invoice data.