✨ Getting started

Welcome to the official documentation of FileGazer – your open source platform for in-depth file analysis, classification, and automated workflows.

✨ Introduction

FileGazer is a lightweight, modular file analysis tool. It uses various "inspectors" to extract content, metadata, barcodes, hashes, and much more.

You can define your own classification rules, integrate Groovy scripts, and create REST-based analysis processes.

When you call the FileGazer service, you specify which inspectors should be active, and you get back an XML document with all the information FileGazer can find. If needed, you can supply an XSLT file that transforms the FileGazer Result XML into the output you need, so FileGazer serves exactly the format you require.

🔬 Online Test Environment

🌐 Live Demo – Try FileGazer Instantly! Curious how FileGazer works without installing anything?
Head over to the online test environment:

👉 🔬 FileGazer Demo-Site

Includes pre-installed modules: Gotenberg & Tesseract
Great for quick testing, debugging, and ad-hoc file analysis.

💬 If you experience issues: [email protected]

🎓 Installation & Setup

Requirements: Java 21; Tesseract OCR (optional, for OCR); Docker (optional)

Start JAR:

java -jar FileGazer-1.0.3.jar

Start Docker:

docker run -d -p 8080:8080 samoak/filegazer:latest

Then open

http://localhost:8080

to analyse some files and see what the FileGazer Result XML looks like.

📁 Inspectors

Inspectors are specialized modules that examine specific aspects of a file:

BASE: Basic data (Name, Size, Timestamps)
MAGICNUMBER: File type based on the magic bytes
HASHCODE: Checksum (SHA256, MD5, etc.)
BARCODE: Barcode scan (PDF, PNG, JPG)
TIKA: Content & metadata via Apache Tika (+ OCR with Tesseract)
CONTENTANALYSE: Rule-based document classification & indexing
AIANALYSE: AI classification and extraction with multiple AI providers (OpenAI, Claude, Gemini, Ollama, ...)
PDFANALYSER: PDF analysis and PDF/A checking
PDFCONVERT: Converts Office, image formats, etc. to PDF/A (via Gotenberg)
EUCOREINVOICE: Checks PDFs for XRechnung/ZUGFeRD
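
To make the MAGICNUMBER and HASHCODE entries more concrete, here is a minimal Python sketch of the underlying ideas: guessing a file type from its leading bytes and computing checksums. It is not FileGazer's implementation. The signature table and the function names `sniff_type` and `checksums` are assumptions made for illustration.

```python
import hashlib

# A few well-known file signatures (magic bytes). Illustrative subset only,
# not the list FileGazer uses.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def sniff_type(data: bytes) -> str:
    """Guess a MIME type from the leading bytes; fall back to a generic type."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"


def checksums(data: bytes) -> dict[str, str]:
    """Return the SHA-256 and MD5 hex digests of the given bytes."""
    return {
        "SHA256": hashlib.sha256(data).hexdigest(),
        "MD5": hashlib.md5(data).hexdigest(),
    }
```

Because the type is taken from the content rather than the file name, a PDF renamed to `.txt` would still be recognized as a PDF by this kind of check.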

🔍 Analysing files/documents

Upload mode:

curl --request POST "http://localhost:8080/file/inspect/upload" \
  --form "file=@/path/to/file.pdf" \
  --form "inspector=TIKA,CONTENTANALYSE"
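
The same upload request can be sketched in Python using only the standard library. The endpoint and the form field names `file` and `inspector` come from the curl command. The helper names `build_multipart` and `inspect_upload`, the `application/octet-stream` part type, and the UTF-8 decoding of the response are assumptions, not part of FileGazer.

```python
import urllib.request
import uuid
from pathlib import Path

# Assumption: FileGazer listens locally on port 8080, as in the curl example.
UPLOAD_URL = "http://localhost:8080/file/inspect/upload"


def build_multipart(filename: str, content: bytes,
                    inspectors: list[str]) -> tuple[bytes, str]:
    """Encode the form fields "file" and "inspector" as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    inspector_value = ",".join(inspectors)
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    inspector_part = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="inspector"\r\n\r\n'
        f"{inspector_value}\r\n"
        f"--{boundary}--\r\n"
    )
    body = file_header.encode("utf-8") + content + inspector_part.encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def inspect_upload(path: str, inspectors: list[str], url: str = UPLOAD_URL) -> str:
    """POST the file to FileGazer and return the response body as text."""
    file_path = Path(path)
    body, content_type = build_multipart(
        file_path.name, file_path.read_bytes(), inspectors
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `inspect_upload("/path/to/file.pdf", ["TIKA", "CONTENTANALYSE"])` should send a request equivalent to the curl command and return the FileGazer Result XML as a string.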

Local mode:

curl --request POST "http://localhost:8080/file/inspect/local?inspector=BASE,TIKA" \
  --data-raw "/path/to/file.txt"
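
A Python sketch of the local-mode call, again standard library only. The query parameter `inspector` and the raw request body holding the file path are taken from the curl command. The helper names are hypothetical, and the path is presumably resolved on the machine where FileGazer runs.

```python
import urllib.parse
import urllib.request

# Assumption: FileGazer listens locally on port 8080, as in the curl example.
BASE_URL = "http://localhost:8080"


def local_inspect_request(path_on_server: str, inspectors: list[str],
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request: inspectors in the query string, path as raw body."""
    # safe="," keeps the comma-separated inspector list readable in the URL.
    query = urllib.parse.urlencode({"inspector": ",".join(inspectors)}, safe=",")
    # No Content-Type is set explicitly; urllib then sends
    # application/x-www-form-urlencoded, the same default curl uses for --data-raw.
    return urllib.request.Request(
        f"{base_url}/file/inspect/local?{query}",
        data=path_on_server.encode("utf-8"),
        method="POST",
    )


def inspect_local(path_on_server: str, inspectors: list[str]) -> str:
    """Send the request and return the response body as text."""
    request = local_inspect_request(path_on_server, inspectors)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything goes over the network.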

🏃 Automating & Scripting

Groovy scripts can run at startup, at shutdown, or cyclically via cron expressions.
Scripts can be prepared (Prepare) or validated (Validate).
Configuration is done in FileGazerScripts.xml and FileGazerContentAnalyse.xml.

⚙️ REST-API

Endpoint	Description
`/file/inspect/upload`	File upload & analysis
`/file/inspect/local`	Analyse local file
`/execute/{script}`	Execute script

📋 PDF/A & Converting

Supports PDF/A-1b, PDF/A-2b, and PDF/A-3b
Supported formats: PDF, DOCX, PNG, TIFF, ODT, HTML, and many more
Conversion via the Gotenberg service
Reads all PDF information

📚 Examples

Invoice Recognition: Classifies PDFs as invoices, delivery notes, etc.
Barcode Indexing: Extracts QR codes and saves them as metadata
OCR Archiving: Automatic full-text recognition and PDF/A generation
E-Rechnung: Validates PDF files as e-invoices (E-Rechnung) and extracts the full ZUGFeRD XML for further processing

Docker

In addition to the JAR file, FileGazer can also be pulled and run as a Docker image. The latest image is available at

https://hub.docker.com/r/samoak/filegazer

This image contains both the correct Java version and a Tesseract installation.

docker run -d -v /home/myUser/filegazer/log:/home/filegazer/log -v /home/myUser/filegazer/processing:/home/filegazer/processing -p 8080:8080 samoak/filegazer:latest

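This mounts the host directories /home/myUser/filegazer/log and /home/myUser/filegazer/processing into the container at /home/filegazer/log and /home/filegazer/processing, so logs and processing data are available on the host.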

Docker-Compose

The FileGazer Docker image contains Tesseract and all the settings needed to get FileGazer up and running. It does not include Gotenberg, which is available exclusively as a Docker image. The following Docker Compose file starts both FileGazer and Gotenberg as individual containers.

version: '3.8'
services:

  gotenberg:
    container_name: gotenberg
    image: docker.io/gotenberg/gotenberg:latest
    ports:
      - '3000:3000'
    restart: unless-stopped
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  filegazer:
    depends_on:
      gotenberg:
        condition: service_healthy
    container_name: filegazer
    image: samoak/filegazer:latest
    environment:
      - OPENAI_API_KEY=xxxxxxreplace_with_your_own_xxxxxxxxx
      - GEMINI_API_KEY=xxxxxxreplace_with_your_own_xxxxxxxxx
      - CLAUDE_API_KEY=xxxxxxreplace_with_your_own_xxxxxxxxx
    ports:
      - '8080:8080'
    # Adjust the directories to your needs
    volumes:
      - ./filegazer/log:/home/filegazer/log
      - ./filegazer/processing:/home/filegazer/processing
      - ./filegazer/etc:/home/filegazer/etc
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
      interval: 10s
      timeout: 5s
      retries: 5

networks:
  default:
    name: filegazer_net

Start the containers with:

sudo docker-compose up -d
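
After starting the containers, a script may want to wait until FileGazer is ready before sending files. The following Python sketch polls the same `/actuator/health` URL that the Compose healthcheck uses and treats an HTTP 200 response as healthy. The function name `wait_until_healthy` and the retry defaults are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

# Taken from the healthcheck in the Compose file; adjust host and port
# if you changed the port mapping.
HEALTH_URL = "http://localhost:8080/actuator/health"


def wait_until_healthy(url: str = HEALTH_URL, attempts: int = 30,
                       delay: float = 2.0) -> bool:
    """Poll the URL until it answers with HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not reachable yet, or it returned an error status
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller could run `wait_until_healthy()` once after `docker-compose up -d` and only start uploading files when it returns `True`.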
