✨ Getting started
Welcome to the official documentation of FileGazer – your open source platform for in-depth file analysis, classification, and automated workflows.
✨ Introduction
FileGazer is a lightweight, modular file analysis tool. It uses various "inspectors" to extract content, metadata, barcodes, hashes, and much more.
You can define your own classification rules, integrate Groovy scripts, and create REST-based analysis processes.
When you call the FileGazer service, you specify which inspectors should be active, and you get back an XML document with all the information FileGazer can find. If needed, you can define an XSLT file that transforms the FileGazer Result XML into the output you need, so FileGazer serves you exactly what you need.
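The layout of the FileGazer Result XML is not described in this section, so a useful first step is to see which elements a response actually contains. The following Python sketch lists every element path in an XML string. The sample document and its element names are hypothetical and only stand in for a real response.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text: str) -> list[str]:
    """Return the slash-separated path of every element, in document order."""
    root = ET.fromstring(xml_text)
    paths: list[str] = []

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


# Hypothetical sample; a real FileGazer result will use different element names.
SAMPLE = "<result><base><name>a.pdf</name><size>12</size></base><hash/></result>"

if __name__ == "__main__":
    for path in element_paths(SAMPLE):
        print(path)
```

Running this against a saved response gives a quick overview of the structure before you write an XSLT file for it.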
🔬 Online Test Environment
🌐 Live Demo – Try FileGazer Instantly!
Curious how FileGazer works without installing anything?
Head over to the online test environment:
Includes pre-installed modules:
Gotenberg & Tesseract
Great for quick testing, debugging, and ad-hoc file analysis.
💬 If you experience issues: [email protected]
🎓 Installation & Setup
Requirements: Java 21, Tesseract OCR (if OCR is needed), Docker (optional)
Start JAR:
java -jar FileGazer-1.0.3.jar
Start Docker:
docker run -d -p 8080:8080 samoak/filegazer:latest
Open
http://localhost:8080
to analyse some files and see what the FileGazer Result XML looks like.
📁 Inspectors
Inspectors are specialized modules that examine specific aspects of a file:
- BASE: Basic data (name, size, timestamps)
- MAGICNUMBER: File type based on the magic bytes
- HASHCODE: Checksum (SHA256, MD5, etc.)
- BARCODE: Barcode scan (PDF, PNG, JPG)
- TIKA: Content & metadata via Apache Tika (+ OCR with Tesseract)
- CONTENTANALYSE: Rule-based document classification & indexing
- AIANALYSE: AI classification and extraction with multiple AI providers (OpenAI, Claude, Gemini, Ollama, ...)
- PDFANALYSER: PDF analysis and PDF/A checking
- PDFCONVERT: Converts Office, image formats, etc. to PDF/A (via Gotenberg)
- EUCOREINVOICE: Checks PDFs for XRechnung/ZUGFeRD
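Requests select inspectors through a comma-separated `inspector` value such as `TIKA,CONTENTANALYSE`. A small client-side helper can catch misspelled names before a request is sent. This is a sketch only: the helper name is hypothetical, and upper-casing the input is a convenience of this example, not documented server behaviour.

```python
# Inspector names as listed in this section.
KNOWN_INSPECTORS = {
    "BASE", "MAGICNUMBER", "HASHCODE", "BARCODE", "TIKA", "CONTENTANALYSE",
    "AIANALYSE", "PDFANALYSER", "PDFCONVERT", "EUCOREINVOICE",
}


def inspector_param(names: list[str]) -> str:
    """Join inspector names into the comma-separated request value.

    Names are upper-cased and checked against KNOWN_INSPECTORS; duplicates
    are dropped while the caller's order is kept.
    """
    selected: list[str] = []
    for raw in names:
        name = raw.strip().upper()
        if name not in KNOWN_INSPECTORS:
            raise ValueError(f"unknown inspector: {raw!r}")
        if name not in selected:
            selected.append(name)
    return ",".join(selected)


if __name__ == "__main__":
    print(inspector_param(["tika", "CONTENTANALYSE"]))
```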
🔍 Analysing files/documents
Upload mode:
curl --request POST "http://localhost:8080/file/inspect/upload" \
--form "file=@/path/to/file.pdf" \
--form "inspector=TIKA,CONTENTANALYSE"
Local mode:
curl --request POST "http://localhost:8080/file/inspect/local?inspector=BASE,TIKA" \
--data-raw "/path/to/file.txt"
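The two curl calls above can also be issued from a script. The sketch below builds equivalent requests with Python's standard library only. It assumes the service runs on `http://localhost:8080`; the function names are hypothetical, and the multipart part type `application/octet-stream` is a choice of this example, not something the service is known to require.

```python
import urllib.parse
import urllib.request
import uuid
from pathlib import Path

BASE_URL = "http://localhost:8080"  # assumption: default port used in this README


def local_request(server_path: str, inspectors: str) -> urllib.request.Request:
    """Build the POST for /file/inspect/local; the body is the raw file path."""
    query = urllib.parse.urlencode({"inspector": inspectors}, safe=",")
    # No Content-Type is set here: urllib then sends the same default as
    # curl --data-raw (application/x-www-form-urlencoded).
    return urllib.request.Request(
        f"{BASE_URL}/file/inspect/local?{query}",
        data=server_path.encode("utf-8"),
        method="POST",
    )


def upload_request(file_path: Path, inspectors: str) -> urllib.request.Request:
    """Build the multipart POST for /file/inspect/upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="inspector"\r\n\r\n'
        f"{inspectors}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/file/inspect/upload",
        data=head + file_path.read_bytes() + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(send(local_request("/path/to/file.txt", "BASE,TIKA")))
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything goes over the network.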
🏃 Automating & Scripting
- Groovy scripts can run at startup, at shutdown, or cyclically via cron expressions.
- Scripts can be prepared (Prepare) or validated (Validate)
- Configured via: FileGazerScripts.xml and FileGazerContentAnalyse.xml
⚙️ REST-API
| Endpoint | Description |
|---|---|
| /file/inspect/upload | File upload & analysis |
| /file/inspect/local | Analyse local file |
| /execute/{script} | Execute script |
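For client code it can help to keep the three endpoint paths in one place and fill in the `{script}` placeholder safely. The following Python sketch only builds URLs; it does not send anything, because the HTTP method expected by `/execute/{script}` is not stated in this section. The helper name and the script name in the usage line are hypothetical.

```python
from urllib.parse import quote

# Paths as listed in the endpoint table.
ENDPOINTS = {
    "upload": "/file/inspect/upload",
    "local": "/file/inspect/local",
    "execute": "/execute/{script}",
}


def endpoint_url(base_url: str, name: str, script: str = "") -> str:
    """Return the full URL for an endpoint, percent-encoding the script name."""
    path = ENDPOINTS[name]
    if "{script}" in path:
        if not script:
            raise ValueError("the execute endpoint needs a script name")
        path = path.replace("{script}", quote(script, safe=""))
    return base_url.rstrip("/") + path


if __name__ == "__main__":
    # "cleanup" is a made-up script name for illustration.
    print(endpoint_url("http://localhost:8080", "execute", "cleanup"))
```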
📋 PDF/A & Converting
- Supports PDF/A-1b, PDF/A-2b, and PDF/A-3b
- Supported formats: PDF, DOCX, PNG, TIFF, ODT, HTML, and many more
- Conversion via the Gotenberg service
- Reads all PDF information
📚 Examples
- Invoice Recognition: Classifies PDFs as invoices, delivery notes, etc.
- Barcode Indexing: Extracts QR codes and saves them as metadata
- OCR Archiving: Automatic full-text recognition and PDF/A generation
- E-Rechnung: Validates PDF files as E-Rechnung (e-invoice) documents and extracts the full ZUGFeRD XML for further processing
Docker
In addition to the JAR file, FileGazer can also be pulled and run as a Docker image. The latest image is available at
https://hub.docker.com/r/samoak/filegazer
This image contains both the correct Java version and a Tesseract installation.
docker run -d -v /home/myUser/filegazer/log:/home/filegazer/log -v /home/myUser/filegazer/processing:/home/filegazer/processing -p 8080:8080 samoak/filegazer:latest
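This mounts the host directories /home/myUser/filegazer/log and /home/myUser/filegazer/processing into the container at /home/filegazer/log and /home/filegazer/processing, so that logs and processing data are available on the host.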
Docker-Compose
The FileGazer Docker image contains Tesseract and all the settings needed to get FileGazer up and running. What this image does not include is Gotenberg, which is available exclusively as a Docker image. The following Docker Compose file starts both FileGazer and Gotenberg as individual containers.
version: '3.8'
services:
  gotenberg:
    container_name: gotenberg
    image: docker.io/gotenberg/gotenberg:latest
    ports:
      - '3000:3000'
    restart: unless-stopped
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 5
  filegazer:
    depends_on:
      gotenberg:
        condition: service_healthy
    container_name: filegazer
    image: samoak/filegazer:latest
    environment:
      - OPENAI_API_KEY=xxxxxxreplace_with_your_own_xxxxxxxxx
      - GEMINI_API_KEY=xxxxxxreplace_with_your_own_xxxxxxxxx
      - CLAUDE_API_KEY=xxxxxxreplace_with_your_own_xxxxxxxxx
    ports:
      - '8080:8080'
    # Adjust the directories to your needs
    volumes:
      - ./filegazer/log:/home/filegazer/log
      - ./filegazer/processing:/home/filegazer/processing
      - ./filegazer/etc:/home/filegazer/etc
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
      interval: 10s
      timeout: 5s
      retries: 5
networks:
  default:
    name: filegazer_net
Start the containers with:
sudo docker-compose up -d
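After the containers start, FileGazer may need a moment before it accepts requests. The Compose file's health check polls `http://localhost:8080/actuator/health` every 10 seconds with up to 5 retries, and a client script can wait in the same way before submitting files. The sketch below mirrors those values; the function names are hypothetical, and treating any HTTP 200 answer as "healthy" is an assumption of this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/actuator/health"  # from the Compose health check


def http_ok(url: str) -> bool:
    """Return True if the URL answers with HTTP 200 within five seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    probe: Callable[[str], bool] = http_ok,
    url: str = HEALTH_URL,
    retries: int = 5,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the URL up to `retries` times, sleeping `interval` seconds between tries."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False


if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "not reachable")
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running container or real delays.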