ML-Powered PDF Redaction for Java — Remove PII from Any PDF
Permanent PII removal with audit trails. ML-powered detection across 20+ entity types with confidence scoring. Built to support HIPAA, GDPR, and CCPA compliance.
The Problem
Why PDF Redaction Is Harder Than It Looks
Finding PII is hard. Regex catches patterns like SSNs, but misses context-dependent data like names and addresses. You need ML to close that gap.
Removing it is harder. PDFs weren't built for editing — what looks like "John Smith" on screen might be scattered across multiple internal objects. Most tools just draw black boxes over text, but the original content stays in the file.
Get either side wrong and you have a compliance gap.
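To make the detection gap concrete, here is a minimal sketch of pattern-only detection in plain Java. It uses only `java.util.regex`, not the PDFDancer API. The class name `RegexPiiDemo`, the method `findSsns`, and the sample sentence are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical pattern-based detector, not PDFDancer code.
public class RegexPiiDemo {
    // Matches US SSNs written as 123-45-6789; deliberately simple.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    /** Returns every substring of the text that looks like a formatted SSN. */
    public static List<String> findSsns(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Patient John Smith, SSN 123-45-6789, lives at 42 Elm Street.";
        // The SSN is found; the name and the address have no fixed shape to match.
        System.out.println(findSsns(text)); // prints [123-45-6789]
    }
}
```

The pattern finds the SSN because it has a fixed shape. "John Smith" and "42 Elm Street" pass through untouched, which is the kind of context-dependent data that needs a model rather than a pattern.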
The Limitations
- Pattern matching alone misses context-dependent PII like names and addresses
- Overlay-based redaction hides text visually but doesn't remove it from the file
- No confidence scoring — you can't tell good detections from false positives
- No audit trail — you can't prove what was removed or when
What PDFDancer Changes
- ML-powered detection — context-aware entity recognition across 20+ PII types
- True binary-level removal — content permanently deleted, not covered up
- Confidence scores — filter detections by threshold to control precision vs. recall
- Audit trails — verifiable proof of what was redacted and when
See It in Action
HIPAA-Compliant PII Redaction in Java
ML-powered entity detection across 20+ PII categories with confidence scoring. Filter by threshold to control precision vs. recall.
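As a rough illustration of threshold filtering, the sketch below keeps only detections at or above a chosen confidence. It does not use the PDFDancer API: the `Detection` record, the `atOrAbove` helper, the entity labels, and the scores are all hypothetical, chosen only to show how the threshold trades precision against recall.

```java
import java.util.List;

// Illustrative only: hypothetical types, not the PDFDancer API.
public class ThresholdFilterDemo {
    /** A hypothetical detection: entity type, matched text, model confidence in [0, 1]. */
    public record Detection(String entityType, String text, double confidence) {}

    /** Keeps only the detections whose confidence is at or above the threshold. */
    public static List<Detection> atOrAbove(List<Detection> detections, double threshold) {
        return detections.stream()
                .filter(d -> d.confidence() >= threshold)
                .toList();
    }

    public static void main(String[] args) {
        // Made-up sample scores for demonstration.
        List<Detection> detections = List.of(
                new Detection("PERSON", "John Smith", 0.98),
                new Detection("ADDRESS", "42 Elm Street", 0.81),
                new Detection("PERSON", "Elm", 0.42));

        // A high threshold favors precision: fewer, more certain redactions.
        System.out.println(atOrAbove(detections, 0.90).size()); // prints 1
        // A lower threshold favors recall: more redactions, more false positives.
        System.out.println(atOrAbove(detections, 0.50).size()); // prints 2
    }
}
```

Raising the threshold drops borderline hits such as the stray "Elm" match. Lowering it catches more real PII but lets more false positives through.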
Performance
Redaction Accuracy Benchmarks
Real benchmark results from PDFDancer's automated redaction engine on common HIPAA entity categories. You control the confidence threshold and decide what to redact.
| HIPAA Entity Category | Precision | Recall | F1 Score |
|---|---|---|---|
| Person | 97.43% | 96.28% | 0.969 |
| Dates of Birth | 100.00% | 92.57% | 0.961 |
| Account Number / SSN | 85.27% | 93.93% | 0.894 |
| Addresses | 99.43% | 91.22% | 0.951 |
| Phone / Fax Numbers | 94.12% | 96.30% | 0.952 |
| Email Addresses | 99.58% | 99.98% | 0.998 |
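
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the table's Person row as a sanity check. The class name `F1Demo` and the method `f1` are illustrative and not part of any PDFDancer API.

```java
import java.util.Locale;

// Illustrative only: recomputes F1 from precision and recall given as fractions.
public class F1Demo {
    /** Harmonic mean of precision and recall; returns 0 when both are 0. */
    public static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Person row from the table: precision 97.43%, recall 96.28%.
        double person = f1(0.9743, 0.9628);
        System.out.println(String.format(Locale.ROOT, "%.3f", person)); // prints 0.969
    }
}
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two numbers. That is why the Account Number / SSN row, with precision of 85.27%, has the lowest F1 in the table despite a recall of 93.93%.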
Let’s Talk About Your Use Case
15-minute call — we’ll walk through your document pipeline and show how PDFDancer fits.