Java Redaction SDK

ML-Powered PDF Redaction for Java — Remove PII from Any PDF

Permanent PII removal with audit trails. ML-powered detection across 20+ entity types with confidence scoring. HIPAA, GDPR, CCPA compliant.

Why PDF Redaction Is Harder Than It Looks

Finding PII is hard. Regex catches patterns like SSNs, but misses context-dependent data like names and addresses. You need ML to close that gap.
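
To make the gap concrete, here is a minimal sketch using only the standard java.util.regex package. The class name, the simplified SSN pattern, and the sample sentence are illustrative assumptions, not part of PDFDancer. The pattern finds the SSN because it has a fixed shape, but nothing in it can recognize the name or the street address in the same sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPiiDemo {
    // Simplified SSN shape (illustrative): 3 digits, 2 digits, 4 digits, dash-separated.
    static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // Returns every substring of the text that matches the SSN shape.
    static List<String> findSsns(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text containing a name, an SSN, and an address.
        String text = "Patient John Smith, SSN 123-45-6789, lives at 42 Elm Street.";
        // Prints [123-45-6789]; the name and address go undetected.
        System.out.println(findSsns(text));
    }
}
```

A name or address has no fixed shape to match, so adding more patterns does not solve the problem; that is the part that calls for context-aware detection.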

Removing it is harder. PDFs weren't built for editing: what looks like "John Smith" on screen might be scattered across multiple internal objects. Most tools just draw black boxes over text, but the original content stays in the file, where copy-paste or any text extractor can still recover it.

Get either side wrong and you have a compliance gap.

The Limitations of Typical Redaction Tools

  • Pattern matching alone misses context-dependent PII like names and addresses
  • Overlay-based redaction hides text visually but doesn't remove it from the file
  • No confidence scoring — you can't tell good detections from false positives
  • No audit trail — you can't prove what was removed or when

What PDFDancer Changes

  • ML-powered detection — context-aware entity recognition across 20+ PII types
  • True binary-level removal — content permanently deleted, not covered up
  • Confidence scores — filter detections by threshold to control precision vs. recall
  • Audit trails — verifiable proof of what was redacted and when

HIPAA-Compliant PII Redaction in Java

ML-powered entity detection across 20+ PII categories with confidence scoring. Filter by threshold to control precision vs. recall.
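
As a rough illustration of threshold filtering, the sketch below uses a hypothetical Detection type and made-up confidence values. It does not show PDFDancer's actual API, only the general idea: a higher threshold keeps fewer, more certain detections (favoring precision), while a lower one keeps more candidates (favoring recall).

```java
import java.util.List;
import java.util.stream.Collectors;

class ThresholdFilterDemo {
    // Hypothetical detection shape for illustration; not PDFDancer's actual type.
    static final class Detection {
        final String entityType;
        final String text;
        final double confidence;

        Detection(String entityType, String text, double confidence) {
            this.entityType = entityType;
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Keeps only detections whose confidence is at or above the threshold.
    static List<Detection> atOrAbove(List<Detection> detections, double threshold) {
        return detections.stream()
                .filter(d -> d.confidence >= threshold)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up detections; the last one is a likely false positive.
        List<Detection> detections = List.of(
                new Detection("PERSON", "John Smith", 0.97),
                new Detection("ADDRESS", "42 Elm Street", 0.81),
                new Detection("PERSON", "Elm", 0.35));

        System.out.println(atOrAbove(detections, 0.90).size()); // prints 1
        System.out.println(atOrAbove(detections, 0.30).size()); // prints 3
    }
}
```

In a compliance setting, a reasonable pattern is to redact high-confidence detections automatically and route the band below the threshold to human review rather than discard it.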

Redaction Accuracy Benchmarks

Real benchmark results from PDFDancer's automated redaction engine on common HIPAA entity categories. You control the confidence threshold and decide what to redact.

HIPAA Entity Category     Precision   Recall    F1 Score
Person                    97.43%      96.28%    0.969
Dates of Birth            100%        92.57%    0.961
Account Number / SSN      85.27%      93.93%    0.894
Addresses                 99.43%      91.22%    0.951
Phone / Fax Numbers       94.12%      96.3%     0.952
Email Addresses           99.58%      99.98%    0.998
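
For readers who want to see how the F1 column relates to the other two, here is a small sketch. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the class and method names are illustrative.

```java
import java.util.Locale;

class F1Check {
    // F1 = 2PR / (P + R), with precision and recall given as fractions in [0, 1].
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid dividing by zero when both inputs are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall taken from the table rows above.
        System.out.println(String.format(Locale.ROOT, "Person: %.3f", f1(0.9743, 0.9628)));
        System.out.println(String.format(Locale.ROOT, "Email Addresses: %.3f", f1(0.9958, 0.9998)));
    }
}
```

Rounded to three decimals, these come out to 0.969 and 0.998, matching the table. Because the harmonic mean is pulled toward the lower of the two inputs, the Account Number / SSN row has the lowest F1 despite its strong recall.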

Let’s Talk About Your Use Case

15-minute call — we’ll walk through your document pipeline and show how PDFDancer fits.