PDFDancer
Accurate enough to automate.
A developer SDK for compliance-grade PII detection and true binary-level redaction. Powered by our own ML model. Deploy on-prem or in the cloud.
Not an app — infrastructure. Integrate into your document pipeline via SDK or on-prem deployment. You control the workflow and thresholds. We handle detection and removal.
Understands document structure and context, not just keyword matching. Handles invisible text, vector text, and text in images.
Trained on a massive synthetic dataset. Runs on our infrastructure — no data sent to external AI providers. Returns labeled findings with confidence scores.
Binary-level removal from the PDF. The underlying data isn't covered up — it's permanently eliminated from the file.
Feed in PDFs: scanned, digital, or mixed.
Extract text from scanned pages, images, and non-standard text layers.
Semantic engine parses document structure, context, and entity relationships.
ML engine labels every detected entity with a confidence score.
Your logic sets the rules: which labels, what threshold, what action.
Binary-level removal. Clean at the file level, not cosmetically masked.
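As a rough illustration of the "your logic sets the rules" step, the sketch below applies per-label confidence thresholds to a list of findings and picks an action for each. Everything in it is an assumption made for illustration: the `Finding` shape, the label names, the threshold values, and the `decide` helper are hypothetical and are not PDFDancer's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one labeled finding (not the SDK's real schema)."""
    label: str
    text: str
    page: int
    confidence: float


# Hypothetical per-label thresholds chosen by the integrating team.
THRESHOLDS = {"PERSON": 0.90, "EMAIL": 0.80, "SSN": 0.70}

# Findings that fall this far or less below their threshold go to a human.
REVIEW_MARGIN = 0.15


def decide(finding: Finding) -> str:
    """Return 'redact', 'review', or 'ignore' for a single finding."""
    threshold = THRESHOLDS.get(finding.label)
    if threshold is None:
        return "ignore"  # label is out of scope for this workflow
    if finding.confidence >= threshold:
        return "redact"
    if finding.confidence >= threshold - REVIEW_MARGIN:
        return "review"
    return "ignore"


findings = [
    Finding("PERSON", "Jane Doe", 1, 0.97),
    Finding("SSN", "123-45-6789", 2, 0.62),
    Finding("EMAIL", "jane@example.com", 2, 0.40),
    Finding("LOGO", "ACME", 3, 0.99),
]
decisions = [decide(f) for f in findings]
```

Giving SSNs a lower threshold than names reflects one possible risk posture: when a missed identifier is costlier than an over-redaction, the rule set can favor recall for that label and route borderline cases to manual review.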
The engine is the same. The workflows, entity types, and compliance requirements differ. See how teams in your industry use PDFDancer.
Automated PII detection for discovery, contract redaction, and FOIA compliance. Confidence-scored findings with audit trails for court.
See the legal solution →
CSR redaction for EMA Policy 0070, patient de-identification for HIPAA, and TMF batch processing with recall-optimized detection.
See the clinical trials solution →
Redact PII from loan applications, KYC files, and audit trails. Confidence scoring tuned to your risk tolerance.
Coming soon →
FOIA-ready document preparation with automated PII removal. Deployable on-prem for strict data residency requirements.
Coming soon →
Redaction is high-stakes. You need to know exactly how well the engine performs before you trust it with real data. We agree, so we don't hide behind vague claims.
By HIPAA entity category, measured on our English-language benchmark dataset:
| Category | Recall | Precision | F1 Score |
|---|---|---|---|
| Person | 96.28% | 97.43% | 0.969 |
| Dates of Birth | 92.57% | 100% | 0.961 |
| Account Number / SSN | 93.93% | 85.27% | 0.894 |
| Addresses | 91.22% | 99.43% | 0.951 |
| Phone / Fax Numbers | 96.30% | 94.12% | 0.952 |
| Email Addresses | 99.98% | 99.58% | 0.998 |
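For readers who want to check the table, the F1 column can be recomputed from the other two columns, since F1 is the harmonic mean of precision and recall. The short sketch below does that with the published figures; the function and variable names are ours, chosen for illustration.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %) per category, copied from the table above.
rows = {
    "Person": (96.28, 97.43),
    "Dates of Birth": (92.57, 100.0),
    "Account Number / SSN": (93.93, 85.27),
    "Addresses": (91.22, 99.43),
    "Phone / Fax Numbers": (96.3, 94.12),
    "Email Addresses": (99.98, 99.58),
}

# Convert percentages to fractions, then round to three decimals like the table.
f1 = {
    name: round(f1_score(recall / 100, precision / 100), 3)
    for name, (recall, precision) in rows.items()
}
```

Rounded to three decimals, the computed values match the F1 Score column for all six categories.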
In our current pilot with a legal services provider, the SDK processes thousands of pages per month with high accuracy on first pass. Manual review time dropped significantly compared to their previous workflow.
These apply to us. We run our own ML models on an ISO 27001-certified compute infrastructure. Your documents are never sent to external AI providers.
The SDK is designed to support your organization's compliance with these frameworks. How you integrate, configure, and deploy it determines your compliance posture — we give you the tools and guidance to get there.
We believe being upfront about scope builds more trust than a features page that over-promises.
The engine processes text. It does not detect or redact faces in photographs, visible handwritten signatures, logos, or other graphical elements.
We support English, German, Spanish, French, and Italian. Additional languages are on our roadmap — talk to us if your use case requires others.
The engine analyzes each page independently. If an entity (such as a name or address) starts on one page and continues on the next, we may miss part of it. This is a known gap for documents with dense, flowing text across page breaks.
You're not just licensing an engine. Depending on your plan, you get the support to deploy it properly and keep it running.
We help you scope the integration — document types, entity categories, confidence thresholds, edge cases specific to your domain.
Hands-on support for deployment, whether you're calling our cloud API or installing on-prem in a locked-down environment.
SLAs, model updates, new language and entity support as we ship it, and a dedicated account contact for enterprise customers.
Detection, redaction, and audit trail included. Requires Pro plan ($199/month). No minimum volume.
Example: 10,000 pages/month = $199 + $2,000 = $2,199/month
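The arithmetic in that example can be sketched as a small estimator. Note the assumptions: the per-page rate of $0.20 is inferred from the example itself ($2,000 for 10,000 pages), and the sketch assumes flat per-page pricing with no volume tiers, which may not match actual billing.

```python
from decimal import Decimal

PRO_PLAN_MONTHLY = Decimal("199")  # Pro plan fee stated above
PER_PAGE_RATE = Decimal("0.20")    # assumption: inferred from $2,000 / 10,000 pages


def monthly_cost(pages: int) -> Decimal:
    """Estimate the monthly bill in USD, assuming flat per-page pricing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRO_PLAN_MONTHLY + PER_PAGE_RATE * pages


cost_10k = monthly_cost(10_000)  # Pro plan plus 10,000 pages of usage
```

Using `Decimal` instead of floats keeps currency amounts exact, which matters once estimates are summed across months or teams.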
For teams that need data residency, high-volume pricing, custom SLAs, or dedicated support.
Use the redaction SDK from your language of choice. All SDKs support the full redaction pipeline.
Book a short call and we'll figure out if we're a fit for your use case. We'll ask about your document types, volume, and compliance requirements — and tell you honestly whether our SDK is the right tool.