PDFDancer

Automated Redaction SDK

Accurate enough to automate.

A developer SDK for compliance-grade PII detection and true binary-level redaction. Powered by our own ML model. Deploy on-prem or in the cloud.

How the engine works

Not an app — infrastructure. Integrate into your document pipeline via SDK or on-prem deployment. You control the workflow and thresholds. We handle detection and removal.

Semantic Analysis

Understands document structure and context, not just keyword matching. Handles invisible text, vector text, and text in images.

Purpose-Built ML

Trained on a massive synthetic dataset. Runs on our infrastructure — no data sent to external AI providers. Returns labeled findings with confidence scores.

True Redaction

Binary-level removal from the PDF. The underlying data isn't covered up — it's permanently eliminated from the file.

From raw document to clean output

Ingest

Feed in PDFs: scanned, digital, or mixed.

OCR

Extract text from scanned pages, images, and non-standard text layers.

Analyze

Semantic engine parses document structure, context, and entity relationships.

Classify

ML engine labels every detected entity with a confidence score.

You Decide

Your logic sets the rules: which labels, what threshold, what action.

Redact

Binary-level removal. Clean at the file level, not cosmetically masked.
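
The "You Decide" step above is where your own code comes in. Here is a minimal sketch of what such a policy could look like, assuming findings arrive as records with a label and a confidence score. The `Finding` class, the label names, the threshold values, and the review margin are all hypothetical placeholders for illustration, not the SDK's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical shape of one detected entity; the real SDK's fields may differ.
    label: str
    text: str
    page: int
    confidence: float


# Hypothetical policy: per-label confidence thresholds chosen by the integrator.
THRESHOLDS = {"PERSON": 0.90, "SSN": 0.70, "EMAIL": 0.95}

# Findings that fall this far below their threshold go to a human reviewer.
REVIEW_MARGIN = 0.15


def decide(finding: Finding) -> str:
    """Return 'redact', 'review', or 'ignore' for a single finding."""
    threshold = THRESHOLDS.get(finding.label)
    if threshold is None:
        return "ignore"  # label is not covered by this policy
    if finding.confidence >= threshold:
        return "redact"
    if finding.confidence >= threshold - REVIEW_MARGIN:
        return "review"
    return "ignore"


# Made-up sample findings, only to exercise the policy.
findings = [
    Finding("PERSON", "Jane Doe", 1, 0.97),
    Finding("SSN", "123-45-6789", 2, 0.62),
    Finding("EMAIL", "jane@example.com", 2, 0.50),
    Finding("LOGO", "ACME", 3, 0.99),
]

actions = [decide(f) for f in findings]
for finding, action in zip(findings, actions):
    print(f"page {finding.page}: {finding.label} ({finding.confidence:.2f}) -> {action}")
```

A lower threshold for a label trades precision for recall on that label, which is usually the right direction for high-risk identifiers; the review band keeps borderline findings in front of a person instead of silently dropping them.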

Redaction by industry

The engine is the same. The workflows, entity types, and compliance requirements differ. See how teams in your industry use PDFDancer.

Legal & eDiscovery

Automated PII detection for discovery, contract redaction, and FOIA compliance. Confidence-scored findings with audit trails for court.

See the legal solution →

Clinical Trials & Healthcare

CSR redaction for EMA Policy 0070, patient de-identification for HIPAA, and TMF batch processing with recall-optimized detection.

See the clinical trials solution →

Financial Services

Redact PII from loan applications, KYC files, and audit trails. Confidence scoring tuned to your risk tolerance.

Coming soon →

Government & Public Records

FOIA-ready document preparation with automated PII removal. Deployable on-prem for strict data residency requirements.

Coming soon →

We publish our numbers. Here they are.

Redaction is high-stakes. You need to know exactly how well the engine performs before you trust it with real data. We agree — so we don't hide behind vague claims.

Detection Performance

By HIPAA entity category, measured on our English-language benchmark dataset:

Category	Recall	Precision	F1 Score
Person	96.28%	97.43%	0.969
Dates of Birth	92.57%	100%	0.961
Account Number / SSN	93.93%	85.27%	0.894
Addresses	91.22%	99.43%	0.951
Phone / Fax Numbers	96.3%	94.12%	0.952
Email Addresses	99.98%	99.58%	0.998
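
To check the table yourself: F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the published recall and precision columns and compares the result with the published F1 column. The function and variable names are ours, chosen for illustration.

```python
def f1_score(recall_pct: float, precision_pct: float) -> float:
    """Harmonic mean of precision and recall, both given as percentages."""
    r = recall_pct / 100
    p = precision_pct / 100
    return 2 * p * r / (p + r)


# (recall %, precision %, published F1) copied from the table above.
published = {
    "Person": (96.28, 97.43, 0.969),
    "Dates of Birth": (92.57, 100.0, 0.961),
    "Account Number / SSN": (93.93, 85.27, 0.894),
    "Addresses": (91.22, 99.43, 0.951),
    "Phone / Fax Numbers": (96.3, 94.12, 0.952),
    "Email Addresses": (99.98, 99.58, 0.998),
}

for category, (recall, precision, f1_published) in published.items():
    f1 = f1_score(recall, precision)
    # The published values are rounded to three decimals.
    matches = abs(f1 - f1_published) <= 0.0005
    print(f"{category}: computed {f1:.3f}, published {f1_published:.3f}, match={matches}")
```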

In Production

In our current pilot with a legal services provider, the SDK processes thousands of pages per month with high accuracy on first pass. Manual review time dropped significantly compared to their previous workflow.

Compliance & Certifications

We are certified / compliant

ISO 27001 — certified

GDPR — compliant

Own Infrastructure — no third-party AI providers

These apply to us. We run our own ML models on an ISO 27001-certified compute infrastructure. Your documents are never sent to external AI providers.

We help you build compliant workflows for

HIPAA

UK GDPR

CCPA

Safe Harbor

21 CFR Part 11

EMA Policy 0070

The SDK is designed to support your organization's compliance with these frameworks. How you integrate, configure, and deploy it determines your compliance posture — we give you the tools and guidance to get there.

What we don't do (yet)

We believe being upfront about scope builds more trust than a features page that over-promises.

Non-text content

The engine processes text. It does not detect or redact faces in photographs, visible handwritten signatures, logos, or other graphical elements.

Languages beyond our current set

We support English, German, Spanish, French, and Italian. Additional languages are on our roadmap — talk to us if your use case requires others.

Entities that span page boundaries

The engine analyzes each page independently. If an entity (such as a name or address) starts on one page and continues on the next, we may miss part of it. This is a known gap for documents with dense, flowing text across page breaks.

More than an API call

You're not just licensing an engine. Depending on your plan, you get the support to deploy it properly and keep it running.

Consulting

We help you scope the integration — document types, entity categories, confidence thresholds, edge cases specific to your domain.

Implementation

Hands-on support for deployment, whether you're calling our cloud API or installing on-prem in a locked-down environment.

Ongoing Support

SLAs, model updates, new language and entity support as we ship it, and a dedicated account contact for enterprise customers.

Simple pricing. No per-seat licenses.

Cloud — Pay as you go

$0.20 / page

Detection, redaction, and audit trail included. Requires Pro plan ($199/month). No minimum volume.

Example: 10,000 pages/month = $199 + $2,000 = $2,199/month
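
For budgeting at other volumes, the same arithmetic can be sketched in a few lines. This assumes the cloud bill is simply the Pro plan fee plus the per-page rate, as in the example above; taxes, discounts, and any other plan details are not modeled, and the function name is ours.

```python
from decimal import Decimal

PRO_PLAN_MONTHLY = Decimal("199.00")  # Pro plan fee from the pricing above
PRICE_PER_PAGE = Decimal("0.20")      # pay-as-you-go rate per page


def monthly_cloud_cost(pages: int) -> Decimal:
    """Estimated monthly cost in USD for a given page volume."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRO_PLAN_MONTHLY + PRICE_PER_PAGE * pages


print(monthly_cloud_cost(10_000))  # 2199.00, matching the example above
```

`Decimal` is used instead of floats so that currency amounts stay exact.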

On-Prem / Enterprise

Custom

For teams that need data residency, high-volume pricing, custom SLAs, or dedicated support.

Integrate with your stack

Use the redaction SDK from your language of choice. All SDKs support the full redaction pipeline.

Python Redaction →
Node.js Redaction →
Java Redaction →

Python SDK →
Node.js / TypeScript SDK →
Java SDK →

How-to: Redact PDFs →
How-to: Batch Redaction →

15 minutes. No pitch deck.

Book a short call and we'll figure out if we're a fit for your use case. We'll ask about your document types, volume, and compliance requirements — and tell you honestly whether our SDK is the right tool.