PDFDancer

Automated Redaction SDK

Accurate enough to automate.

A developer SDK for compliance-grade PII detection and true binary-level redaction. Powered by our own ML model. Deploy on-prem or in the cloud.

How the engine works

Not an app — infrastructure. Integrate into your document pipeline via SDK or on-prem deployment. You control the workflow and thresholds. We handle detection and removal.

Semantic Analysis

Understands document structure and context, not just keyword matching. Handles invisible text, vector text, and text in images.

Purpose-Built ML

Trained on a massive synthetic dataset. Runs on our infrastructure — no data sent to external AI providers. Returns labeled findings with confidence scores.

True Redaction

Binary-level removal from the PDF. The underlying data isn't covered up — it's permanently eliminated from the file.

From raw document to clean output

1. Ingest

Feed in PDFs: scanned, digital, or mixed.

2. OCR

Extract text from scanned pages, images, and non-standard text layers.

3. Analyze

Semantic engine parses document structure, context, and entity relationships.

4. Classify

ML engine labels every detected entity with a confidence score.

5. You Decide

Your logic sets the rules: which labels, what threshold, what action.

6. Redact

Binary-level removal. Clean at the file level, not cosmetically masked.
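
The "You Decide" step is where your own code sits between classification and redaction. The sketch below is a minimal illustration in plain Python, not the SDK's actual API: the `Finding` shape, the label names, and the threshold values are all assumptions chosen for the example. It maps each labeled finding to one of three actions based on a per-label policy.

```python
from dataclasses import dataclass


# Hypothetical shape of one finding; the real SDK's field names may differ.
@dataclass(frozen=True)
class Finding:
    label: str         # entity label, e.g. "PERSON" (assumed name)
    text: str          # the detected span
    confidence: float  # score between 0.0 and 1.0


# Per-label policy: (redact_at, review_at). Values are illustrative only.
# Labels that are missing from the policy are ignored.
POLICY = {
    "PERSON": (0.90, 0.60),
    "EMAIL": (0.80, 0.50),
    "SSN": (0.70, 0.30),
}


def decide(finding: Finding, policy=POLICY) -> str:
    """Return "redact", "review", or "ignore" for one finding."""
    thresholds = policy.get(finding.label)
    if thresholds is None:
        return "ignore"
    redact_at, review_at = thresholds
    if finding.confidence >= redact_at:
        return "redact"
    if finding.confidence >= review_at:
        return "review"
    return "ignore"


findings = [
    Finding("PERSON", "Jane Doe", 0.97),
    Finding("PERSON", "Mercury", 0.65),
    Finding("SSN", "078-05-1120", 0.72),
    Finding("LOGO", "ACME", 0.99),
]
decisions = [decide(f) for f in findings]
```

Lowering `redact_at` for a label trades precision for recall on that label, which is why a high-risk category such as the assumed `"SSN"` gets a lower bar here than `"PERSON"`.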

Redaction by industry

The engine is the same. The workflows, entity types, and compliance requirements differ. See how teams in your industry use PDFDancer.

Legal & eDiscovery

Automated PII detection for discovery, contract redaction, and FOIA compliance. Confidence-scored findings with audit trails for court.

See the legal solution →

Clinical Trials & Healthcare

CSR redaction for EMA Policy 0070, patient de-identification for HIPAA, and TMF batch processing with recall-optimized detection.

See the clinical trials solution →

Financial Services

Redact PII from loan applications, KYC files, and audit trails. Confidence scoring tuned to your risk tolerance.

Coming soon →

Government & Public Records

FOIA-ready document preparation with automated PII removal. Deployable on-prem for strict data residency requirements.

Coming soon →

We publish our numbers. Here they are.

Redaction is high-stakes. You need to know exactly how well the engine performs before you trust it with real data. We agree — so we don't hide behind vague claims.

Detection Performance

By HIPAA entity category, measured on our English-language benchmark dataset:

Category               Recall    Precision   F1 Score
Person                 96.28%    97.43%      0.969
Dates of Birth         92.57%    100%        0.961
Account Number / SSN   93.93%    85.27%      0.894
Addresses              91.22%    99.43%      0.951
Phone / Fax Numbers    96.3%     94.12%      0.952
Email Addresses        99.98%    99.58%      0.998
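
For readers who want to check the figures, F1 is the harmonic mean of precision and recall. The short Python sketch below recomputes the F1 column from the published recall and precision values; the function and variable names are ours, chosen for illustration.

```python
def f1_score(recall_pct: float, precision_pct: float) -> float:
    """Harmonic mean of precision and recall, both given as percentages."""
    r, p = recall_pct / 100, precision_pct / 100
    return 2 * p * r / (p + r)


# (recall %, precision %) copied from the published benchmark figures.
benchmark = {
    "Person": (96.28, 97.43),
    "Dates of Birth": (92.57, 100.0),
    "Account Number / SSN": (93.93, 85.27),
    "Addresses": (91.22, 99.43),
    "Phone / Fax Numbers": (96.3, 94.12),
    "Email Addresses": (99.98, 99.58),
}

# Round to three decimals, matching the precision of the F1 column.
f1 = {name: round(f1_score(r, p), 3) for name, (r, p) in benchmark.items()}
```

The recomputed values agree with the F1 column for every category. Account Number / SSN has the lowest F1 because its precision is the weakest of the six, so the engine over-flags in that category more than it misses.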

In Production

In our current pilot with a legal services provider, the SDK processes thousands of pages per month with high accuracy on first pass. Manual review time dropped significantly compared to their previous workflow.

Compliance & Certifications

We are certified / compliant

ISO 27001 — certified
GDPR — compliant
Own Infrastructure — no third-party AI providers

These apply to us. We run our own ML models on an ISO 27001-certified compute infrastructure. Your documents are never sent to external AI providers.

We help you build compliant workflows for

HIPAA
UK GDPR
CCPA
Safe Harbor
21 CFR Part 11
EMA Policy 0070

The SDK is designed to support your organization's compliance with these frameworks. How you integrate, configure, and deploy it determines your compliance posture — we give you the tools and guidance to get there.

What we don't do (yet)

We believe being upfront about scope builds more trust than a features page that over-promises.

Non-text content

The engine processes text. It does not detect or redact faces in photographs, visible handwritten signatures, logos, or other graphical elements.

Languages beyond our current set

We support English, German, Spanish, French, and Italian. Additional languages are on our roadmap — talk to us if your use case requires others.

Entities that span page boundaries

The engine analyzes each page independently. If an entity (such as a name or address) starts on one page and continues on the next, we may miss part of it. This is a known gap for documents with dense, flowing text across page breaks.
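
One way to work around this gap in your own pipeline is to send any finding that touches the first or last character of a page's text to manual review, since those are the ones most likely to be cut off. The Python sketch below illustrates the idea under stated assumptions: it supposes you have each page's extracted text and character offsets for each finding, and the `PageFinding` type and helper names are hypothetical, not part of the SDK.

```python
from dataclasses import dataclass


# Hypothetical finding with character offsets into its page's text.
@dataclass(frozen=True)
class PageFinding:
    page: int   # 1-based page number
    start: int  # offset of the first character
    end: int    # offset one past the last character


def touches_page_edge(finding, page_text, margin=0):
    """True if the finding starts or ends within `margin` characters of the
    first or last non-whitespace character on the page."""
    first = len(page_text) - len(page_text.lstrip())
    last = len(page_text.rstrip())
    return finding.start <= first + margin or finding.end >= last - margin


def flag_for_review(findings, pages, margin=0):
    """Keep only findings that may continue across a page break."""
    return [f for f in findings if touches_page_edge(f, pages[f.page], margin)]


pages = {
    1: "Invoice for Jane Doe\nShip to: 12 Harbor",
    2: "Street, Springfield\nTotal due: 40 EUR",
}


def span(page, text):
    """Build a finding by locating `text` on the given page (demo helper)."""
    start = pages[page].index(text)
    return PageFinding(page, start, start + len(text))


findings = [span(1, "Jane Doe"), span(1, "12 Harbor"), span(2, "Street, Springfield")]
flagged = flag_for_review(findings, pages)
```

Here the two halves of the address are flagged because one ends page 1 and the other opens page 2, while the name in the middle of page 1 is not. A nonzero `margin` widens the net at the cost of more manual review.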

More than an API call

You're not just licensing an engine. Depending on your plan, you get the support to deploy it properly and keep it running.

Consulting

We help you scope the integration — document types, entity categories, confidence thresholds, edge cases specific to your domain.

Implementation

Hands-on support for deployment, whether you're calling our cloud API or installing on-prem in a locked-down environment.

Ongoing Support

SLAs, model updates, new language and entity support as we ship it, and a dedicated account contact for enterprise customers.

Simple pricing. No per-seat licenses.

On-Prem / Enterprise
Custom

For teams that need data residency, high-volume pricing, custom SLAs, or dedicated support.

Integrate with your stack

Use the redaction SDK from your language of choice. All SDKs support the full redaction pipeline.

15 minutes. No pitch deck.

Book a short call and we'll figure out if we're a fit for your use case. We'll ask about your document types, volume, and compliance requirements — and tell you honestly whether our SDK is the right tool.