Extract Text by Expression - RegEx Search API
PDF4me Extract Text by Expression extracts text that matches a regular expression from PDF documents. The API receives the PDF content and the pattern through a REST call; the file is Base64-encoded so that its binary content can be carried in a JSON body. With support for complex regular expressions and flexible page targeting, it suits document processing and data extraction workflows.
Authenticating Your API Request
To access the PDF4me REST API, every request must include proper authentication credentials. Authentication ensures secure communication and validates your identity as an authorized user of the REST API.
Key Features
- Regular Expression Support: Extract text using regex patterns for precise text matching
- Flexible Page Targeting: Process specific pages or entire documents with custom page sequences
- Pattern Matching: Support for complex regular expressions and pattern recognition
- Base64 Encoding: File content is transmitted as a Base64 string inside the JSON request
- Simple API Integration: RESTful API designed for automated text extraction workflows
REST API Endpoint
The PDF4me REST API uses standard HTTP methods to interact with resources. All text extraction by expression operations are performed through a single endpoint:
- Method: POST
- Endpoint: /api/v2/ExtractTextByExpression
REST API Parameters
Complete list of parameters for the Extract Text by Expression REST API. Parameters are organized by category for better understanding and implementation.
Important: Parameters marked with an asterisk (*) are required and must be provided for the API to function correctly.
Required Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| docContent* | Base64 (String) | The content of the input PDF file in Base64 format | JVBERi... |
| docName* | String | Source PDF file name with proper file extension | output.pdf |
| expression* | String | Regular expression pattern for text extraction. Supports standard regex syntax including groups, quantifiers, and anchors | % or [A-Za-z]+ |
| pageSequence* | String | Specify which pages to process. Use "1-" for all pages, "1-3" for range, or "1,2,3" for specific pages | 1-3 |
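The pageSequence formats in the table can be previewed on the client before a request is sent. The sketch below is only an illustration: expandPageSequence is a hypothetical helper, not part of the PDF4me API, and pageCount is something the caller must already know. The service does its own parsing and may accept or reject values differently.

```javascript
// Hypothetical client-side helper. The service parses pageSequence itself;
// this only previews which pages a given value would select.
function expandPageSequence(sequence, pageCount) {
  const pages = new Set();
  for (const part of sequence.split(',')) {
    const match = /^\s*(\d+)\s*(?:(-)\s*(\d*))?\s*$/.exec(part);
    if (!match) {
      throw new Error('Invalid page sequence part: "' + part + '"');
    }
    const start = Number(match[1]);
    const isRange = match[2] === '-';
    const openEnded = isRange && match[3] === '';
    // "N" is one page, "N-M" a closed range, "N-" runs to the last page.
    const end = !isRange ? start : openEnded ? pageCount : Number(match[3]);
    if (start < 1 || (!openEnded && end < start)) {
      throw new Error('Invalid page range: "' + part.trim() + '"');
    }
    for (let page = start; page <= Math.min(end, pageCount); page++) {
      pages.add(page);
    }
  }
  return [...pages].sort((a, b) => a - b);
}

console.log(expandPageSequence('1-', 4));     // pages 1, 2, 3, 4
console.log(expandPageSequence('1-3', 10));   // pages 1, 2, 3
console.log(expandPageSequence('1,2,3', 10)); // pages 1, 2, 3
```

Ranges that run past pageCount are clipped rather than rejected, which is a design choice of this sketch and not documented service behavior.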
Optional Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| async | Boolean | Enable asynchronous processing. When true, the API returns 202 Accepted with a Location header for polling the result | true |
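A minimal sketch of assembling the request body from the parameters above, assuming a Node.js environment where Buffer is available. buildRequestBody is a hypothetical helper, not part of any PDF4me SDK. It Base64-encodes the PDF bytes into docContent and refuses to build a body when a required field is missing.

```javascript
// Hypothetical helper: assembles the JSON body described in the parameter tables.
function buildRequestBody({ pdfBytes, docName, expression, pageSequence, async: runAsync }) {
  const required = { pdfBytes, docName, expression, pageSequence };
  for (const [name, value] of Object.entries(required)) {
    if (value === undefined || value === null || value === '') {
      throw new Error('Missing required parameter: ' + name);
    }
  }
  const requestBody = {
    // docContent carries the raw PDF bytes as a Base64 string.
    docContent: Buffer.from(pdfBytes).toString('base64'),
    docName,
    expression,
    pageSequence,
  };
  // async is optional; include it only when the caller sets it.
  if (runAsync !== undefined) {
    requestBody.async = Boolean(runAsync);
  }
  return requestBody;
}

// PDF files begin with "%PDF-", which is why docContent starts with "JVBERi".
const exampleBody = buildRequestBody({
  pdfBytes: Buffer.from('%PDF-1.4\n'),
  docName: 'output.pdf',
  expression: '%',
  pageSequence: '1-3',
});
console.log(exampleBody.docContent); // JVBERi0xLjQK
```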
Output
The PDF4me Extract Text by Expression REST API returns the extracted text matches as JSON. The status code and response body depend on the processing mode.
Synchronous Processing (Default)
When async is false or not provided, the API returns the extracted text matches immediately.
Status Code: 200 OK
Response Format:
{
"textList": ["extracted text 1", "extracted text 2", "extracted text 3"]
}
The response contains an array of text strings matching the specified regular expression pattern.
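As a rough illustration of consuming this response, the sketch below checks that textList is present and counts how often each string occurs. tallyMatches is a hypothetical helper, and the sample values are made up. Whether the service returns repeated strings for repeated matches is an assumption here, not something the response format states.

```javascript
// Hypothetical helper: validates a parsed 200 response and tallies its matches.
function tallyMatches(responseBody) {
  const matches = responseBody ? responseBody.textList : undefined;
  if (!Array.isArray(matches)) {
    throw new TypeError('Expected a "textList" array in the response body');
  }
  const counts = new Map();
  for (const text of matches) {
    counts.set(text, (counts.get(text) || 0) + 1);
  }
  return counts;
}

// Sample body in the documented shape; the values are invented for illustration.
const sampleBody = JSON.parse('{"textList": ["15%", "20%", "15%"]}');
const sampleCounts = tallyMatches(sampleBody);
console.log(sampleCounts.get('15%')); // 2
console.log(sampleCounts.size);       // 2
```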
Asynchronous Processing
When async is true, the API processes the document asynchronously.
Initial Response:
Status Code: 202 Accepted
Response Headers:
Location: https://api.pdf4me.com/api/v2/ExtractTextByExpressionStatus/{operationId}
Response Body:
{
"traceId": "operation-trace-id"
}
Polling for Results:
Use the Location header URL to poll for completion:
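// Poll until 200 OK; delayMs and maxAttempts are illustrative defaults.
async function pollForResult(locationUrl, encodedApiKey, delayMs = 2000, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(locationUrl, {
      headers: { 'Authorization': 'Basic ' + encodedApiKey }
    });
    if (response.status === 200) {
      return await response.json(); // extracted text matches
    }
    if (response.status >= 400) {
      throw new Error('Polling failed with status ' + response.status);
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('Result not ready after ' + maxAttempts + ' attempts');
}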
Error Responses
| Status Code | Description | Example Response |
|---|---|---|
| 400 Bad Request | Invalid request parameters, missing required fields, or invalid regex pattern | {"error": "Missing required parameter: expression"} |
| 401 Unauthorized | Invalid or missing API key | {"error": "Unauthorized"} |
| 408 Request Timeout | Request processing timeout | {"error": "Request timeout"} |
| 500 Internal Server Error | Server error during processing | {"error": "Internal server error"} |
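One possible way to act on these status codes in client code is sketched below. classifyStatus is a hypothetical helper, with descriptions copied from the table. The retry hints are an assumption of this sketch (timeouts and server errors may succeed later, while 400 and 401 need the request itself fixed), not documented PDF4me guidance.

```javascript
// Descriptions copied from the error table above.
const ERROR_DESCRIPTIONS = {
  400: 'Invalid request parameters, missing required fields, or invalid regex pattern',
  401: 'Invalid or missing API key',
  408: 'Request processing timeout',
  500: 'Server error during processing',
};

// Hypothetical helper: classifies a status code and hints whether a retry is plausible.
function classifyStatus(status) {
  if (status === 200) return { kind: 'done', retry: false };
  if (status === 202) return { kind: 'pending', retry: true };
  const description = ERROR_DESCRIPTIONS[status] || 'Unexpected status ' + status;
  // Assumption: only timeouts and server-side failures are worth retrying unchanged.
  const retry = status === 408 || status >= 500;
  return { kind: 'error', retry, description };
}

console.log(classifyStatus(401).description); // Invalid or missing API key
console.log(classifyStatus(408).retry);       // true
```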
Understanding the JSON Response
The text extraction response is a JSON object containing:
- textList: Array of strings containing every text match for the specified regular expression pattern
Page Sequence Formats:
- "1-": Process all pages from page 1 to the end
- "1-3": Process pages 1, 2, and 3
- "1,2,3": Process specific pages 1, 2, and 3
Expression Examples:
- "%": Match percentage symbols
- "\\d+": Match one or more digits
- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}": Match email addresses
- "https?://[^\\s]+": Match URLs
Request Example
Header
Content-Type: application/json
Authorization: Basic YOUR_BASE64_ENCODED_API_KEY
Note:
- Get your API key from the PDF4me Dashboard
- The API key must be Base64 encoded and prefixed with "Basic " in the Authorization header
- Example: If your API key is abc123, encode it to Base64 and use Authorization: Basic YWJjMTIz
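The encoding step in the note can be sketched as follows, assuming Node.js (Buffer is used for the Base64 conversion). buildHeaders is a hypothetical helper name. It reproduces the abc123 example from the note.

```javascript
// Hypothetical helper: builds the two request headers from a raw API key.
function buildHeaders(apiKey) {
  const encodedKey = Buffer.from(apiKey, 'utf8').toString('base64');
  return {
    'Content-Type': 'application/json',
    'Authorization': 'Basic ' + encodedKey,
  };
}

console.log(buildHeaders('abc123').Authorization); // Basic YWJjMTIz
```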
Payload
Basic Request:
{
"docContent": "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCjIgMCBvYmoKPDwKL1R5cGUgL1BhZ2VzCi9LaWRzIFszIDAgUl0KL0NvdW50IDEKPD4KZW5kb2JqCjMgMCBvYmoKPDwKL1R5cGUgL1BhZ2UKL1BhcmVudCAyIDAgUgovTWVkaWFCb3ggWzAgMCA2MTIgNzkyXQovUmVzb3VyY2VzIDw8Ci9Gb250IDw8Ci9GMSA0IDAgUgo+Pgo+PgovQ29udGVudHMgNSAwIFIKPj4KZW5kb2JqCjQgMCBvYmoKPDwKL1R5cGUgL0ZvbnQKL1N1YnR5cGUgL1R5cGUxCi9CYXNlRm9udCAvSGVsdmV0aWNhCj4+CmVuZG9iago1IDAgb2JqCjw8Ci9MZW5ndGggNDQKPj4Kc3RyZWFtCkJUCi9GMSAxMiBUZgoxMDAgNzAwIFRkCihIZWxsbyBXb3JsZCkgVGoKRVQKZW5kc3RyZWFtCmVuZG9iagp4cmVmCjAgNgowMDAwMDAwMDAwIDY1NTM1IGYgCjAwMDAwMDAwMDkgMDAwMDAgbiAKMDAwMDAwMDA1NCAwMDAwMCBuIAowMDAwMDAwMTAxIDAwMDAwIG4gCjAwMDAwMDAxNzAgMDAwMDAgbiAKMDAwMDAwMDI0NCAwMDAwMCBuIAp0cmFpbGVyCjw8Ci9TaXplIDYKL1Jvb3QgMSAwIFIKPj4Kc3RhcnR4cmVmCjM0MQolJUVPRg==",
"docName": "output.pdf",
"expression": "%",
"pageSequence": "1-3"
}
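Sending a payload like the one above could look roughly like this. It is a sketch under stated assumptions: the host https://api.pdf4me.com is taken from the Location header example and is assumed to serve the endpoint as well, fetch is assumed to be available (as in Node.js 18+ or a browser), and buildFetchArgs and submitExtraction are hypothetical names. The function handles both documented success codes, returning the matches on 200 and the polling URL on 202.

```javascript
// Assumption: the endpoint lives on the host shown in the Location header example.
const BASE_URL = 'https://api.pdf4me.com';
const ENDPOINT = '/api/v2/ExtractTextByExpression';

// Builds the arguments for fetch(); kept separate so it can be checked offline.
function buildFetchArgs(encodedApiKey, payload) {
  return {
    url: BASE_URL + ENDPOINT,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Basic ' + encodedApiKey,
      },
      body: JSON.stringify(payload),
    },
  };
}

// Sends the request. Resolves to the matches (200) or to the polling URL (202).
async function submitExtraction(encodedApiKey, payload, fetchImpl = fetch) {
  const { url, options } = buildFetchArgs(encodedApiKey, payload);
  const response = await fetchImpl(url, options);
  if (response.status === 200) {
    const result = await response.json();
    return { done: true, textList: result.textList };
  }
  if (response.status === 202) {
    return { done: false, pollUrl: response.headers.get('Location') };
  }
  throw new Error('Request failed with status ' + response.status);
}
```

The fetchImpl parameter exists only so the function can be exercised with a stub instead of a live network call.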
With Asynchronous Processing:
{
"docContent": "JVBERi0xLjQKJeLjz9MK...",
"docName": "output.pdf",
"expression": "\\d+",
"pageSequence": "1-3",
"async": true
}
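The doubled backslash in "\\d+" above is JSON string escaping: the pattern the service receives is \d+. The sketch below shows that JSON.stringify produces the escaped form automatically, and adds a hypothetical previewMatches helper for trying a pattern locally. The preview uses JavaScript's own regex engine, which may differ in detail from the engine behind the API, so treat its output as an approximation.

```javascript
// Patterns written as plain regex source (single backslashes).
const patterns = {
  digits: String.raw`\d+`,
  email: String.raw`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`,
  url: String.raw`https?://[^\s]+`,
};

// JSON.stringify doubles each backslash, giving the form used in the payload.
console.log(JSON.stringify({ expression: patterns.digits })); // {"expression":"\\d+"}

// Hypothetical local dry run using JavaScript's RegExp, not the service's engine.
function previewMatches(pattern, text) {
  return Array.from(text.matchAll(new RegExp(pattern, 'g')), (m) => m[0]);
}

const sampleText = 'Invoice 4821: 15% off, contact billing@example.com or https://example.com/pay';
console.log(previewMatches(patterns.digits, sampleText)); // 4821 and 15
console.log(previewMatches(patterns.email, sampleText));  // billing@example.com
console.log(previewMatches(patterns.url, sampleText));    // https://example.com/pay
```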
Code Samples
The PDF4me Extract Text by Expression REST API provides code samples in multiple programming languages. Choose the language that best fits your development environment:
- C#
- Java
- JavaScript
- Python
- Salesforce
- n8n
- Google Script
- AWS Lambda
C# (CSharp) Sample
Complete C# implementation for extracting text by expression from PDF:
Google Script Sample
Google Apps Script implementation for Google Workspace integration:
Text Extraction Features
Regular Expression Processing
- Pattern Matching: Full support for standard regular expression syntax and patterns
- Complex Patterns: Support for advanced regex features including groups, quantifiers, and anchors
- Custom Expressions: Flexible pattern creation for specific text extraction requirements
- Pattern Validation: Built-in validation for regex pattern syntax and compatibility
- Professional Results: Reliable text extraction with accurate pattern matching
Page Processing
- Page Targeting: Extract text from specific pages or entire documents
- Page Sequences: Support for custom page ranges and individual page selection
- Flexible Processing: Process any combination of pages with precise control
- Batch Processing: Handle multiple pages efficiently in single API calls
- Professional Layout: Consistent text extraction across all target pages
Advanced Features
- Text Analysis: Comprehensive text pattern analysis and identification
- Custom Filtering: Advanced filtering options for specific text extraction needs
- Professional Extraction: High-quality text extraction with clear visibility
- Flexible Patterns: Support for any regular expression pattern and text matching requirements
Industry Use Cases & Applications
Legal & Professional Services Use Cases
- Document Analysis: Extract specific data patterns from contracts, invoices, and legal documents
- Contract Analysis: Extract key terms and data patterns from legal contracts
- Compliance Monitoring: Extract regulatory information and compliance data from documents
- Legal Data Extraction: Extract specific data patterns from legal documents
Finance & Banking Use Cases
- Invoice Processing: Extract invoice data and field values using pattern matching
- Financial Reports: Extract specific metrics and data points from PDF reports
- Compliance Monitoring: Extract regulatory information and compliance data from documents
- Financial Data Mining: Extract structured data from financial documents
Business & Enterprise Use Cases
- Data Mining: Identify and extract structured data from unstructured PDF documents
- Content Processing: Extract specific text patterns for content management and analysis
- Form Processing: Extract form data and field values using pattern matching
- Business Intelligence: Extract key business metrics and performance indicators from PDF reports
Education & Research Use Cases
- Research Applications: Extract specific data patterns from research papers and academic documents
- Academic Data Extraction: Extract structured data from research papers
- Research Analysis: Extract specific metrics from academic documents
- Educational Content Processing: Extract text patterns from educational documents
Government & Compliance Use Cases
- Compliance Monitoring: Extract regulatory information and compliance data from documents
- Regulatory Data Extraction: Extract structured data from regulatory documents
- Government Reports: Extract specific metrics from government reports
- Public Records: Extract data patterns from public records and documents