Extract Text from Word - Content Parser API
PDF4me Extract Text from Word extracts text content from Word documents, with options for page range and content filtering. The API receives the Word document through a REST call, with the file content Base64-encoded so that it can be carried in the JSON request body. With support for page range selection, comment removal, header/footer filtering, and change acceptance, it suits document processing and content analysis workflows.
Authenticating Your API Request
To access the PDF4me REST API, every request must include proper authentication credentials. Authentication ensures secure communication and validates your identity as an authorized user of the REST API.
Key Features
- Word Document Support: Extract text from Microsoft Word documents (.doc, .docx)
- Page Range Selection: Process specific page ranges or entire documents with custom start and end page numbers
- Content Filtering: Remove comments, headers, and footers, and accept tracked changes for clean text extraction
- Base64 Encoding: File content is transmitted as a Base64-encoded string in the JSON request body
- Simple API Integration: RESTful API designed for automated Word document processing workflows
REST API Endpoint
The PDF4me REST API uses standard HTTP methods to interact with resources. All Word text extraction operations are performed through a single endpoint:
- Method: POST
- Endpoint: /api/v2/ExtractTextFromWord
REST API Parameters
Complete list of parameters for the Extract Text from Word REST API. Parameters are organized by category for better understanding and implementation.
Important: Parameters marked with an asterisk (*) are required and must be provided for the API to function correctly.
Required Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| docName* | String | Source Word document file name (without extension or with .docx/.doc extension) | output |
| docContent* | Base64 (String) | The content of the input Word document in Base64 format | JVBERi... |
| StartPageNumber* | Integer | Starting page number for text extraction | 1 |
| EndPageNumber* | Integer | Ending page number for text extraction | 3 |
| RemoveComments* | Boolean | Choose whether to remove comments from extracted text: true – Remove comments, false – Include comments | true |
| RemoveHeaderFooter* | Boolean | Choose whether to remove headers and footers: true – Remove headers/footers, false – Include headers/footers | true |
| AcceptChanges* | Boolean | Choose whether to accept tracked changes: true – Accept changes, false – Reject changes | true |
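The required parameters above can be assembled into a request body along these lines. This is a minimal sketch, not PDF4me's own sample: the helper name `buildExtractPayload` and the camelCase option names are hypothetical, and it assumes a Node.js runtime where `Buffer` is available.

```javascript
// Assemble the JSON body from raw document bytes.
// `bytes` is a Buffer holding the .doc/.docx file; the keys of the returned
// object mirror the Required Parameters table.
function buildExtractPayload(bytes, docName, options) {
  return {
    docName: docName,
    docContent: bytes.toString('base64'), // Base64-encode the file content
    StartPageNumber: options.startPage,
    EndPageNumber: options.endPage,
    RemoveComments: options.removeComments,
    RemoveHeaderFooter: options.removeHeaderFooter,
    AcceptChanges: options.acceptChanges,
  };
}
```

In Node.js the bytes could come from `fs.readFileSync('report.docx')`; the resulting object is then serialized with `JSON.stringify` before sending.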
Optional Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| async | Boolean | Enable asynchronous processing. When true, the API returns 202 Accepted with a Location header for polling the result | false |
Output
The PDF4me Extract Text from Word REST API returns different responses based on the processing mode. The API returns extracted text as a JSON response or text file.
Synchronous Processing (Default)
When async is false or not provided, the API returns the extracted text immediately.
Status Code: 200 OK
Response Format:
{
"extractedText": "Extracted text content from Word document pages 1 to 3...",
"fileName": "output.txt"
}
Or as a text file:
Extracted text content from Word document pages 1 to 3...
The response contains the extracted text content from the specified page range.
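Because the result may arrive as JSON or as plain text, a client may want to normalise both shapes. The sketch below assumes the two variants can be told apart by the `Content-Type` response header, which the page does not state; `parseExtractResponse` is a hypothetical helper name.

```javascript
// Normalise the two documented response shapes into one object.
// `contentType` is the Content-Type response header value and
// `bodyText` is the raw response body as a string.
function parseExtractResponse(contentType, bodyText) {
  if ((contentType || '').toLowerCase().includes('application/json')) {
    const data = JSON.parse(bodyText);
    return { text: data.extractedText, fileName: data.fileName || null };
  }
  // Assumption: anything that is not JSON is the plain-text variant.
  return { text: bodyText, fileName: null };
}
```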
Asynchronous Processing
When async is true, the API processes the document asynchronously.
Initial Response:
Status Code: 202 Accepted
Response Headers:
Location: https://api.pdf4me.com/api/v2/ExtractTextFromWordStatus/{operationId}
Response Body:
{
"traceId": "operation-trace-id"
}
Polling for Results:
Use the Location header URL to poll for completion:
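// apiKey is the Base64-encoded API key; poll until the result is ready
for (let attempt = 0; attempt < 30; attempt++) {
  const response = await fetch(locationUrl, {
    headers: { 'Authorization': 'Basic ' + apiKey }
  });
  if (response.status === 200) {
    const result = await response.json();
    // Process extracted text
    break;
  }
  // Not ready yet: wait two seconds before polling again
  await new Promise((resolve) => setTimeout(resolve, 2000));
}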
Error Responses
| Status Code | Description | Example Response |
|---|---|---|
| 400 Bad Request | Invalid request parameters or missing required fields | {"error": "Missing required parameter: docContent"} |
| 401 Unauthorized | Invalid or missing API key | {"error": "Unauthorized"} |
| 408 Request Timeout | Request processing timeout | {"error": "Request timeout"} |
| 500 Internal Server Error | Server error during processing | {"error": "Internal server error"} |
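One way a client might act on the status codes in this table is sketched below. The retry decisions are assumptions of this example rather than documented PDF4me behaviour, and both helper names are hypothetical.

```javascript
// Map a status code to a handling decision.
// The retry policy is an assumption of this sketch.
function classifyStatus(status) {
  switch (status) {
    case 200:
      return { ok: true, retry: false, note: 'Result is in the response body' };
    case 202:
      return { ok: true, retry: false, note: 'Accepted; poll the Location header URL' };
    case 400:
      return { ok: false, retry: false, note: 'Fix the request parameters' };
    case 401:
      return { ok: false, retry: false, note: 'Check the API key' };
    case 408:
    case 500:
      return { ok: false, retry: true, note: 'Possibly transient; retry later' };
    default:
      return { ok: false, retry: false, note: 'Status not covered by the table' };
  }
}

// Pull the message out of an error body shaped like {"error": "..."}.
// Falls back to the raw body when it is not JSON of that shape.
function errorMessage(bodyText) {
  try {
    const data = JSON.parse(bodyText);
    return typeof data.error === 'string' ? data.error : bodyText;
  } catch {
    return bodyText;
  }
}
```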
Understanding the Response
The text extraction response can be in two formats:
- JSON Format: Contains extractedText (string) and fileName (string)
- Text Format: Plain text file with the extracted content
Parameter Details:
- StartPageNumber and EndPageNumber: Define the page range (1-based indexing)
- RemoveComments: When true, removes all comments from the Word document before extraction
- RemoveHeaderFooter: When true, removes headers and footers from each page before extraction
- AcceptChanges: When true, accepts all tracked changes in the Word document before extraction
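The parameter rules above could be checked on the client before a request is sent, which avoids a round trip that ends in 400 Bad Request. This is a hedged sketch: `validateOptions` is a hypothetical helper, and the rule that the start page must not exceed the end page is an assumption, not something the page states.

```javascript
// Return a list of problems with the option values; an empty list means
// the values look consistent with the parameter descriptions.
function validateOptions(opts) {
  const errors = [];
  // Page numbers use 1-based indexing.
  for (const name of ['StartPageNumber', 'EndPageNumber']) {
    if (!Number.isInteger(opts[name]) || opts[name] < 1) {
      errors.push(name + ' must be an integer >= 1');
    }
  }
  // Assumption: the range must run forwards.
  if (errors.length === 0 && opts.StartPageNumber > opts.EndPageNumber) {
    errors.push('StartPageNumber must not exceed EndPageNumber');
  }
  for (const name of ['RemoveComments', 'RemoveHeaderFooter', 'AcceptChanges']) {
    if (typeof opts[name] !== 'boolean') {
      errors.push(name + ' must be a boolean');
    }
  }
  return errors;
}
```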
Supported Word Formats:
- .docx (Word 2007 and later)
- .doc (Word 97-2003)
Request Example
Header
Content-Type: application/json
Authorization: Basic YOUR_BASE64_ENCODED_API_KEY
Note:
- Get your API key from the PDF4me Dashboard
- The API key must be Base64 encoded and prefixed with "Basic " in the Authorization header
- Example: If your API key is abc123, encode it to Base64 and use Authorization: Basic YWJjMTIz
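The encoding step in the note can be reproduced in a few lines. A minimal sketch, assuming a Node.js runtime with `Buffer`; `buildAuthHeader` is a hypothetical helper name.

```javascript
// Build the Authorization header value from a raw (not yet encoded) API key.
function buildAuthHeader(apiKey) {
  return 'Basic ' + Buffer.from(apiKey, 'utf8').toString('base64');
}

console.log(buildAuthHeader('abc123')); // prints "Basic YWJjMTIz"
```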
Payload
Basic Request:
{
"docContent": "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCjIgMCBvYmoKPDwKL1R5cGUgL1BhZ2VzCi9LaWRzIFszIDAgUl0KL0NvdW50IDEKPD4KZW5kb2JqCjMgMCBvYmoKPDwKL1R5cGUgL1BhZ2UKL1BhcmVudCAyIDAgUgovTWVkaWFCb3ggWzAgMCA2MTIgNzkyXQovUmVzb3VyY2VzIDw8Ci9Gb250IDw8Ci9GMSA0IDAgUgo+Pgo+PgovQ29udGVudHMgNSAwIFIKPj4KZW5kb2JqCjQgMCBvYmoKPDwKL1R5cGUgL0ZvbnQKL1N1YnR5cGUgL1R5cGUxCi9CYXNlRm9udCAvSGVsdmV0aWNhCj4+CmVuZG9iago1IDAgb2JqCjw8Ci9MZW5ndGggNDQKPj4Kc3RyZWFtCkJUCi9GMSAxMiBUZgoxMDAgNzAwIFRkCihIZWxsbyBXb3JsZCkgVGoKRVQKZW5kc3RyZWFtCmVuZG9iagp4cmVmCjAgNgowMDAwMDAwMDAwIDY1NTM1IGYgCjAwMDAwMDAwMDkgMDAwMDAgbiAKMDAwMDAwMDA1NCAwMDAwMCBuIAowMDAwMDAwMTAxIDAwMDAwIG4gCjAwMDAwMDAxNzAgMDAwMDAgbiAKMDAwMDAwMDI0NCAwMDAwMCBuIAp0cmFpbGVyCjw8Ci9TaXplIDYKL1Jvb3QgMSAwIFIKPj4Kc3RhcnR4cmVmCjM0MQolJUVPRg==",
"docName": "output",
"StartPageNumber": 1,
"EndPageNumber": 3,
"RemoveComments": true,
"RemoveHeaderFooter": true,
"AcceptChanges": true
}
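A payload like the one above could be turned into a `fetch` call roughly as follows. The host `https://api.pdf4me.com` is taken from the Location header example on this page; `buildExtractRequest` and its argument names are hypothetical, and the sketch assumes the API key has already been Base64-encoded.

```javascript
const BASE_URL = 'https://api.pdf4me.com'; // host from the Location header example
const ENDPOINT = '/api/v2/ExtractTextFromWord';

// Return the URL and fetch options for a text extraction request.
function buildExtractRequest(payload, encodedApiKey) {
  return {
    url: BASE_URL + ENDPOINT,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Basic ' + encodedApiKey,
      },
      body: JSON.stringify(payload),
    },
  };
}
```

Usage would look like `const { url, options } = buildExtractRequest(payload, key);` followed by `const response = await fetch(url, options);`.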
With Asynchronous Processing:
{
"docContent": "JVBERi0xLjQKJeLjz9MK...",
"docName": "output",
"StartPageNumber": 1,
"EndPageNumber": 3,
"RemoveComments": true,
"RemoveHeaderFooter": true,
"AcceptChanges": true,
"async": true
}
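The asynchronous flow (submit, receive 202 with a Location header, then poll) might be wrapped in one function as sketched here. Several details are assumptions: that a pending poll answers with 202, and the polling interval and attempt limit, which are arbitrary. `extractTextAsync` and `fetchImpl` are hypothetical names; injecting the fetch function lets the flow be exercised without network access.

```javascript
// Submit with async=true, then poll the Location URL until a 200 arrives.
// Resolves with the final response object so the caller can read JSON or text.
async function extractTextAsync(url, payload, encodedApiKey, opts = {}) {
  // fetchImpl defaults to the global fetch (Node 18+ or a browser).
  const { fetchImpl = fetch, intervalMs = 2000, maxAttempts = 30 } = opts;
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Basic ' + encodedApiKey,
  };
  const first = await fetchImpl(url, {
    method: 'POST',
    headers,
    body: JSON.stringify({ ...payload, async: true }),
  });
  if (first.status === 200) return first; // finished immediately
  if (first.status !== 202) throw new Error('Unexpected status ' + first.status);

  const locationUrl = first.headers.get('Location');
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const poll = await fetchImpl(locationUrl, { headers });
    if (poll.status === 200) return poll;
    // Assumption: 202 means "still processing"; anything else is an error.
    if (poll.status !== 202) throw new Error('Polling failed with status ' + poll.status);
  }
  throw new Error('Result not ready after ' + maxAttempts + ' attempts');
}
```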
Code Samples
The PDF4me Extract Text from Word REST API provides code samples in multiple programming languages. Choose the language that best fits your development environment:
- C#
- Java
- JavaScript
- Python
- Salesforce
- n8n
- Google Script
- AWS Lambda
Google Script Sample
Google Apps Script implementation for Google Workspace integration:
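The vendor's sample is not reproduced on this page, so what follows is only a hedged sketch of how such an integration might look, not PDF4me's published code. `UrlFetchApp`, `DriveApp`, `Utilities`, and `PropertiesService` are Apps Script built-ins. The function names, the fixed option values, and the `PDF4ME_API_KEY` script property (assumed to hold the Base64-encoded key) are hypothetical.

```javascript
// Google Apps Script sketch; the built-in services used below only exist
// inside the Apps Script runtime.
const PDF4ME_URL = 'https://api.pdf4me.com/api/v2/ExtractTextFromWord';

// Pure helper: assemble the request body from Base64 content.
function buildWordTextPayload(docName, base64Content) {
  return {
    docName: docName,
    docContent: base64Content,
    StartPageNumber: 1,
    EndPageNumber: 3,
    RemoveComments: true,
    RemoveHeaderFooter: true,
    AcceptChanges: true,
  };
}

// Hypothetical entry point: fileId is the Drive ID of a Word file.
function extractTextFromDriveFile(fileId) {
  const file = DriveApp.getFileById(fileId);
  const base64 = Utilities.base64Encode(file.getBlob().getBytes());
  // Assumption: the Base64-encoded API key is stored as a script property.
  const key = PropertiesService.getScriptProperties().getProperty('PDF4ME_API_KEY');
  const response = UrlFetchApp.fetch(PDF4ME_URL, {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Basic ' + key },
    payload: JSON.stringify(buildWordTextPayload(file.getName(), base64)),
    muteHttpExceptions: true, // inspect error responses instead of throwing
  });
  if (response.getResponseCode() !== 200) {
    throw new Error('PDF4me returned ' + response.getResponseCode() + ': ' + response.getContentText());
  }
  return response.getContentText();
}
```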
Word Text Extraction Features
Document Processing
- Word Format Support: Full support for .doc and .docx Microsoft Word document formats
- Format Preservation: Maintains text formatting and structure during extraction
- Complex Documents: Handles complex Word documents with tables, images, and embedded content
- Professional Processing: High-quality text extraction with accurate content preservation
- Flexible Input: Support for various Word document versions and formats
Page Processing
- Page Range Selection: Extract text from specific page ranges with start and end page numbers
- Flexible Targeting: Process individual pages, page ranges, or entire documents
- Custom Processing: Support for any page combination with precise control
- Batch Processing: Handle multiple pages efficiently in a single API call
- Professional Layout: Consistent text extraction across all target pages
Content Filtering
- Comment Removal: Option to exclude comments and annotations from extracted text
- Header/Footer Filtering: Remove headers and footers for clean content extraction
- Change Tracking: Accept or reject tracked changes during text extraction
- Content Cleanup: Advanced filtering options for specific text extraction needs
- Professional Results: Clean, formatted text output with accurate content preservation
Industry Use Cases & Applications
- Business & Enterprise
- Legal & Professional Services
- Education & Research
- Government & Compliance
- Technology & Development
Business & Enterprise Use Cases
- Document Analysis: Extract text content from Word documents for analysis and processing
- Content Management: Extract text from Word documents for content management systems
- Report Processing: Extract text from Word reports for automated processing and analysis
- Business Intelligence: Extract text from Word business documents for data analysis
Legal & Professional Services Use Cases
- Legal Document Processing: Extract text from legal Word documents for case management
- Contract Analysis: Extract text from contracts and legal documents
- Case Management: Extract text from legal documents for case management systems
- Legal Research: Extract text from legal documents for research and analysis
Education & Research Use Cases
- Academic Research: Extract text from academic Word documents for research and analysis
- Research Documents: Extract text from research papers and academic documents
- Educational Content: Extract text from educational Word documents
- Academic Analysis: Extract text from academic documents for analysis
Government & Compliance Use Cases
- Compliance Monitoring: Extract text from Word compliance documents for monitoring and auditing
- Regulatory Documentation: Extract text from regulatory Word documents
- Government Reports: Extract text from government Word documents
- Public Records: Extract text from public record Word documents
Technology & Development Use Cases
- Data Migration: Convert Word document content to other formats for data migration
- Content Processing: Extract text from Word documents for system integration
- API Integration: Extract text from Word documents for API processing
- Data Extraction: Extract text from Word documents for data processing