
Extract Text from Word - Content Parser API

PDF4me Extract Text from Word extracts text content from Word documents through a REST API. The document is sent in the request body as a Base64-encoded string. Options cover page range selection, comment removal, header/footer filtering, and acceptance of tracked changes, which makes the service suited to document processing and content analysis workflows.

Authenticating Your API Request

To access the PDF4me REST API, every request must include proper authentication credentials. Authentication ensures secure communication and validates your identity as an authorized user of the REST API.

Key Features

  • Word Document Support: Extract text from Microsoft Word documents (.doc, .docx)
  • Page Range Selection: Process specific page ranges or entire documents with custom start and end page numbers
  • Content Filtering: Remove comments, headers, and footers, and accept or reject tracked changes, for clean text extraction
  • Base64 Encoding: File content is transmitted as a Base64-encoded string in the JSON request body
  • Simple API Integration: RESTful API designed for automated Word document processing workflows

REST API Endpoint

The PDF4me REST API uses standard HTTP methods to interact with resources. All Word text extraction operations are performed through a single endpoint:

  • Method: POST
  • Endpoint: /api/v2/ExtractTextFromWord
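
As a rough illustration, a request to this endpoint might be assembled in Python with only the standard library. The base URL `https://api.pdf4me.com` and the helper name `build_request` are assumptions made for this sketch and are not stated on this page; only the path and the POST method come from the text above. The function builds the request without sending it.

```python
import json
import urllib.request

# Assumed host: this page documents only the path, not the base URL.
BASE_URL = "https://api.pdf4me.com"
ENDPOINT = "/api/v2/ExtractTextFromWord"


def build_request(payload: dict, authorization: str,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the extraction endpoint.

    `authorization` is the full Authorization header value,
    for example "Basic <Base64-encoded API key>".
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": authorization,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request.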

REST API Parameters

Complete list of parameters for the Extract Text from Word REST API. Parameters are organized by category for better understanding and implementation.

Important: Parameters marked with an asterisk (*) are required and must be provided for the API to function correctly.

Required Parameters

| Parameter | Type | Description | Example |
|---|---|---|---|
| docName* | String | Source Word document file name, with or without the .doc/.docx extension | output |
| docContent* | Base64 (String) | The content of the input Word document in Base64 format | JVBERi... |
| StartPageNumber* | Integer | Starting page number for text extraction | 1 |
| EndPageNumber* | Integer | Ending page number for text extraction | 3 |
| RemoveComments* | Boolean | Choose whether to remove comments from extracted text: true – Remove comments, false – Include comments | true |
| RemoveHeaderFooter* | Boolean | Choose whether to remove headers and footers: true – Remove headers/footers, false – Include headers/footers | true |
| AcceptChanges* | Boolean | Choose whether to accept tracked changes: true – Accept changes, false – Reject changes | true |

Optional Parameters

| Parameter | Type | Description | Example |
|---|---|---|---|
| async | Boolean | Enable asynchronous processing. When true, the API returns 202 Accepted with a Location header for polling the result | false |

Output

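The response of the PDF4me Extract Text from Word REST API depends on the processing mode. In synchronous mode, the API returns the extracted text as a JSON response or as a text file. In asynchronous mode, it returns 202 Accepted with a Location header for polling the result.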

Synchronous Processing (Default)

When async is false or not provided, the API returns the extracted text immediately.

Status Code: 200 OK

Response Format:

{
"extractedText": "Extracted text content from Word document pages 1 to 3...",
"fileName": "output.txt"
}

Or as a text file:

Extracted text content from Word document pages 1 to 3...

The response contains the extracted text content from the specified page range.
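
Since a 200 OK response may carry either the JSON object or the plain text shown above, a client has to handle both. The sketch below is one possible approach in Python. It assumes that the response's Content-Type header distinguishes the two forms and that the text is UTF-8 encoded; neither is confirmed on this page. The function name `extract_text` is hypothetical.

```python
import json


def extract_text(body: bytes, content_type: str = "") -> str:
    """Return the extracted text from a 200 OK response body.

    Assumption: a JSON response announces itself through its Content-Type;
    anything else is treated as the plain-text file variant.
    """
    if "json" in content_type.lower():
        # JSON variant: the text is in the "extractedText" field.
        return json.loads(body)["extractedText"]
    # Text-file variant: the body is the extracted text itself (UTF-8 assumed).
    return body.decode("utf-8")
```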

Request Example

Content-Type: application/json
Authorization: Basic YOUR_BASE64_ENCODED_API_KEY

Note:

  • Get your API key from the PDF4me Dashboard
  • The API key must be Base64 encoded and prefixed with "Basic " in the Authorization header
  • Example: If your API key is abc123, encode it to Base64 and use Authorization: Basic YWJjMTIz
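
The encoding step described in the note can be sketched in Python as follows. The function name `build_authorization_header` is made up for this example, and `abc123` is the placeholder key from the note, not a real credential.

```python
import base64


def build_authorization_header(api_key: str) -> str:
    """Return the Authorization header value: "Basic " + Base64(api_key)."""
    encoded = base64.b64encode(api_key.encode("utf-8")).decode("ascii")
    return f"Basic {encoded}"


print(build_authorization_header("abc123"))  # Basic YWJjMTIz
```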

Payload

Basic Request:

{
"docContent": "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMiAwIFIKPj4KZW5kb2JqCjIgMCBvYmoKPDwKL1R5cGUgL1BhZ2VzCi9LaWRzIFszIDAgUl0KL0NvdW50IDEKPD4KZW5kb2JqCjMgMCBvYmoKPDwKL1R5cGUgL1BhZ2UKL1BhcmVudCAyIDAgUgovTWVkaWFCb3ggWzAgMCA2MTIgNzkyXQovUmVzb3VyY2VzIDw8Ci9Gb250IDw8Ci9GMSA0IDAgUgo+Pgo+PgovQ29udGVudHMgNSAwIFIKPj4KZW5kb2JqCjQgMCBvYmoKPDwKL1R5cGUgL0ZvbnQKL1N1YnR5cGUgL1R5cGUxCi9CYXNlRm9udCAvSGVsdmV0aWNhCj4+CmVuZG9iago1IDAgb2JqCjw8Ci9MZW5ndGggNDQKPj4Kc3RyZWFtCkJUCi9GMSAxMiBUZgoxMDAgNzAwIFRkCihIZWxsbyBXb3JsZCkgVGoKRVQKZW5kc3RyZWFtCmVuZG9iagp4cmVmCjAgNgowMDAwMDAwMDAwIDY1NTM1IGYgCjAwMDAwMDAwMDkgMDAwMDAgbiAKMDAwMDAwMDA1NCAwMDAwMCBuIAowMDAwMDAwMTAxIDAwMDAwIG4gCjAwMDAwMDAxNzAgMDAwMDAgbiAKMDAwMDAwMDI0NCAwMDAwMCBuIAp0cmFpbGVyCjw8Ci9TaXplIDYKL1Jvb3QgMSAwIFIKPj4Kc3RhcnR4cmVmCjM0MQolJUVPRg==",
"docName": "output",
"StartPageNumber": 1,
"EndPageNumber": 3,
"RemoveComments": true,
"RemoveHeaderFooter": true,
"AcceptChanges": true
}
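
A payload like the one above could be produced from a file's bytes with a small Python helper. This is a sketch rather than an official client: the function name `build_payload` and its defaults are hypothetical, and the client-side page check assumes 1-based page numbers, which the example suggests but this page does not state.

```python
import base64


def build_payload(doc_bytes: bytes, doc_name: str,
                  start_page: int, end_page: int,
                  remove_comments: bool = True,
                  remove_header_footer: bool = True,
                  accept_changes: bool = True) -> dict:
    """Build the JSON-serializable request body for text extraction."""
    # Sanity check before calling the API (1-based pages are an assumption).
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    return {
        # The document bytes travel as a Base64 string inside the JSON body.
        "docContent": base64.b64encode(doc_bytes).decode("ascii"),
        "docName": doc_name,
        "StartPageNumber": start_page,
        "EndPageNumber": end_page,
        "RemoveComments": remove_comments,
        "RemoveHeaderFooter": remove_header_footer,
        "AcceptChanges": accept_changes,
    }
```

In practice `doc_bytes` would come from reading the Word file in binary mode, for example `open(path, "rb").read()`.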

With Asynchronous Processing:

{
"docContent": "JVBERi0xLjQKJeLjz9MK...",
"docName": "output",
"StartPageNumber": 1,
"EndPageNumber": 3,
"RemoveComments": true,
"RemoveHeaderFooter": true,
"AcceptChanges": true,
"async": true
}
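
With `async` set to true, the client receives 202 Accepted and a Location header and then has to poll. The Python sketch below shows one way to structure that loop. It assumes the Location URL keeps answering 202 while the job is running and 200 once the result is ready; this page does not describe the polling responses, so treat that as a guess. The `fetch` callable, which performs the HTTP GET and returns a status code and body, is a hypothetical hook so the loop stays independent of any HTTP library.

```python
import time
from typing import Callable, Tuple


def poll_for_result(location: str,
                    fetch: Callable[[str], Tuple[int, bytes]],
                    interval_seconds: float = 2.0,
                    max_attempts: int = 30,
                    sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Poll `location` until it yields 200 OK, then return the body.

    Assumed protocol: 202 means "still processing", 200 means "done".
    Any other status is treated as an error.
    """
    for _ in range(max_attempts):
        status, body = fetch(location)
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"unexpected status {status} while polling")
        sleep(interval_seconds)
    raise TimeoutError("result not ready after polling")
```

The interval and attempt limit are arbitrary defaults chosen for the sketch; tune them to the size of the documents being processed.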

Code Samples

The PDF4me Extract Text from Word REST API provides code samples in multiple programming languages. Choose the language that best fits your development environment:

C# (CSharp) Sample

Complete C# implementation for extracting text from Word documents:

Word Text Extraction Features

Document Processing

  • Word Format Support: Full support for .doc and .docx Microsoft Word document formats
  • Format Preservation: Maintains text formatting and structure during extraction
  • Complex Documents: Handles complex Word documents with tables, images, and embedded content
  • Professional Processing: High-quality text extraction with accurate content preservation
  • Flexible Input: Support for various Word document versions and formats

Page Processing

  • Page Range Selection: Extract text from specific page ranges with start and end page numbers
  • Flexible Targeting: Process individual pages, page ranges, or entire documents
  • Custom Processing: Support for any page combination with precise control
  • Batch Processing: Handle multiple pages efficiently in a single API call
  • Professional Layout: Consistent text extraction across all target pages

Content Filtering

  • Comment Removal: Option to exclude comments and annotations from extracted text
  • Header/Footer Filtering: Remove headers and footers for clean content extraction
  • Change Tracking: Accept or reject tracked changes during text extraction
  • Content Cleanup: Advanced filtering options for specific text extraction needs
  • Professional Results: Clean, formatted text output with accurate content preservation

Industry Use Cases & Applications

Business & Enterprise Use Cases

  • Document Analysis: Extract text content from Word documents for analysis and processing
  • Content Management: Extract text from Word documents for content management systems
  • Report Processing: Extract text from Word reports for automated processing and analysis
  • Business Intelligence: Extract text from Word business documents for data analysis
