Using Unstructured Data with Voicebox
A comprehensive guide for developers to integrate and manage unstructured data with Stardog Voicebox using the BITES (Blob Indexing and Text Enrichment with Semantics) system.
Page Contents
- Overview
- Quick Start
- Prerequisites and Requirements
- API Reference
- Document Indexing Pipeline
- Job Configuration
- Querying Indexed Documents
- Deployment
- Deployment Architecture
- Prerequisites
- Step 1: Install Spark Operator
- Step 2: Configure Docker Image Access
- Step 3: Configure vbx_bites_kube_config.yaml
- Step 4: Configure voicebox-service
- Step 5: Configure RBAC
- Step 6: Configure Networking
- Step 7: Deploy voicebox-service
- Cluster Sizing Recommendations
- Logging
- Troubleshooting
- Docker Image Availability
- Additional Resources
Overview
BITES (Blob Indexing and Text Enrichment with Semantics) is Stardog Voicebox’s unstructured data support system. It enables ingestion of documents from various cloud storage providers and local sources, allowing users to query both structured and unstructured data through Voicebox’s conversational AI interface.
What is BITES?
BITES provides an API-first approach to indexing and querying unstructured documents alongside your structured data in Stardog. The system leverages Apache Spark for distributed processing and integrates with your existing Kubernetes infrastructure.
Supported Data Sources
- Google Drive - Cloud document storage
- Microsoft OneDrive - Personal and business cloud storage
- Microsoft SharePoint - Enterprise document management (Document Library only)
- Dropbox - Cloud file storage
- Amazon S3 - Object storage
- Local Storage - File system accessible from Kubernetes environment
Supported Document Formats
- Microsoft Word (DOCX)
- PDF
The system currently supports parsing and indexing of textual and tabular data. Image parsing within documents is planned for a future release.
Key Capabilities
- Data Ingestion: Automated ingestion from multiple data source types
- Unified Querying: Query both structured and unstructured data through a single Voicebox interface
- API-First Design: All functionality accessible through Launchpad’s public APIs
- Distributed Processing: Spark-based job execution in your Kubernetes environment
- Job Management: Full lifecycle management including status monitoring and cancellation
- Vector Indexing: Document chunks indexed in Stardog’s vector store for semantic search
- Knowledge Graph Creation: Optional extraction of entities and relationships from documents to build knowledge graphs
Beta Features: Information extraction and knowledge graph creation are currently in Beta. These features enable extraction of structured entities and relationships from unstructured text.
Architecture Overview

System Flow:
- User initiates data ingestion and indexing via Launchpad’s public APIs
- Voicebox service creates and submits a Spark job to the Kubernetes cluster
- Spark job processes documents: reads from source, parses content, chunks text
- Processed chunks are indexed in Stardog’s vector store
- (Optional) If information extraction is enabled, entities and relationships are extracted and stored as a knowledge graph
- User queries Voicebox, which retrieves answers from both structured data and indexed documents
Quick Start
This section provides a fast-track setup example for indexing Google Drive documents.
Prerequisites Checklist
- Kubernetes cluster with Spark Operator installed
- Voicebox service and voicebox-bites containers running
- Launchpad API key generated
- Data source credentials configured
- Stardog connection configured
5-Minute Setup (Google Drive)
- Configure Google Drive (see Google Drive Configuration)
- Get API Key from Launchpad → “Manage API Keys”
- Base64 encode your Google service account JSON
- Call the API (see example below)
- Monitor job using the job_id returned
# Example: Initiate indexing job
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"directory": "your-folder-id",
"credentials": "BASE64_ENCODED_SERVICE_ACCOUNT_JSON",
"job_name": "my-indexing-job",
"job_config": {
"document_store_type": "google_drive",
"enhance_content": false,
"extract_information": false
}
}'
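To monitor the job (step 5), poll the status endpoint with the job_id returned by the call above; for example:
# Example: Check job status (replace YOUR_JOB_ID with the job_id from the response)
curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/YOUR_JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"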
Prerequisites and Requirements
Data Source Configuration
Before using BITES APIs, you must configure access credentials for your data sources. Each provider requires specific setup steps.
Google Drive
Setup Steps:
- Navigate to Google Cloud Console
- Create a new project or select an existing one
- Enable the Google Drive API
- Go to “APIs & Services” → “OAuth consent screen” and configure
- Go to “Credentials” → “Create Credentials” → “Service Account”
- Create a Service Account with appropriate permissions
- In the Service Account details, go to the “Keys” tab
- Click “Add Key” → “Create new key” → “JSON”
- Download the JSON key file
Required API Scope:

Add the following scope: https://www.googleapis.com/auth/drive.readonly
IAM Configuration:

Service Account JSON Structure:
{
"type": "service_account",
"project_id": "your-project-id",
"private_key_id": "your-private-key-id",
"private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n",
"client_email": "your-service-account@your-project.iam.gserviceaccount.com",
"client_id": "your-client-id",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-service-account",
"universe_domain": "googleapis.com"
}
Security Note: Store service account credentials securely. Never commit them to version control. Use environment variables or secret management systems.
Sharing Documents:
To allow the service account to access specific folders:
- Copy the client_email from the service account JSON
- Share the Google Drive folder with this email address
- Grant appropriate permissions (Viewer for read-only access)
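One way to produce the base64 value for the credentials parameter is shown below; the file name service-account.json is a placeholder, and jq is optional (used only to pretty-print the decoded output):
# Encode the downloaded key file as a single line (GNU coreutils)
cat service-account.json | base64 -w 0
# Verify the encoding round-trips
echo "YOUR_BASE64_STRING" | base64 -d | jq .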
OneDrive
Setup Steps:
- Go to Azure Portal
- Navigate to “Azure Active Directory” → “App registrations”
- Click “New registration”
- Provide a name and configure redirect URIs if needed
- After registration, note the Application (client) ID and Directory (tenant) ID from the overview page
- Go to “Certificates & Secrets” → “Client secrets” → “New client secret”
- Copy the secret value (not the secret ID)
Required Microsoft Graph Permissions:
- Files.Read
- Files.ReadAll (Delegated)
- Files.ReadAll (Application)
- offline_access (Delegated)
- openid (Delegated)
- Sites.ReadAll (Delegated)
- User.Read (Delegated)

Grant Admin Consent:
After adding permissions, click “Grant admin consent” to approve them for your organization.
Secret Configuration:

Credentials JSON Structure:
{
"tenant_id": "your-tenant-id",
"client_id": "your-client-id",
"client_secret": "your-client-secret"
}
Microsoft SharePoint
Only Document Library is currently supported for SharePoint.
Setup Steps:
- Create an Entra ID (Azure Active Directory) on the Azure Portal
- Create and register an application in Entra ID
- Note the Application (client) ID from the overview page
- Go to “Certificates & Secrets” and create a new client secret
- Copy the secret value
Required Microsoft Graph Permissions:
- Files.Read
- Files.ReadAll (Delegated)
- Sites.ReadAll (Delegated)
- Sites.ReadAll (Application)
- User.Read (Delegated)
Required SharePoint Permissions:
- AllSites.Read (Delegated)
- MyFiles.Read (Delegated)
- Sites.Read.All (Application)
- Sites.Select.All (Application)
- User.Read.All (Application)

Some permissions require M365 administrator approval. Contact your administrator to grant these permissions.
Credentials JSON Structure:
{
"tenant_id": "your-tenant-id",
"client_id": "your-client-id",
"client_secret": "your-client-secret"
}
Additional Required Information:
When calling the indexing API for SharePoint, you must also provide:
- host_name: Your SharePoint host (e.g., “yourcompany.sharepoint.com”)
- site_id: The SharePoint site ID
- library_name: The document library name
Dropbox
Setup Steps:
- Go to Dropbox App Console
- Click “Create app”
- Choose “Scoped access” and “Full Dropbox” or “App folder”
- Provide an app name
- Go to the “Permissions” tab and enable the required scopes: files.metadata.read and files.content.read
- Note your App key and App secret
OAuth Authorization Flow:
- In your browser, visit:
https://www.dropbox.com/oauth2/authorize?client_id=<APP_KEY>&token_access_type=offline&response_type=code
Replace <APP_KEY> with your actual app key.
- Log in to Dropbox and approve the app
- Copy the authorization code from the redirect URL
- Exchange the authorization code for tokens:
curl -X POST https://api.dropboxapi.com/oauth2/token \
  -d code=<AUTHORIZATION_CODE> \
  -d grant_type=authorization_code \
  -d client_id=<APP_KEY> \
  -d client_secret=<APP_SECRET>
- The response contains:
- access_token: Current access token
- refresh_token: Long-lived token for obtaining new access tokens
Credentials JSON Structure:
{
"access_token": "your-current-access-token",
"refresh_token": "your-refresh-token",
"client_id": "your-app-key",
"client_secret": "your-app-secret"
}
The BITES connector automatically handles token refresh. If the access token expires, it uses the refresh token to obtain a new one.
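The refresh performed by the connector corresponds to the standard Dropbox refresh-token grant; you can run it manually to confirm the credentials work before starting a job:
# Exchange the refresh token for a new access token
curl -X POST https://api.dropboxapi.com/oauth2/token \
  -d grant_type=refresh_token \
  -d refresh_token=<REFRESH_TOKEN> \
  -d client_id=<APP_KEY> \
  -d client_secret=<APP_SECRET>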
Amazon S3
BITES supports two authentication options for S3: IAM roles (recommended) and access keys.
Option 1: IAM Roles (Recommended for AWS-hosted applications)
- Go to AWS Management Console → IAM
- Click “Roles” → “Create role”
- Select the service that will assume this role (e.g., EC2, EKS)
- Attach permissions policies: AmazonS3ReadOnlyAccess (for read-only access), or create a custom policy with minimal required permissions
- Review and create the role
- Attach this role to your Kubernetes nodes or pods
Option 2: Access Key and Secret Key
- Go to AWS Management Console → IAM
- Click “Users” → “Add user”
- Provide a username and select “Programmatic access”
- Attach permissions policies (e.g., AmazonS3ReadOnlyAccess)
- Complete user creation
- Download the CSV file containing the Access Key ID and Secret Access Key
Security Best Practice: Use IAM roles when possible. If using access keys, rotate them regularly and store them securely.
Required S3 Permissions:
Your IAM role or user must have the following permissions on the target bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucketVersions"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}
Bucket Configuration:
- Open AWS Console and navigate to S3
- Select your bucket
- Go to “Permissions” tab
- Ensure the IAM role/user has the required permissions listed above
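As a quick sanity check before starting a job, you can list the bucket with the AWS CLI (assuming it is installed and configured with the same credentials or role):
# Verify list access to the target bucket
aws s3 ls s3://your-bucket-name/ --region us-east-1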
Credentials JSON Structure:
{
"aws_access_key_id": "your-access-key",
"aws_secret_access_key": "your-secret-key",
"region_name": "us-east-1",
"use_iam_role": false
}
For IAM Role Authentication:
{
"region_name": "us-east-1",
"use_iam_role": true
}
Additional Required Information:
When calling the indexing API for S3, you must also provide:
- bucket: The S3 bucket name, passed in the extra_args parameter
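A minimal S3 indexing request might look like the sketch below; the directory, bucket, and prefix values are placeholders, and the credentials string is the base64-encoded JSON shown above:
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "your-directory",
    "credentials": "BASE64_ENCODED_S3_CREDENTIALS_JSON",
    "job_name": "index-s3-docs",
    "extra_args": {"bucket": "my-documents-bucket", "prefix": "path/within/bucket"},
    "job_config": {
      "document_store_type": "s3",
      "enhance_content": false,
      "extract_information": false
    }
  }'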
Local Storage
Local storage allows indexing of files directly accessible from the Kubernetes environment.
Requirements:
- The directory must be accessible from the Spark executors
- Use Kubernetes volumes (PersistentVolumes, ConfigMaps, or mounted storage)
- Ensure appropriate read permissions
Credentials:
No credentials are required for local storage. Pass an empty JSON object (base64 encoded):
{}
Directory Path:
Provide the absolute path to the directory within the container filesystem (e.g., /mnt/data/documents).
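Since the credentials value is just the empty object for local storage, encoding it is a one-liner:
# Base64-encode the empty credentials object
echo -n '{}' | base64
# Output: e30=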
API Key Generation
All BITES APIs require authentication using a Launchpad API key.
Steps to Generate API Key:
- Log in to Launchpad
- Navigate to “Manage API Keys”
- Click “Create API Key”
- Provide a name and select the database
- Copy and securely store the API key
API keys provide full access to your Voicebox instance. Store them securely and never expose them in client-side code or public repositories.
Using the API Key:
Include the API key in the Authorization header of all API requests:
Authorization: Bearer YOUR_API_KEY
Stardog Connection Configuration
BITES jobs need to connect to Stardog to index documents. Two authentication options are supported.
Option 1: Stardog Username and Password
Prerequisites:
- Configure Stardog to generate authentication tokens
- Ensure token generation is enabled in Stardog configuration
How it Works:
When you initiate a job, Stardog generates a token using the provided credentials and passes it to the Spark job.
Configuration:
Provide Stardog credentials when creating the connection in Launchpad.
Option 2: SSO Authentication (Azure/Okta/Ping)
Prerequisites:
- SSO provider configured and integrated with Stardog
- Valid refresh token from SSO provider
- SSO provider client ID
How it Works:
When you initiate a job, the system calls the SSO provider to fetch an access token using the refresh token. This access token is then passed to the Spark job.
Configuration:
When calling the indexing API, provide:
- sso_provider_client_id: Your SSO provider’s client ID
- refresh_token: A valid refresh token from your SSO provider
Token Expiry Recommendation: Set token expiry based on the expected job duration. For large indexing jobs, we recommend setting the token expiry to 30 days to ensure the job completes without authentication failures.
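For SSO, these two fields are passed alongside the usual job parameters; a minimal sketch (all values are placeholders):
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "your-folder-id",
    "credentials": "BASE64_ENCODED_CREDENTIALS_JSON",
    "job_name": "sso-indexing-job",
    "sso_provider_client_id": "your-sso-client-id",
    "refresh_token": "your-sso-refresh-token",
    "job_config": {
      "document_store_type": "google_drive",
      "enhance_content": false,
      "extract_information": false
    }
  }'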
LLM Provider Configuration
If you plan to use content enhancement (enhance_content: true) or information extraction (extract_information: true), you must configure environment variables for your LLM provider. These variables must be available on both the driver and executor nodes.
Required Environment Variables by Provider
AWS Bedrock
env:
- name: AWS_ACCESS_KEY_ID
value: "your-aws-access-key-id"
- name: AWS_SECRET_ACCESS_KEY
value: "your-aws-secret-access-key"
- name: AWS_REGION
value: "us-east-1"
For AWS Bedrock, ensure your IAM user/role has bedrock:InvokeModel permission for the models you plan to use.
Fireworks AI
env:
- name: FIREWORKS_API_KEY
value: "your-fireworks-api-key"
OpenAI
env:
- name: OPENAI_API_KEY
value: "your-openai-api-key"
Azure OpenAI
env:
- name: AZURE_OPENAI_API_KEY
value: "your-azure-openai-key"
- name: AZURE_OPENAI_ENDPOINT
value: "https://your-resource.openai.azure.com/"
Configuring Environment Variables in Kubernetes
Option 1: Kubernetes Secrets (Recommended). Store the provider credentials in a Kubernetes Secret, then reference the secret in vbx_bites_kube_config.yaml.
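For example, you might store the provider key in a Secret along these lines (the secret and key names are placeholders):
# Create a Secret holding the LLM provider key
kubectl create secret generic llm-provider-credentials \
  --from-literal=OPENAI_API_KEY=your-openai-api-key \
  --namespace=default
In the driver and executor env sections, reference the secret with a standard valueFrom/secretKeyRef entry (exact support depends on your Spark Operator version) rather than embedding the literal value.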
Option 2: Direct Environment Variables (Not Recommended for Production)
Deployment Prerequisites
Before running indexing jobs, ensure your Kubernetes environment is properly configured.
Required Components:
- Kubernetes cluster (version 1.19+)
- Spark Operator installed in the cluster
- voicebox-bites Docker image accessible
- voicebox-service running and configured
- Network connectivity between Launchpad, voicebox-service, and voicebox-bites
See the Deployment section for detailed setup instructions.
API Reference
All BITES functionality is accessed through RESTful APIs. This section provides complete API documentation with examples.
Authentication
All API requests must include an Authorization header with your Launchpad API key:
Authorization: Bearer YOUR_API_KEY
Base URL
https://your-launchpad-url/api/v1/voicebox/bites
Replace your-launchpad-url with your actual Launchpad instance URL.
API Endpoints Overview
| Endpoint | Method | Description |
|---|---|---|
| /jobs | POST | Initiate a new indexing job |
| /jobs/{job_id} | GET | Get the status of a job |
| /jobs/{job_id}/cancel | POST | Cancel a running job |
Initiate Indexing Job
Creates and starts a new indexing job in the Spark environment.
Endpoint
POST /api/v1/voicebox/bites/jobs
Request Headers
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
Request Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| directory | string | Yes | Directory location or ID. For Google Drive: folder ID; OneDrive: folder path; Local: absolute path |
| credentials | string | Yes | Base64-encoded JSON containing data source credentials. See Data Source Configuration for format |
| job_name | string | Yes | Unique name for the job (used for tracking and management) |
| job_namespace | string | No | Kubernetes namespace for the job. Defaults to namespace in vbx_bites_kube_config.yaml |
| batch_size | integer | No | Number of chunks to commit at once. Default: 1000. Increase for better performance, decrease if memory constrained |
| job_config | object | Yes | Configuration controlling scalability and functionality. See Job Configuration |
| sso_provider_client_id | string | Conditional | Required for SSO authentication. SSO provider’s client ID |
| refresh_token | string | Conditional | Required for SSO authentication. Valid refresh token from SSO provider |
| extra_args | object | No | Additional arguments specific to data source type. See below |
Extra Args by Data Source:
| Data Source | Extra Args Required | Example |
|---|---|---|
| OneDrive | one_drive_id | {"one_drive_id": "b!drive_id"} |
| SharePoint | host_name, site_id, library_name | {"host_name": "company.sharepoint.com", "site_id": "site-id", "library_name": "Documents"} |
| S3 | bucket, prefix | {"bucket": "my-documents-bucket", "prefix": "S3 path"} |
| Google Drive | None | - |
| Dropbox | None | - |
| Local | None | - |
Credentials Format by Data Source
Before passing to the API, you must base64-encode the JSON.
Google Drive:
{
"type": "service_account",
"project_id": "your-project",
"private_key_id": "key-id",
"private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"client_email": "service-account@project.iam.gserviceaccount.com",
"client_id": "client-id",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/...",
"universe_domain": "googleapis.com"
}
OneDrive / SharePoint:
{
"tenant_id": "your-tenant-id",
"client_id": "your-client-id",
"client_secret": "your-client-secret"
}
Dropbox:
{
"access_token": "current-access-token",
"refresh_token": "refresh-token",
"client_id": "app-key",
"client_secret": "app-secret"
}
S3:
{
"aws_access_key_id": "your-access-key",
"aws_secret_access_key": "your-secret-key",
"region_name": "us-east-1",
"use_iam_role": false
}
Local Storage:
{}
Minimal Request Example
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"directory": "1A2B3C4D5E6F7G8H9I",
"credentials": "eyJ0eXBlIjoic2VydmljZV9hY2NvdW50IiwicHJvamVjdF9pZCI6InlvdXItcHJvamVjdCJ9",
"job_name": "index-google-drive-docs",
"job_config": {
"document_store_type": "google_drive",
"enhance_content": false,
"extract_information": false
}
}'
Complete Request Example with All Options
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"directory": "b!AbCdEf123456",
"credentials": "eyJ0ZW5hbnRfaWQiOiJ5b3VyLXRlbmFudC1pZCIsImNsaWVudF9pZCI6InlvdXItY2xpZW50LWlkIiwiY2xpZW50X3NlY3JldCI6InlvdXItc2VjcmV0In0=",
"job_name": "onedrive-quarterly-reports",
"job_namespace": "voicebox-production",
"batch_size": 2000,
"sso_provider_client_id": "your-sso-client-id",
"refresh_token": "your-refresh-token",
"extra_args": {
"one_drive_id": "b!AbCdEf123456"
},
"job_config": {
"list_file_parallelism": 10,
"content_reader_parallelism": 20,
"content_indexer_parallelism": 10,
"document_store_type": "onedrive",
"enhance_content": true,
"extract_information": true,
"store_list_file_config": {
"page_size": 100,
"recursive": true,
"document_types": ["document", "pdf"]
},
"store_content_loader_config": {
"num_retries": 3,
"store_loader_kwargs": {}
},
"document_loader_config": {
"pdf": {
"chunk_size": 1000,
"chunking_enabled": true,
"chunk_separator": ["\\n\\n", "\\n", " ", ""],
"chunk_overlap": 200
},
"document": {
"chunk_size": 1000,
"chunking_enabled": true,
"chunk_separator": ["\\n\\n", "\\n", " ", ""],
"chunk_overlap": 200
}
},
"information_extraction_config": [
{
"task_type": "information_extraction",
"kwargs": {},
"llm_config": {
"max_tokens": 8192,
"temperature": 0.0,
"repetition_penalty": 1.0,
"top_p": 0.7,
"top_k": 50,
"stop": ["---", "</output_format>"],
"llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
"llm_provider": "bedrock",
"context_window": 4000
},
"num_retries": 3,
"prompt_version": "v4",
"extractor_type": "llm",
"query_timeout": 50000
}
]
}
}'
Response
Success Response (HTTP 200):
{
"job_id": "spark-app-1234567890-abcdef",
"error": null
}
Error Response (HTTP 400/500):
{
"job_id": null,
"error": "Failed to create job: Invalid credentials format"
}
Response Fields
| Field | Type | Description |
|---|---|---|
| job_id | string or null | Unique identifier for the created job. Use this to check status or cancel the job |
| error | string or null | Error message if job creation failed, null otherwise |
Get Job Status
Retrieves the current status of an indexing job.
Endpoint
GET /api/v1/voicebox/bites/jobs/{job_id}
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Yes | Job ID returned when the job was created |
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_namespace | string | No | Kubernetes namespace of the job. Defaults to namespace in vbx_bites_kube_config.yaml |
Request Example
curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/spark-app-1234567890-abcdef" \
-H "Authorization: Bearer YOUR_API_KEY"
Response
Success Response (HTTP 200):
{
"status_code": "RUNNING",
"status": "Job is processing documents. Completed 45 of 100 files."
}
Job Not Found (HTTP 404):
{
"status_code": "UNKNOWN",
"status": "Job not found"
}
Response Fields
| Field | Type | Description |
|---|---|---|
| status_code | string | Current state of the job. See status codes below |
| status | string | Human-readable status message with additional details |
Status Codes
| Status Code | Description |
|---|---|
| NEW | Job created but not yet submitted to Spark |
| SUBMITTED | Job submitted to Spark cluster, waiting for resources |
| RUNNING | Job actively processing documents |
| PENDING_RERUN | Job failed and is waiting to be retried |
| INVALIDATING | Job is being invalidated |
| SUCCEEDING | Job is in the process of completing successfully |
| COMPLETED | Job finished successfully |
| ERROR | Job encountered a non-recoverable error |
| FAILING | Job is in the process of failing |
| FAILED | Job failed |
| UNKNOWN | Job status cannot be determined (job may not exist) |
Polling Recommendations
- Poll every 10-30 seconds for jobs expected to complete quickly
- Poll every 1-5 minutes for long-running jobs
- Stop polling when status is COMPLETED, FAILED, or ERROR
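A simple shell loop along these lines works for polling (jq is assumed; the URL and JOB_ID values are placeholders):
# Poll job status every 60 seconds until it reaches a terminal state
while true; do
  STATUS=$(curl -s "https://your-launchpad-url/api/v1/voicebox/bites/jobs/$JOB_ID" \
    -H "Authorization: Bearer YOUR_API_KEY" | jq -r '.status_code')
  echo "Job status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED|ERROR) break ;;
  esac
  sleep 60
done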
Cancel Job
Cancels a running or pending indexing job.
Endpoint
POST /api/v1/voicebox/bites/jobs/{job_id}/cancel
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Yes | Job ID of the job to cancel |
Request Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_name | string | Yes | Name of the job to cancel (must match the name used when creating the job) |
| job_namespace | string | No | Kubernetes namespace of the job. Defaults to namespace in vbx_bites_kube_config.yaml |
Request Example
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs/spark-app-1234567890-abcdef/cancel" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"job_name": "index-google-drive-docs"
}'
Response
Success Response (HTTP 200):
{
"success": true,
"error": null
}
Error Response (HTTP 400/500):
{
"success": false,
"error": "Job not found or already completed"
}
Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | True if the job was successfully canceled, false otherwise |
| error | string or null | Error message if cancellation failed, null otherwise |
Canceling a job may take a few moments. The Spark operator will gracefully terminate the running executors. Already indexed documents will remain in Stardog.
Document Indexing Pipeline
Understanding the indexing pipeline helps you configure jobs effectively and troubleshoot issues.

Pipeline Stages
1. List Directories
Purpose: Enumerate all directories within the specified location.
Configuration: Controlled by list_file_parallelism and recursive settings.
Output: List of directories to scan for files.
2. List Supported Files
Purpose: Identify all supported document types (PDF, DOCX) within the directories.
Configuration: Filtered by document_types in store_list_file_config.
Output: List of file paths/IDs with metadata (size, modification date, etc.).
3. Fetch Content and Metadata
Purpose: Download file content from the data source and extract metadata.
Configuration: Controlled by content_reader_parallelism. Retries configured via num_retries.
Metadata Captured:
- File name and path
- Creation and modification dates
- Author information (if available)
- File size
- MIME type
Output: Raw file content and associated metadata.
4. Parse and Chunk Content
Purpose: Extract text from documents and split into manageable chunks.
Parsing:
- PDF: Text and Table extraction from PDF files
- DOCX: Text and table extraction from Word documents
Chunking:
- Split text based on chunk_size and chunk_separator
- Apply chunk_overlap to preserve context between chunks
- Maintain metadata association with each chunk
Configuration: Controlled by document_loader_config.
Output: Array of text chunks with metadata.
5. Enhance Content (Optional)
Purpose: Enrich document content using LLM processing.
When to Use:
- Documents contain complex tables that need better semantic representation
- Specialized domain content that benefits from LLM enhancement
Process:
- Tables are processed by LLM to generate descriptive text
- Enhanced descriptions improve search relevance
Configuration: Set enhance_content: true in job_config.
Cost Consideration: This step makes additional LLM API calls, increasing processing time and cost.
Output: Enhanced chunks with improved semantic content.
6. Information Extraction (Optional)
Purpose: Extract structured entities and relationships to build a knowledge graph.
When to Use:
- You want to build a knowledge graph from unstructured documents
- Need to identify entities (people, organizations, locations) and their relationships
- Want to enable graph-based queries alongside vector search
Process:
- LLM analyzes each chunk to identify entities and relationships
- Extracts triples in RDF format (subject-predicate-object)
- Links entities across documents
Configuration: Set extract_information: true and configure information_extraction_config.
Output: RDF triples representing extracted knowledge.
Cost Consideration: This step makes additional LLM API calls per chunk, significantly increasing processing time and cost.
7. Index Chunks
Purpose: Store processed chunks and knowledge graph in Stardog.
Indexing Operations:
- Vector Indexing: Chunks are embedded and stored in Stardog’s vector store
- Metadata Indexing: File metadata stored for filtering and source attribution
- Knowledge Graph Storage: Extracted triples stored in specified graph (if information extraction enabled)
Configuration: Controlled by content_indexer_parallelism and batch_size.
Output: Indexed and searchable content in Stardog.
Pipeline Performance Considerations
Bottlenecks:
- Data Source API Limits: Google Drive, OneDrive have rate limits
- Network Bandwidth: Large files take time to download
- LLM Processing: Content enhancement and information extraction are slow
- Stardog Ingestion: High parallelism can overload Stardog
Optimization Tips:
- Start with conservative parallelism settings
- Monitor data source rate limits
- Use content enhancement and information extraction only when necessary
- Increase batch_size for better ingestion throughput
- Scale Stardog appropriately for your indexing load
Job Configuration
The job_config parameter controls both scalability and functionality of the indexing pipeline. This section provides detailed guidance on each configuration option.
Configuration Structure
{
"list_file_parallelism": <integer>,
"content_reader_parallelism": <integer>,
"content_indexer_parallelism": <integer>,
"document_store_type": "<string>",
"enhance_content": <boolean>,
"extract_information": <boolean>,
"store_list_file_config": { ... },
"store_content_loader_config": { ... },
"document_loader_config": { ... },
"information_extraction_config": [ ... ]
}
Scalability Configuration
These settings control the degree of parallelism in different pipeline stages.
list_file_parallelism
Purpose: Controls parallel fetching of file listings from subdirectories.
Type: Integer
Default: 5
When to Adjust:
- Increase if you have many subdirectories and fast data source API
- Decrease if hitting data source rate limits
- No effect if recursive: false
Recommended Values:
- Small dataset (<100 files): 5
- Medium dataset (100-1000 files): 10
- Large dataset (>1000 files): 15-20
Example:
{
"list_file_parallelism": 10
}
Higher values increase pressure on data source APIs (e.g., Google Drive API). Monitor for rate limit errors.
content_reader_parallelism
Purpose: Controls parallel downloading and parsing of document content.
Type: Integer
Default: 10
Impact: This is typically the most impactful scalability setting.
When to Adjust:
- Increase for large datasets with available compute resources
- Decrease if running out of memory or hitting data source rate limits
Recommended Values by Dataset Size:
- Small (<100 files): 10
- Medium (100-1000 files): 20-30
- Large (>1000 files): 30-50
Memory Considerations:
- Each parallel reader consumes memory for document content
- Large PDFs can consume significant memory
- Ensure Spark executors have sufficient memory:
(Average File Size) × (Parallelism) × 2
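For example, with an average file size of 5 MB and content_reader_parallelism of 30, this rule of thumb suggests budgeting roughly 5 MB × 30 × 2 = 300 MB of executor memory for in-flight document content alone, in addition to Spark's own overhead.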
Example:
{
"content_reader_parallelism": 30
}
content_indexer_parallelism
Purpose: Controls parallel writing of chunks to Stardog vector store.
Type: Integer
Default: 5
When to Adjust:
- Increase if Stardog can handle higher load and you want faster indexing
- Decrease if Stardog shows performance degradation or errors
Recommended Values:
- Default Stardog instance: 5
- Scaled Stardog instance: 10
Warning: Setting this too high can overwhelm Stardog, causing:
- Slow response times
- Connection timeouts
- Out of memory errors
Example:
{
"content_indexer_parallelism": 10
}
Monitor Stardog resource utilization (CPU, memory, disk I/O) when increasing this value. Add resources or reduce parallelism if needed.
Functional Configuration
These settings control the features and behavior of the indexing pipeline.
document_store_type
Purpose: Specifies the data source type.
Type: String
Required: Yes
Valid Values:
- google_drive
- onedrive
- sharepoint
- dropbox
- s3
- local
Example:
{
"document_store_type": "google_drive"
}
enhance_content
Purpose: Enables LLM-based content enhancement for better semantic understanding.
Type: Boolean
Default: false
When to Enable:
- Documents contain complex tables
- Search quality is more important than indexing speed
When to Keep Disabled (the default):
- Simple text documents
- Cost or speed is a concern
- Sufficient search quality without enhancement
Impact:
- Processing Time: Increases by 3-10x depending on content
- Cost: Additional LLM API calls per table/complex structure
- Search Quality: Improved for table-heavy documents
Example:
{
"enhance_content": true
}
extract_information
Purpose: Enables LLM-based entity and relationship extraction to build a knowledge graph.
Type: Boolean
Default: false
When to Enable:
- Need to build a knowledge graph from documents
- Want to identify entities and their relationships
- Require graph-based queries and reasoning
- Need to link information across multiple documents
When to Keep Disabled (the default):
- Only need vector search and RAG capabilities in Voicebox
- Cost or speed is a critical concern
- Documents don’t contain entity-rich content
Impact:
- Processing Time: Increases by 5-20x depending on content
- Cost: Additional LLM API calls per chunk
- Capabilities: Enables knowledge graph queries and relationship discovery
Output:
- RDF triples stored in Stardog
- Entities: People, organizations, locations, concepts
- Relationships: Connections between entities
Example:
{
"extract_information": true
}
Information extraction requires information_extraction_config to be properly configured with LLM settings.
store_list_file_config
Purpose: Controls file discovery behavior.
Type: Object
Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| page_size | integer | 100 | Number of files to fetch per API call (pagination) |
| recursive | boolean | true | Search subdirectories recursively |
| document_types | array | ["document", "pdf"] | File types to include |
| loader_kwargs | object | {} | Additional data-source-specific parameters |
page_size:
- Increase for faster initial file discovery (if data source supports it)
- Decrease if API requests timeout
- Typical range: 50-200
recursive:
- true: Index all files in subdirectories (most common)
- false: Only index files in the specified directory
document_types:
"document": Microsoft Word (DOCX) files"pdf": PDF files- Include both for comprehensive indexing
Example:
{
"store_list_file_config": {
"page_size": 150,
"recursive": true,
"document_types": ["document", "pdf"],
"loader_kwargs": {}
}
}
store_content_loader_config
Purpose: Controls file content fetching behavior.
Type: Object
Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| num_retries | integer | 2 | Number of retry attempts for failed downloads |
| store_loader_kwargs | object | {} | Additional data-source-specific parameters |
num_retries:
- Recommended: 2-3 for reliability
- Network issues and transient failures are common with cloud storage
- Higher values increase job duration if files are genuinely inaccessible
Example:
{
"store_content_loader_config": {
"num_retries": 3,
"store_loader_kwargs": {}
}
}
document_loader_config
Purpose: Defines parsing and chunking strategy per document type.
Type: Object with nested configuration per document type
Structure:
{
"document_loader_config": {
"pdf": { <pdf-specific config> },
"document": { <docx-specific config> }
}
}
Per-Document-Type Configuration:
| Field | Type | Default | Description |
|---|---|---|---|
| chunk_size | integer | 1000 | Maximum characters per chunk |
| chunking_enabled | boolean | true | Enable/disable chunking |
| chunk_separator | array | ["\n\n", "\n", " ", ""] | Separators for splitting text, in priority order |
| chunk_overlap | integer | 0 | Overlapping characters between consecutive chunks |
| loader_kwargs | object | {} | Document parser specific parameters |
| loader_type | string | varies | Parser type (e.g., "py_pdf", "DocxLoader") |
chunk_size:
Determines the size of text segments for indexing.
How to Choose:
- Small (300-500): Better precision, more chunks, slower indexing
- Medium (800-1200): Balanced approach (recommended; the default is 1000)
- Large (1500-2000): Faster indexing, may lose precision
Considerations:
- LLM context window for embeddings
- Average paragraph/section length in your documents
- Trade-off between precision and recall
chunking_enabled:
- true: Documents split into chunks (recommended)
- false: Entire document treated as a single chunk (only for very small documents)
chunk_separator:
Priority-ordered list of separators for splitting text.
Default: ["\n\n", "\n", " ", ""]
- First tries to split on double newlines (paragraphs)
- Then single newlines (lines)
- Then spaces (words)
- Finally characters (as last resort)
Custom Separators:
- Add domain-specific separators (e.g., "===Section===")
- Maintain hierarchical order (largest semantic units first)
chunk_overlap:
Number of characters to overlap between consecutive chunks.
Purpose: Preserve context across chunk boundaries.
When to Use:
- 0 (default): No overlap; faster indexing, distinct chunks
- 100-200: Moderate overlap for general documents
- 300-500: High overlap for complex documents where context is critical
Example:
{
"document_loader_config": {
"pdf": {
"chunk_size": 1000,
"chunking_enabled": true,
"chunk_separator": ["\n\n", "\n", " ", ""],
"chunk_overlap": 0,
"loader_kwargs": {},
"loader_type": "py_pdf"
},
"document": {
"chunk_size": 1000,
"chunking_enabled": true,
"chunk_separator": ["\n\n", "\n", " ", ""],
"chunk_overlap": 0,
"loader_kwargs": {},
"loader_type": "DocxLoader"
}
}
}
Different Configurations by Type:
You can use different settings for PDFs vs. DOCX:
{
"document_loader_config": {
"pdf": {
"chunk_size": 1200,
"chunk_overlap": 200
},
"document": {
"chunk_size": 800,
"chunk_overlap": 100
}
}
}
information_extraction_config
Purpose: Configures LLM-based entity and relationship extraction.
Type: Array of extraction task configurations
Required when: extract_information: true
Structure:
{
"information_extraction_config": [
{
"task_type": "information_extraction",
"kwargs": {},
"llm_config": { <LLM configuration> },
"num_retries": 3,
"prompt_version": "v4",
"extractor_type": "llm",
"query_timeout": 50000
}
]
}
Task Configuration Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| task_type | string | “information_extraction” | Type of extraction task |
| kwargs | object | {} | Additional task-specific parameters |
| llm_config | object | required | LLM model configuration |
| num_retries | integer | 3 | Retry attempts for failed LLM calls |
| prompt_version | string | “v4” | Version of extraction prompt template |
| extractor_type | string | “llm” | Extractor implementation type |
| query_timeout | integer | 50000 | Timeout for LLM queries in milliseconds |
LLM Configuration Fields:
| Field | Type | Description |
|---|---|---|
| max_tokens | integer | Maximum tokens in LLM response |
| temperature | float | Sampling temperature (0.0 = deterministic, 1.0 = creative) |
| repetition_penalty | float | Penalty for repeated tokens |
| top_p | float | Nucleus sampling parameter |
| top_k | integer | Top-k sampling parameter |
| stop | array | Stop sequences for generation |
| llm_name | string | Model identifier (provider-specific) |
| llm_provider | string | LLM provider (e.g., “bedrock”, “fireworks”, “openai”) |
| context_window | integer | Context window size for the model |
LLM Configuration Recommendations:
For Accuracy (recommended for information extraction):
{
"temperature": 0.0,
"repetition_penalty": 1.0,
"top_p": 0.7,
"top_k": 50,
"max_tokens": 8192
}
Supported LLM Providers:
- AWS Bedrock: llm_provider: "bedrock". Models: Llama, Claude, etc. Example: "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0"
- Fireworks AI: llm_provider: "fireworks"
- OpenAI: llm_provider: "openai". Models: GPT-4, GPT-3.5
Complete Example:
{
"information_extraction_config": [
{
"task_type": "information_extraction",
"kwargs": {},
"llm_config": {
"max_tokens": 8192,
"temperature": 0.0,
"repetition_penalty": 1.0,
"top_p": 0.7,
"top_k": 50,
"stop": ["---", "</output_format>", "</output_format>\n", "</output_format>\r\n"],
"llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
"llm_provider": "bedrock",
"context_window": 4000
},
"num_retries": 3,
"prompt_version": "v4",
"extractor_type": "llm",
"query_timeout": 50000
}
]
}
Performance Tuning:
- query_timeout: Increase for complex documents or slow LLM endpoints
- num_retries: Increase for unreliable network connections
- context_window: Match to your LLM’s actual context window to avoid truncation
Information extraction significantly increases processing time and costs due to LLM API calls per chunk. Budget accordingly for large document sets.
Complete Configuration Example
This example demonstrates a production-ready configuration for a large dataset:
{
"list_file_parallelism": 10,
"content_reader_parallelism": 30,
"content_indexer_parallelism": 10,
"document_store_type": "google_drive",
"enhance_content": false,
"extract_information": true,
"store_list_file_config": {
"page_size": 100,
"recursive": true,
"document_types": ["document", "pdf"],
"loader_kwargs": {}
},
"store_content_loader_config": {
"num_retries": 3,
"store_loader_kwargs": {}
},
"document_loader_config": {
"pdf": {
"chunk_size": 1000,
"chunking_enabled": true,
"chunk_separator": ["\n\n", "\n", " ", ""],
"chunk_overlap": 200,
"loader_kwargs": {},
"loader_type": "py_pdf"
},
"document": {
"chunk_size": 1000,
"chunking_enabled": true,
"chunk_separator": ["\n\n", "\n", " ", ""],
"chunk_overlap": 200,
"loader_kwargs": {},
"loader_type": "DocxLoader"
}
},
"information_extraction_config": [
{
"task_type": "information_extraction",
"kwargs": {},
"llm_config": {
"max_tokens": 8192,
"temperature": 0.0,
"repetition_penalty": 1.0,
"top_p": 0.7,
"top_k": 50,
"stop": ["---", "</output_format>", "</output_format>\n", "</output_format>\r\n"],
"llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
"llm_provider": "bedrock",
"context_window": 4000
},
"num_retries": 3,
"prompt_version": "v4",
"extractor_type": "llm",
"query_timeout": 50000
}
]
}
Configuration Best Practices
- Start Conservative: Use default values for initial jobs
- Monitor Performance: Watch for bottlenecks (data source APIs, memory, Stardog load)
- Iterate: Gradually increase parallelism while monitoring
- Test Small: Run a small subset before indexing your entire dataset
- Consider Costs: LLM-based features (enhance_content, extract_information) significantly increase costs
- Match Resources: Ensure Spark cluster and Stardog instance are sized appropriately for your parallelism settings
Querying Indexed Documents
Once documents are indexed, you can query them through the Voicebox UI.
Vector Search Queries
Documents indexed without information extraction can be queried using natural language questions.

Example Questions:
- “What are the main points in the Q4 financial report?”
- “Summarize the product requirements document for Project Alpha”
- “What were the action items from the last board meeting?”
Source Attribution
Voicebox provides source attribution for answers derived from indexed documents.

Hover over “Document Extracted” text to see:
- Source file name
- Page number (for PDFs) or section
- Relevance score
- Direct link to original document (if available)
Knowledge Graph Queries
If information extraction was enabled during indexing, you can ask questions that leverage the knowledge graph.

Example Questions:
- “Who are the key people mentioned in relation to Project Apollo?”
- “What organizations are connected to the merger discussion?”
- “Show me all locations mentioned in the travel policy documents”
Knowledge Graph Benefits:
- Discover relationships across multiple documents
- Find entities and their connections
- Ask graph-based questions (e.g., “What connects X and Y?”)
Source Lineage
Knowledge graph queries provide lineage showing which documents contributed to the answer.

Lineage Information:
- Source documents for each extracted entity
- Document provenance chain
Deployment
This section provides detailed instructions for deploying BITES in your Kubernetes environment.
Deployment Architecture

Components:
- Launchpad: Provides APIs and user interface
- voicebox-service: Manages job lifecycle, interacts with Spark Operator
- Spark Operator: Kubernetes operator for managing Spark applications
- voicebox-bites: Docker image containing the indexing application
- Spark Cluster: Dynamically created driver and executor pods
- Stardog: Database for indexed content and knowledge graphs
Prerequisites
Before deploying BITES, ensure:
- Kubernetes cluster (version 1.19+)
- kubectl configured to access your cluster
- Helm 3.x installed (for Spark Operator installation)
- Spark Operator installed in the cluster
- Access to voicebox-bites Docker image
- Stardog instance deployed and accessible from Kubernetes cluster
Step 1: Install Spark Operator
The Spark Operator manages the lifecycle of Spark applications in Kubernetes.
Installation via Helm:
# Add the Spark Operator Helm repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator
# Update Helm repositories
helm repo update
# Install Spark Operator
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--create-namespace \
--set webhook.enable=true \
--set sparkJobNamespace=default
Verify Installation:
kubectl get pods -n spark-operator
You should see the spark-operator pod running.
Alternative Installation Methods:
See the official Spark Operator documentation for other installation options.
Step 2: Configure Docker Image Access
The voicebox-bites image must be accessible from your Kubernetes cluster.
Option 1: Pull from Stardog JFrog Registry
Request access to the Stardog JFrog registry and configure an image pull secret:
kubectl create secret docker-registry stardog-jfrog-secret \
--docker-server=stardog-stardog-apps.jfrog.io \
--docker-username=YOUR_USERNAME \
--docker-password=YOUR_PASSWORD \
--docker-email=YOUR_EMAIL \
--namespace=default
Option 2: Push to Your Private Registry
- Pull the image from Stardog:
docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current
- Tag and push to your registry:
docker tag stardog-stardog-apps.jfrog.io/voicebox-bites:current \
  your-registry.com/voicebox-bites:current
docker push your-registry.com/voicebox-bites:current
- Update the image reference in vbx_bites_kube_config.yaml
Step 3: Configure vbx_bites_kube_config.yaml
The vbx_bites_kube_config.yaml file defines the Spark application specification.
Sample Configuration:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: voicebox-bites-job
namespace: default
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: "stardog-stardog-apps.jfrog.io/voicebox-bites:current"
imagePullPolicy: Always
imagePullSecrets:
- stardog-jfrog-secret
mainApplicationFile: local:///app/src/voicebox_bites/etl/bulk_document_extraction.py
sparkVersion: "3.5.0"
restartPolicy:
type: Never
driver:
cores: 2
coreLimit: "2000m"
memory: "4g"
labels:
version: 3.5.0
serviceAccount: spark-operator
executor:
cores: 2
instances: 3
memory: "4g"
labels:
version: 3.5.0
Key Configuration Sections:
Image Configuration:
image: "your-registry.com/voicebox-bites:current"
imagePullPolicy: Always
imagePullSecrets:
- your-image-pull-secret
Driver Configuration (controls the Spark driver):
driver:
cores: 2 # CPU cores for driver
coreLimit: "2000m" # Maximum CPU (Kubernetes format)
memory: "4g" # Memory allocation
serviceAccount: spark-operator
Executor Configuration (controls the Spark executors):
executor:
cores: 2 # CPU cores per executor
instances: 3 # Number of executor pods
memory: "4g" # Memory per executor
Sizing Guidelines:
| Dataset Size | Files | Executor Instances | Executor Memory | Executor Cores |
|---|---|---|---|---|
| Small | <100 | 2-3 | 4g | 2 |
| Medium | 100-1000 | 4-6 | 8g | 4 |
| Large | 1000-10000 | 8-12 | 16g | 4 |
| Very Large | >10000 | 15-30 | 16g | 4 |
Important: BITES does not support Kubernetes autoscaling. Configure a fixed number of executor instances and do not scale down while jobs are running.
Step 4: Configure voicebox-service
The voicebox-service needs to know where to find the Spark configuration.
Set Environment Variable:
env:
- name: VBX_BITES_CONFIG_FILE
value: "/config/vbx_bites_kube_config.yaml"
Mount Configuration File:
volumes:
- name: bites-config
configMap:
name: vbx-bites-config
volumeMounts:
- name: bites-config
mountPath: /config
Create ConfigMap:
kubectl create configmap vbx-bites-config \
--from-file=vbx_bites_kube_config.yaml \
--namespace=default
Step 5: Configure RBAC
Ensure the voicebox-service has permissions to manage Spark applications.
Create Service Account:
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark-operator
namespace: default
Create Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: spark-operator-role
namespace: default
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps"]
verbs: ["create", "get", "list", "delete", "update", "watch"]
- apiGroups: ["sparkoperator.k8s.io"]
resources: ["sparkapplications"]
verbs: ["create", "get", "list", "delete", "update", "watch"]
Create RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: spark-operator-rolebinding
namespace: default
subjects:
- kind: ServiceAccount
name: spark-operator
namespace: default
roleRef:
kind: Role
name: spark-operator-role
apiGroup: rbac.authorization.k8s.io
Apply RBAC Configuration:
kubectl apply -f spark-rbac.yaml
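To confirm the binding took effect, you can run the same permission check used later in Troubleshooting:
kubectl auth can-i create sparkapplications \
  --as=system:serviceaccount:default:spark-operator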
Step 6: Configure Networking
Ensure proper network connectivity between components.
Required Connectivity:
- Launchpad → voicebox-service (API calls)
- voicebox-service → Kubernetes API (Spark job management)
- Spark executors → Data sources (Google Drive, S3, etc.)
- Spark executors → Stardog (indexing)
- Spark executors → LLM providers (if using content enhancement or information extraction)
Firewall Rules:
- Allow outbound HTTPS (443) for data source APIs
- Allow outbound connections to Stardog endpoint
- Allow outbound connections to LLM provider APIs
Step 7: Deploy voicebox-service
Deploy the voicebox-service with the configured settings.
Cluster Sizing Recommendations
Minimum Cluster Size:
- 3 nodes
- 4 CPU cores per node
- 16 GB RAM per node
Recommended Production Cluster:
- 5-10 nodes
- 8 CPU cores per node
- 32 GB RAM per node
- 100 GB SSD per node (for temporary storage)
Scaling Considerations:
- Each executor needs dedicated resources
- Driver pod requires resources
- Kubernetes system pods consume resources
- Leave 20-30% capacity headroom
Do not enable cluster autoscaling for nodes running Spark executors. Scale the cluster before starting large jobs and maintain the size throughout job execution.
Logging
Comprehensive logging is essential for monitoring job execution and troubleshooting issues.
Logging Architecture
BITES provides two layers of logging:
- Spark Logging: Framework-level logs (job scheduling, task execution, etc.)
- BITES Application Logging: Application-level logs (document processing, API calls, etc.)
Spark Logging Configuration
Spark logging is configured using Log4j properties.
Setup Steps
- Create log4j.properties:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Reduce verbosity of some packages
log4j.logger.org.apache.spark.storage=WARN
log4j.logger.org.apache.spark.scheduler=WARN
log4j.logger.org.apache.spark.util.Utils=WARN
log4j.logger.org.apache.spark.executor=INFO
Logging Levels:
- ERROR: Only errors
- WARN: Warnings and errors
- INFO: Informational messages (recommended for production)
- DEBUG: Detailed debug information (use for troubleshooting only)
- TRACE: Very verbose (not recommended)
DEBUG and TRACE levels generate extremely large log volumes. Use only for troubleshooting specific issues.
- Create ConfigMap:
kubectl create configmap spark-log4j-config \
--from-file=log4j.properties \
--namespace=default
- Update vbx_bites_kube_config.yaml:
Add the following under spec:
spec:
sparkConf:
"spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"
"spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"
driver:
configMaps:
- name: spark-log4j-config
path: /opt/spark
executor:
configMaps:
- name: spark-log4j-config
path: /opt/spark
- Apply Configuration:
kubectl apply -f vbx_bites_kube_config.yaml
BITES Application Logging
Application logging provides insights into document processing, API interactions, and business logic.
Setup Steps
- Create logging.conf:
[loggers]
keys=root,py4j
[logger_py4j]
level=WARN
handlers=nullHandler
qualname=py4j
propagate=0
[handlers]
keys=consoleHandler,nullHandler
[formatters]
keys=simpleFormatter
[logger_root]
level=INFO
handlers=consoleHandler
[handler_nullHandler]
class=logging.NullHandler
level=CRITICAL
args=()
[handler_consoleHandler]
class=voicebox_bites.logging_setup.FlushingStreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)
[formatter_simpleFormatter]
format=%(asctime)s %(levelname)s [%(job_id)s] %(name)s - %(message)s
Logging Levels:
- INFO: Recommended for production
- DEBUG: Detailed processing information (use for troubleshooting)
- Create ConfigMap:
kubectl create configmap voicebox-bites-log-config \
--from-file=logging.conf \
--namespace=default
- Update vbx_bites_kube_config.yaml:
Add the following to both driver and executor sections:
driver:
volumeMounts:
- name: vbx-bites-logging-config-volume
mountPath: /app/etc/logging.conf
subPath: logging.conf
executor:
volumeMounts:
- name: vbx-bites-logging-config-volume
mountPath: /app/etc/logging.conf
subPath: logging.conf
# Add under spec.volumes
volumes:
- name: vbx-bites-logging-config-volume
configMap:
name: voicebox-bites-log-config
- Apply Configuration:
kubectl apply -f vbx_bites_kube_config.yaml
Custom Log Path:
If you need to use a different path, set the VOICEBOX_BITES_LOG_CONF environment variable:
env:
- name: VOICEBOX_BITES_LOG_CONF
value: "/custom/path/logging.conf"
Accessing Logs
View Driver Logs
# Get driver pod name
DRIVER_POD=$(kubectl get pods -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
# View logs
kubectl logs $DRIVER_POD
# Follow logs in real-time
kubectl logs -f $DRIVER_POD
# Save logs to file
kubectl logs $DRIVER_POD > driver.log
View Executor Logs
# List executor pods
kubectl get pods -l spark-role=executor
# View specific executor logs
kubectl logs voicebox-bites-job-exec-1
# View all executor logs
kubectl logs -l spark-role=executor
# Follow executor logs
kubectl logs -f voicebox-bites-job-exec-1
Centralized Logging
For production deployments with many executors, use centralized logging.
Troubleshooting
This section covers common issues and their solutions.
Permission Issues
Symptom
Error: Failed to create SparkApplication: User cannot create resource "sparkapplications"
Diagnosis
Check if the service account has the required permissions:
# Check permissions
kubectl auth can-i create sparkapplications --as=system:serviceaccount:default:spark-operator
# Check current role bindings
kubectl get rolebindings -o wide | grep spark-operator
Solution
Ensure proper RBAC configuration:
# Verify service account exists
kubectl get serviceaccount spark-operator
# Verify role exists and has correct permissions
kubectl describe role spark-operator-role
# Verify role binding
kubectl describe rolebinding spark-operator-rolebinding
# If missing, apply RBAC configuration
kubectl apply -f spark-rbac.yaml
See Step 5: Configure RBAC for complete RBAC configuration.
Data Source Authentication Errors
Google Drive: Authentication Failed
Symptom:
ERROR: Authentication failed: Invalid credentials
Common Causes:
- Service account JSON is not properly base64 encoded
- Service account doesn’t have access to the folder
- API not enabled in Google Cloud project
Solutions:
- Verify base64 encoding:
# Encode correctly
cat service-account.json | base64 -w 0
# Test decoding
echo "YOUR_BASE64_STRING" | base64 -d | jq .
- Share folder with service account:
- Copy client_email from the service account JSON
- Share the Google Drive folder with this email
- Grant at least “Viewer” permissions
- Enable Google Drive API:
- Go to Google Cloud Console
- Navigate to “APIs & Services” → “Library”
- Search for “Google Drive API”
- Click “Enable”
OneDrive/SharePoint: Invalid Client
Symptom:
ERROR: AADSTS7000215: Invalid client secret provided
Solutions:
- Regenerate client secret:
- Secrets expire after a configured period
- Go to Azure Portal → App registrations → Your app → Certificates & secrets
- Create a new secret
- Update credentials JSON and re-encode
- Verify permissions:
- Check that all required permissions are granted
- Ensure admin consent has been provided
- Wait 5-10 minutes after granting permissions
S3: Access Denied
Symptom:
ERROR: Access Denied (Service: Amazon S3; Status Code: 403)
Solutions:
- Verify IAM permissions:
{
  "Effect": "Allow",
  "Action": [
    "s3:ListBucket",
    "s3:GetObject"
  ],
  "Resource": [
    "arn:aws:s3:::your-bucket",
    "arn:aws:s3:::your-bucket/*"
  ]
}
- Check bucket policy:
- Ensure the bucket policy doesn’t deny access
- Verify the IAM role/user is allowed
- Verify region:
- Ensure region_name in credentials matches the bucket region
Job Execution Issues
Job Stuck in SUBMITTED State
Symptom: Job status remains “SUBMITTED” for extended period.
Diagnosis:
# Check Spark Operator logs
kubectl logs -n spark-operator -l app=spark-operator
# Check pending pods
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor
# Describe pending pods
kubectl describe pod $DRIVER_POD_NAME
Common Causes:
- Insufficient cluster resources
- Image pull errors
- RBAC issues
Solutions:
- Insufficient resources:
# Check node resources
kubectl describe nodes
# Solution: Scale cluster or reduce resource requests
- Image pull errors:
# Check events
kubectl get events --sort-by='.lastTimestamp'
# Solution: Verify image pull secret and image URL
Job Fails with Out of Memory (OOM) Error
Symptom:
ERROR: Executor lost: OutOfMemoryError: Java heap space
Solutions:
- Increase executor memory:
executor:
  memory: "8g"  # Increase from 4g
- Reduce parallelism:
{ "content_reader_parallelism": 10 }  // Reduce from 30
- Reduce batch size:
{ "batch_size": 500 }  // Reduce from 1000
- Increase number of executors (distribute load):
executor:
  instances: 6  # Increase from 3
  memory: "4g"  # Keep same memory per executor
Job Fails with Rate Limit Errors
Symptom:
WARN: Rate limit exceeded for Google Drive API
Solutions:
- Reduce parallelism:
{ "list_file_parallelism": 3, "content_reader_parallelism": 5 }
- Request a quota increase from the data source provider
Indexing Issues
Documents Not Appearing in Voicebox
Diagnosis Steps:
- Verify job completed successfully:
curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/$JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
- Check driver logs for indexing confirmation:
kubectl logs $DRIVER_POD | grep "Indexed"
- Verify Stardog contains data:
SELECT (COUNT(*) as ?count) WHERE { ?s ?p ?o }
Solutions:
- Job failed silently:
- Check logs for errors
- Rerun job with DEBUG logging
- Wrong database or graph:
- Verify Stardog connection details
- Check that Voicebox is querying the correct database/graph
- Stardog connectivity issue:
- Test connectivity from Spark executor to Stardog
- Check firewall rules
Logging Issues
No Logs Appearing
Diagnosis:
# Check if pods exist
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor
# Check pod status
kubectl describe pod $POD_NAME
# Check if ConfigMaps mounted correctly
kubectl exec $DRIVER_POD -- ls -la /opt/spark
kubectl exec $DRIVER_POD -- ls -la /app/etc
Solutions:
- ConfigMap not mounted:
- Verify ConfigMap exists: kubectl get configmap
- Check volumeMounts in pod spec
- Verify path in VOICEBOX_BITES_LOG_CONF
- Wrong log level:
- Check logging configuration
- Ensure not set to ERROR or CRITICAL only
- Logs going to wrong destination:
- Verify log handler configuration
- Check stdout/stderr redirection
Logs Too Verbose
Solution:
Change log level from DEBUG to INFO:
For Spark Logs (log4j.properties):
log4j.rootCategory=INFO, console
For BITES Logs (logging.conf):
[logger_root]
level=INFO
Network Issues
Cannot Access Data Source
Symptom:
ERROR: Connection timeout when accessing data source
Solutions:
- Firewall blocking outbound connections:
- Allow HTTPS (443) egress
- Add data source domains to allow list
- Network policy blocking traffic:
- Review network policies
- Add exception for voicebox-bites pods
- DNS resolution issues:
- Check DNS configuration in cluster
- Verify CoreDNS is functioning
Cannot Connect to Stardog
Symptom:
ERROR: Connection refused: Stardog endpoint
Diagnosis:
# Test from driver pod
kubectl exec $DRIVER_POD -- curl -v http://stardog:5820/
# Check Stardog service
kubectl get svc stardog
Solutions:
- Stardog not accessible:
- Verify Stardog is running
- Check service endpoint
- Verify network policies
- Wrong endpoint:
- Check Stardog connection configuration
- Verify port (default 5820)
Getting Help
If you cannot resolve an issue:
- Collect diagnostic information:
- Job ID
- Driver and executor logs
- Spark Operator logs
- Job configuration
- Error messages
- Check the relevant sections of this documentation and the Additional Resources
Docker Image Availability
Latest Version
Pull the most recent voicebox-bites image:
docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current
Specific Versions
Pull a specific version (e.g., v0.2.0):
docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:v0.2.0