
Using Unstructured Data with Voicebox

A comprehensive guide for developers to integrate and manage unstructured data with Stardog Voicebox using the BITES (Blob Indexing and Text Enrichment with Semantics) system.

Page Contents
  1. Overview
    1. What is BITES?
    2. Supported Data Sources
    3. Supported Document Formats
    4. Key Capabilities
    5. Architecture Overview
  2. Quick Start
    1. Prerequisites Checklist
    2. 5-Minute Setup (Google Drive)
  3. Prerequisites and Requirements
    1. Data Source Configuration
      1. Google Drive
      2. OneDrive
      3. Microsoft SharePoint
      4. Dropbox
      5. Amazon S3
      6. Local Storage
    2. API Key Generation
    3. Stardog Connection Configuration
      1. Option 1: Stardog JWT Token
      2. Option 2: SSO Authentication (Azure/Okta/Ping)
    4. LLM Provider Configuration
      1. Required Environment Variables by Provider
        1. AWS Bedrock
        2. Fireworks AI
        3. OpenAI
        4. Azure OpenAI
      2. Configuring Environment Variables in Kubernetes
    5. Deployment Prerequisites
  4. API Reference
    1. Authentication
    2. Base URL
    3. API Endpoints Overview
    4. Initiate Indexing Job
      1. Endpoint
      2. Request Headers
      3. Request Body Parameters
      4. Credentials Format by Data Source
      5. Minimal Request Example
      6. Complete Request Example with All Options
      7. Response
      8. Response Fields
    5. Get Job Status
      1. Endpoint
      2. Path Parameters
      3. Query Parameters
      4. Request Example
      5. Response
      6. Response Fields
      7. Status Codes
      8. Polling Recommendations
    6. Cancel Job
      1. Endpoint
      2. Path Parameters
      3. Request Body Parameters
      4. Request Example
      5. Response
      6. Response Fields
  5. Document Indexing Pipeline
    1. Pipeline Stages
      1. 1. List Directories
      2. 2. List Supported Files
      3. 3. Fetch Content and Metadata
      4. 4. Parse and Chunk Content
      5. 5. Information Extraction (Optional)
      6. 6. Index Chunks
    2. Pipeline Performance Considerations
  6. Job Configuration
    1. Minimal Configuration
    2. Complete Configuration Reference
    3. Performance Tuning
    4. Document Store Type
    5. extract_information
    6. store_list_file_config
    7. store_content_loader_config
    8. document_loader_config
    9. information_extraction_config
      1. Task Configuration Fields
      2. Extractor Types
      3. LLM Configuration
      4. Advanced IE Configuration (kwargs)
        1. Entity Scope
        2. Custom IE Instructions
        3. Entity Resolution
        4. Batch Processing
      5. Complete IE Configuration Example
  7. Querying Indexed Documents
    1. Vector Search Queries
    2. Source Attribution
    3. Knowledge Graph Queries
    4. Source Lineage
  8. Deployment
    1. Deployment Architecture
    2. Prerequisites
    3. Step 1: Install Spark Operator
    4. Step 2: Configure Docker Image Access
    5. Step 3: Configure vbx_bites_kube_config.yaml
    6. Step 4: Configure voicebox-service
    7. Step 5: Configure RBAC
    8. Step 6: Configure Networking
    9. Step 7: Deploy voicebox-service
    10. Cluster Sizing Recommendations
  9. Logging
    1. Logging Architecture
    2. Spark Logging Configuration
      1. Setup Steps
    3. BITES Application Logging
      1. Setup Steps
    4. Accessing Logs
      1. View Driver Logs
      2. View Executor Logs
    5. Centralized Logging
  10. Troubleshooting
    1. Permission Issues
      1. Symptom
      2. Diagnosis
      3. Solution
    2. Data Source Authentication Errors
      1. Google Drive: Authentication Failed
      2. OneDrive/SharePoint: Invalid Client
      3. S3: Access Denied
    3. Job Execution Issues
      1. Job Stuck in SUBMITTED State
      2. Job Fails with Out of Memory (OOM) Error
      3. Job Fails with Rate Limit Errors
    4. Indexing Issues
      1. Documents Not Appearing in Voicebox
    5. Logging Issues
      1. No Logs Appearing
      2. Logs Too Verbose
    6. Network Issues
      1. Cannot Access Data Source
      2. Cannot Connect to Stardog
    7. Getting Help
  11. Docker Image Availability
    1. Latest Version
    2. Specific Versions
  12. Additional Resources

Overview

BITES (Blob Indexing and Text Enrichment with Semantics) is Stardog Voicebox’s unstructured data support system. It enables ingestion of documents from various cloud storage providers and local sources, allowing users to query both structured and unstructured data through Voicebox’s conversational AI interface.

What is BITES?

BITES provides an API-first approach to indexing and querying unstructured documents alongside your structured data in Stardog. The system leverages Apache Spark for distributed processing and integrates with your existing Kubernetes infrastructure.

Supported Data Sources

  • Google Drive - Cloud document storage
  • Microsoft OneDrive - Personal and business cloud storage
  • Microsoft SharePoint - Enterprise document management (Document Library only)
  • Dropbox - Cloud file storage
  • Amazon S3 - Object storage
  • Local Storage - File system accessible from Kubernetes environment

Supported Document Formats

  • Microsoft Word (DOCX)
  • PDF

The system currently supports parsing and indexing of textual and tabular data. Image parsing within documents is planned for a future release.

Key Capabilities

  • Data Ingestion: Automated ingestion from multiple data source types
  • Unified Querying: Query both structured and unstructured data through a single Voicebox interface
  • API-First Design: All functionality accessible through Launchpad’s public APIs
  • Distributed Processing: Spark-based job execution in your Kubernetes environment
  • Job Management: Full lifecycle management including status monitoring and cancellation
  • Vector Indexing: Document chunks indexed in Stardog’s vector store for semantic search
  • Knowledge Graph Creation: Optional extraction of entities and relationships from documents to build knowledge graphs

Beta Features: Information extraction and knowledge graph creation are currently in Beta. These features enable extraction of structured entities and relationships from unstructured text.

Architecture Overview

Architecture Diagram

System Flow:

  1. User initiates data ingestion and indexing via Launchpad’s public APIs
  2. Voicebox service creates and submits a Spark job to the Kubernetes cluster
  3. Spark job processes documents: reads from source, parses content, chunks text
  4. Processed chunks are indexed in Stardog’s vector store
  5. (Optional) If information extraction is enabled, entities and relationships are extracted and stored as a knowledge graph
  6. User queries Voicebox, which retrieves answers from both structured data and indexed documents

Quick Start

This section provides a fast-track setup example for indexing Google Drive documents.

Prerequisites Checklist

  • Kubernetes cluster with Spark Operator installed
  • Voicebox service and voicebox-bites containers running
  • Launchpad API key generated
  • Data source credentials configured
  • Stardog connection configured

5-Minute Setup (Google Drive)

  1. Configure Google Drive (see Google Drive Configuration)
  2. Get API Key from Launchpad → “Manage API Keys”
  3. Base64 encode your Google service account JSON
  4. Call the API (see example below)
  5. Monitor job using the job_id returned
# Example: Initiate indexing job
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "your-folder-id",
    "credentials": "BASE64_ENCODED_SERVICE_ACCOUNT_JSON",
    "job_name": "my-indexing-job",
    "job_config": {
      "document_store_type": "google_drive",
      "extract_information": false
    }
  }'
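
To monitor the job (step 5), poll the status endpoint with the returned job_id:

# Example: check job status
curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/YOUR_JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"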

Prerequisites and Requirements

Data Source Configuration

Before using BITES APIs, you must configure access credentials for your data sources. Each provider requires specific setup steps.

Google Drive

Setup Steps:

  1. Navigate to Google Cloud Console
  2. Create a new project or select an existing one
  3. Enable the Google Drive API
  4. Go to “APIs & Services” → “OAuth consent screen” and configure the consent screen
  5. Go to “Credentials” → “Create Credentials” → “Service Account”
  6. Create a Service Account with appropriate permissions
  7. In the Service Account details, go to the “Keys” tab
  8. Click “Add Key” → “Create new key” → “JSON”
  9. Download the JSON key file

Required API Scope:

Google Drive Scope

Add the following scope: https://www.googleapis.com/auth/drive.readonly

IAM Configuration:

Google Drive Key Setup

Service Account JSON Structure:

{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "your-private-key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n",
  "client_email": "your-service-account@your-project.iam.gserviceaccount.com",
  "client_id": "your-client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-service-account",
  "universe_domain": "googleapis.com"
}

Security Note: Store service account credentials securely. Never commit them to version control. Use environment variables or secret management systems.

Sharing Documents:

To allow the service account to access specific folders:

  1. Copy the client_email from the service account JSON
  2. Share the Google Drive folder with this email address
  3. Grant appropriate permissions (Viewer for read-only access)

OneDrive

Setup Steps:

  1. Go to Azure Portal
  2. Navigate to “Azure Active Directory” → “App registrations”
  3. Click “New registration”
  4. Provide a name and configure redirect URIs if needed
  5. After registration, note the Application (client) ID and Directory (tenant) ID from the overview page
  6. Go to “Certificates & Secrets” → “Client secrets” → “New client secret”
  7. Copy the secret value (not the secret ID)

Required Microsoft Graph Permissions:

  • Files.Read
  • Files.Read.All (Delegated)
  • Files.Read.All (Application)
  • offline_access (Delegated)
  • openid (Delegated)
  • Sites.Read.All (Delegated)
  • User.Read (Delegated)

Azure Permissions

Grant Admin Consent:

After adding permissions, click “Grant admin consent” to approve them for your organization.

Secret Configuration:

Azure Secret

Credentials JSON Structure:

{
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret"
}

Microsoft SharePoint

Only Document Library is currently supported for SharePoint.

Setup Steps:

  1. In the Azure Portal, open Microsoft Entra ID (formerly Azure Active Directory)
  2. Create and register an application in Entra ID
  3. Note the Application (client) ID from the overview page
  4. Go to “Certificates & Secrets” and create a new client secret
  5. Copy the secret value

Required Microsoft Graph Permissions:

  • Files.Read
  • Files.ReadAll (Delegated)
  • Sites.ReadAll (Delegated)
  • Sites.ReadAll (Application)
  • User.Read (Delegated)

Required SharePoint Permissions:

  • AllSites.Read (Delegated)
  • MyFiles.Read (Delegated)
  • Sites.Read.All (Application)
  • Sites.Select.All (Application)
  • User.Read.All (Application)

SharePoint Permissions

Some permissions require M365 administrator approval. Contact your administrator to grant these permissions.

Credentials JSON Structure:

{
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret"
}

Additional Required Information:

When calling the indexing API for SharePoint, you must also provide:

  • host_name: Your SharePoint host (e.g., “yourcompany.sharepoint.com”)
  • site_id: The SharePoint site ID
  • library_name: The document library name

Dropbox

Setup Steps:

  1. Go to Dropbox App Console
  2. Click “Create app”
  3. Choose “Scoped access” and “Full Dropbox” or “App folder”
  4. Provide an app name
  5. Go to the “Permissions” tab and enable required scopes:
    • files.metadata.read
    • files.content.read
  6. Note your App key and App secret

OAuth Authorization Flow:

  1. In your browser, visit:
    https://www.dropbox.com/oauth2/authorize?client_id=<APP_KEY>&token_access_type=offline&response_type=code
    

    Replace <APP_KEY> with your actual app key.

  2. Log in to Dropbox and approve the app

  3. Copy the authorization code from the redirect URL

  4. Exchange the authorization code for tokens:
    curl -X POST https://api.dropboxapi.com/oauth2/token \
      -d code=<AUTHORIZATION_CODE> \
      -d grant_type=authorization_code \
      -d client_id=<APP_KEY> \
      -d client_secret=<APP_SECRET>
    
  5. The response contains:
    • access_token: Current access token
    • refresh_token: Long-lived token for obtaining new access tokens

Credentials JSON Structure:

{
  "access_token": "your-current-access-token",
  "refresh_token": "your-refresh-token",
  "client_id": "your-app-key",
  "client_secret": "your-app-secret"
}

The BITES connector automatically handles token refresh. If the access token expires, it uses the refresh token to obtain a new one.

Amazon S3

BITES supports two authentication options for S3: IAM roles (recommended) and access keys.

Option 1: IAM Roles (Recommended for AWS-hosted applications)

  1. Go to AWS Management Console → IAM
  2. Click “Roles” → “Create role”
  3. Select the service that will assume this role (e.g., EC2, EKS)
  4. Attach permissions policies:
    • AmazonS3ReadOnlyAccess (for read-only access)
    • Or create a custom policy with minimal required permissions
  5. Review and create the role
  6. Attach this role to your Kubernetes nodes or pods

Option 2: Access Key and Secret Key

  1. Go to AWS Management Console → IAM
  2. Click “Users” → “Add user”
  3. Provide a username and select “Programmatic access”
  4. Attach permissions policies (e.g., AmazonS3ReadOnlyAccess)
  5. Complete user creation
  6. Download the CSV file containing the Access Key ID and Secret Access Key

Security Best Practice: Use IAM roles when possible. If using access keys, rotate them regularly and store them securely.

Required S3 Permissions:

Your IAM role or user must have the following permissions on the target bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucketVersions"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Bucket Configuration:

  1. Open AWS Console and navigate to S3
  2. Select your bucket
  3. Go to “Permissions” tab
  4. Ensure the IAM role/user has the required permissions listed above
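
Optionally, verify access before running a job. A quick check with the AWS CLI, assuming it is configured with the same credentials or role:

# Listing succeeds only if s3:ListBucket is granted on the bucket
aws s3 ls s3://your-bucket-name --region us-east-1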

Credentials JSON Structure:

{
  "aws_access_key_id": "your-access-key",
  "aws_secret_access_key": "your-secret-key",
  "region_name": "us-east-1",
  "use_iam_role": false
}

For IAM Role Authentication:

{
  "region_name": "us-east-1",
  "use_iam_role": true
}

Additional Required Information:

When calling the indexing API for S3, you must also provide:

  • bucket: The S3 bucket name in the extra_args parameter

Local Storage

Local storage allows indexing of files directly accessible from the Kubernetes environment.

Requirements:

  • The directory must be accessible from the Spark executors
  • Use Kubernetes volumes (PersistentVolumes, ConfigMaps, or mounted storage)
  • Ensure appropriate read permissions

Credentials:

No credentials are required for local storage. Pass an empty JSON object (base64 encoded):

{}
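
For example:

# Base64-encode the empty credentials object
echo -n '{}' | base64
# Output: e30=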

Directory Path:

Provide the absolute path to the directory within the container filesystem (e.g., /mnt/data/documents).

API Key Generation

All BITES APIs require authentication using a Launchpad API key.

Steps to Generate API Key:

  1. Log in to Launchpad
  2. Navigate to “Manage API Keys”
  3. Click “Create API Key”
  4. Provide a name and select the database
  5. Copy and securely store the API key

API keys provide full access to your Voicebox instance. Store them securely and never expose them in client-side code or public repositories.

Using the API Key:

Include the API key in the Authorization header of all API requests:

Authorization: Bearer YOUR_API_KEY

Stardog Connection Configuration

BITES jobs need to connect to Stardog to index documents. Two authentication options are supported.

Option 1: Stardog JWT Token

Prerequisites:

  • A valid JWT token issued by your Stardog server (the same server provided as the endpoint in the API)

How it Works:

Obtain a JWT token from the Stardog server and pass it via the X-SD-Auth-Token header when calling the BITES API. The token is forwarded to the Spark job for authenticating against Stardog during indexing.

Token Expiry: Set token expiry based on the expected job duration. For very long-running jobs, consider breaking the work into smaller batches to avoid creating tokens with very long expiration times.
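
For example, a job-creation call with JWT authentication adds the header to the standard request (the request body file name below is illustrative; see the API Reference for the body format):

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "X-SD-Auth-Token: YOUR_STARDOG_JWT" \
  -H "Content-Type: application/json" \
  -d @job-request.json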

Option 2: SSO Authentication (Azure/Okta/Ping)

Prerequisites:

  • SSO provider configured and integrated with Stardog
  • Valid refresh token from SSO provider
  • SSO provider client ID

How it Works:

When you initiate a job, the system calls the SSO provider to fetch an access token using the refresh token. This access token is then passed to the Spark job.

Configuration:

When calling the indexing API, provide:

  • sso_provider_client_id: Your SSO provider’s client ID
  • refresh_token: A valid refresh token from your SSO provider

LLM Provider Configuration

If you plan to use information extraction (extract_information: true), you must configure environment variables for your LLM provider. These variables must be available on both the driver and executor nodes.

Required Environment Variables by Provider

AWS Bedrock
env:
  - name: AWS_ACCESS_KEY_ID
    value: "your-aws-access-key-id"
  - name: AWS_SECRET_ACCESS_KEY
    value: "your-aws-secret-access-key"
  - name: AWS_REGION
    value: "us-east-1"

For AWS Bedrock, ensure your IAM user/role has bedrock:InvokeModel permission for the models you plan to use.

Fireworks AI
env:
  - name: FIREWORKS_API_KEY
    value: "your-fireworks-api-key"
OpenAI
env:
  - name: OPENAI_API_KEY
    value: "your-openai-api-key"
Azure OpenAI
env:
  - name: AZURE_OPENAI_API_KEY
    value: "your-azure-openai-key"
  - name: AZURE_OPENAI_ENDPOINT
    value: "https://your-resource.openai.azure.com/"

Configuring Environment Variables in Kubernetes

Option 1: Kubernetes Secrets (Recommended). Store the provider credentials in a Kubernetes Secret, then reference it in vbx_bites_kube_config.yaml, as sketched below.
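
A minimal sketch, assuming OpenAI; the secret name llm-provider-credentials is illustrative:

# Create a Secret holding the provider API key
kubectl create secret generic llm-provider-credentials \
  --from-literal=OPENAI_API_KEY=your-openai-api-key \
  --namespace=default

Then reference it in the driver and executor sections of vbx_bites_kube_config.yaml (exact field support depends on your Spark Operator version):

driver:
  env:
    - name: OPENAI_API_KEY
      valueFrom:
        secretKeyRef:
          name: llm-provider-credentials   # Secret created above
          key: OPENAI_API_KEY
executor:
  env:
    - name: OPENAI_API_KEY
      valueFrom:
        secretKeyRef:
          name: llm-provider-credentials
          key: OPENAI_API_KEY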

Option 2: Direct Environment Variables (Not Recommended for Production)

Deployment Prerequisites

Before running indexing jobs, ensure your Kubernetes environment is properly configured.

Required Components:

  • Kubernetes cluster (version 1.19+)
  • Spark Operator installed in the cluster
  • voicebox-bites Docker image accessible
  • voicebox-service running and configured
  • Network connectivity between Launchpad, voicebox-service, and voicebox-bites

See the Deployment section for detailed setup instructions.


API Reference

All BITES functionality is accessed through RESTful APIs. This section provides complete API documentation with examples.

Authentication

All API requests must include an Authorization header with your Launchpad API key:

Authorization: Bearer YOUR_API_KEY

Base URL

https://your-launchpad-url/api/v1/voicebox/bites

Replace your-launchpad-url with your actual Launchpad instance URL.

API Endpoints Overview

| Endpoint | Method | Description |
|---|---|---|
| /jobs | POST | Initiate a new indexing job |
| /jobs/{job_id} | GET | Get the status of a job |
| /jobs/{job_id}/cancel | POST | Cancel a running job |

Initiate Indexing Job

Creates and starts a new indexing job in the Spark environment.

Endpoint

POST /api/v1/voicebox/bites/jobs

Request Headers

Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

Request Body Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| database | string | Yes | Stardog database name where indexing output (vector chunks, knowledge graphs) will be stored |
| endpoint | string | Yes | Stardog endpoint URL to connect to (e.g., https://your-stardog-instance:5820) |
| model | string | No | Model/ontology name in Stardog (e.g., my_org:c360). When information extraction is enabled, this ontology defines the entity types and relationships that the system will extract |
| directory | string | Yes | Directory location or ID. For Google Drive: folder ID; OneDrive: folder path; Local: absolute path |
| credentials | string | Yes | Base64-encoded JSON containing data source credentials. See Data Source Configuration for format |
| job_name | string | Yes | Unique name for the job (used for tracking and management) |
| job_namespace | string | No | Kubernetes namespace for the job. Defaults to the namespace in vbx_bites_kube_config.yaml |
| batch_size | integer | No | Number of chunks to commit at once. Default: 1000. Increase for better performance; decrease if memory-constrained |
| job_config | object | Yes | Configuration controlling scalability and functionality. See Job Configuration |
| sso_provider_client_id | string | Conditional | Required for SSO authentication. SSO provider’s client ID |
| refresh_token | string | Conditional | Required for SSO authentication. Valid refresh token from SSO provider |
| extra_args | object | No | Additional arguments specific to the data source type. See below |

Extra Args by Data Source:

| Data Source | Extra Args Required | Example |
|---|---|---|
| OneDrive | one_drive_id | {"one_drive_id": "b!drive_id"} |
| SharePoint | host_name, site_id, library_name | {"host_name": "company.sharepoint.com", "site_id": "site-id", "library_name": "Documents"} |
| S3 | bucket, prefix | {"bucket": "my-documents-bucket", "prefix": "s3-path"} |
| Google Drive | None | - |
| Dropbox | None | - |
| Local | None | - |

Credentials Format by Data Source

Before passing to the API, base64-encode the JSON credentials. See Data Source Configuration for the required JSON structure for each provider.
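
For example, on Linux:

# Produce a single-line base64 string (GNU coreutils)
base64 -w 0 credentials.json
# On macOS, use: base64 -i credentials.json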

Minimal Request Example

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "database": "my_database",
    "endpoint": "https://your-stardog-instance:5820",
    "directory": "1A2B3C4D5E6F7G8H9I",
    "credentials": "eyJ0eXBlIjoic2VydmljZV9hY2NvdW50IiwicHJvamVjdF9pZCI6InlvdXItcHJvamVjdCJ9",
    "job_name": "index-google-drive-docs",
    "job_config": {
      "document_store_type": "google_drive",
      "extract_information": false
    }
  }'

Complete Request Example with All Options

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "database": "my_database",
    "endpoint": "https://your-stardog-instance:5820",
    "model": "my_org:c360",
    "directory": "b!AbCdEf123456",
    "credentials": "eyJ0ZW5hbnRfaWQiOiJ5b3VyLXRlbmFudC1pZCIsImNsaWVudF9pZCI6InlvdXItY2xpZW50LWlkIiwiY2xpZW50X3NlY3JldCI6InlvdXItc2VjcmV0In0=",
    "job_name": "onedrive-quarterly-reports",
    "job_namespace": "voicebox-production",
    "batch_size": 2000,
    "sso_provider_client_id": "your-sso-client-id",
    "refresh_token": "your-refresh-token",
    "extra_args": {
      "one_drive_id": "b!AbCdEf123456"
    },
    "job_config": {
      "list_file_parallelism": 10,
      "content_reader_parallelism": 20,
      "content_indexer_parallelism": 10,
      "document_store_type": "onedrive",
      "extract_information": true,
      "store_list_file_config": {
        "page_size": 100,
        "recursive": true,
        "document_types": ["document", "pdf"]
      },
      "store_content_loader_config": {
        "num_retries": 3,
        "store_loader_kwargs": {}
      },
      "document_loader_config": {
        "pdf": {
          "chunk_size": 1000,
          "chunking_enabled": true,
          "chunk_separator": ["\\n\\n", "\\n", ". ", " ", ""],
          "chunk_overlap": 200
        },
        "document": {
          "chunk_size": 1000,
          "chunking_enabled": true,
          "chunk_separator": ["\\n\\n", "\\n", ". ", " ", ""],
          "chunk_overlap": 200
        }
      },
      "information_extraction_config": [
        {
          "task_type": "information_extraction",
          "extractor_type": "llm",
          "llm_config": {
            "max_tokens": 8192,
            "temperature": 0.0,
            "repetition_penalty": 1.0,
            "top_p": 0.7,
            "top_k": 50,
            "stop": ["---", "</output_format>"],
            "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
            "llm_provider": "bedrock",
            "context_window": 128000
          },
          "num_retries": 3,
          "query_timeout": 50000
        }
      ]
    }
  }'

Response

Success Response (HTTP 200):

{
  "job_id": "spark-app-1234567890-abcdef",
  "error": null
}

Error Response (HTTP 400/500):

{
  "job_id": null,
  "error": "Failed to create job: Invalid credentials format"
}

Response Fields

| Field | Type | Description |
|---|---|---|
| job_id | string or null | Unique identifier for the created job. Use this to check status or cancel the job |
| error | string or null | Error message if job creation failed; null otherwise |

Get Job Status

Retrieves the current status of an indexing job.

Endpoint

GET /api/v1/voicebox/bites/jobs/{job_id}

Path Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Yes | Job ID returned when the job was created |

Query Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| job_namespace | string | No | Kubernetes namespace of the job. Defaults to the namespace in vbx_bites_kube_config.yaml |

Request Example

curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/spark-app-1234567890-abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

Success Response (HTTP 200):

{
  "status_code": "RUNNING",
  "status": "Job is processing documents. Completed 45 of 100 files."
}

Job Not Found (HTTP 404):

{
  "status_code": "UNKNOWN",
  "status": "Job not found"
}

Response Fields

| Field | Type | Description |
|---|---|---|
| status_code | string | Current state of the job. See status codes below |
| status | string | Human-readable status message with additional details |

Status Codes

| Status Code | Description |
|---|---|
| NEW | Job created but not yet submitted to Spark |
| SUBMITTED | Job submitted to the Spark cluster, waiting for resources |
| RUNNING | Job actively processing documents |
| PENDING_RERUN | Job failed and is waiting to be retried |
| INVALIDATING | Job is being invalidated |
| SUCCEEDING | Job is in the process of completing successfully |
| COMPLETED | Job finished successfully |
| ERROR | Job encountered a non-recoverable error |
| FAILING | Job is in the process of failing |
| FAILED | Job failed |
| UNKNOWN | Job status cannot be determined (job may not exist) |

Polling Recommendations

  • Poll every 10-30 seconds for jobs expected to complete quickly
  • Poll every 1-5 minutes for long-running jobs
  • Stop polling when status is COMPLETED, FAILED, or ERROR
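
A simple polling loop sketch in bash, assuming jq is installed and using the terminal states listed above:

JOB_ID="spark-app-1234567890-abcdef"
while true; do
  # Query the status endpoint documented above
  STATUS=$(curl -s "https://your-launchpad-url/api/v1/voicebox/bites/jobs/$JOB_ID" \
    -H "Authorization: Bearer YOUR_API_KEY" | jq -r '.status_code')
  echo "status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED|ERROR) break ;;   # terminal states: stop polling
  esac
  sleep 30
done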

Cancel Job

Cancels a running or pending indexing job.

Endpoint

POST /api/v1/voicebox/bites/jobs/{job_id}/cancel

Path Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Yes | Job ID of the job to cancel |

Request Body Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| job_name | string | Yes | Name of the job to cancel (must match the name used when creating the job) |
| job_namespace | string | No | Kubernetes namespace of the job. Defaults to the namespace in vbx_bites_kube_config.yaml |

Request Example

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs/spark-app-1234567890-abcdef/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "job_name": "index-google-drive-docs"
  }'

Response

Success Response (HTTP 200):

{
  "success": true,
  "error": null
}

Error Response (HTTP 400/500):

{
  "success": false,
  "error": "Job not found or already completed"
}

Response Fields

| Field | Type | Description |
|---|---|---|
| success | boolean | True if the job was successfully canceled, false otherwise |
| error | string or null | Error message if cancellation failed; null otherwise |

Canceling a job may take a few moments. The Spark operator will gracefully terminate the running executors. Already indexed documents will remain in Stardog.


Document Indexing Pipeline

Understanding the indexing pipeline helps you configure jobs effectively and troubleshoot issues.

Pipeline Diagram

Pipeline Stages

1. List Directories

Purpose: Enumerate all directories within the specified location.

Configuration: Controlled by list_file_parallelism and recursive settings.

Output: List of directories to scan for files.

2. List Supported Files

Purpose: Identify all supported document types (PDF, DOCX) within the directories.

Configuration: Filtered by document_types in store_list_file_config.

Output: List of file paths/IDs with metadata (size, modification date, etc.).

3. Fetch Content and Metadata

Purpose: Download file content from the data source and extract metadata.

Configuration: Controlled by content_reader_parallelism. Retries configured via num_retries.

Metadata Captured:

  • File name and path
  • Creation and modification dates
  • Author information (if available)
  • File size
  • MIME type

Output: Raw file content and associated metadata.

4. Parse and Chunk Content

Purpose: Extract text from documents and split into manageable chunks.

Parsing:

  • PDF: Text and table extraction from PDF files
  • DOCX: Text and table extraction from Word documents

Chunking:

  • Split text based on chunk_size and chunk_separator
  • Apply chunk_overlap to preserve context between chunks
  • Maintain metadata association with each chunk

Configuration: Controlled by document_loader_config.

Output: Array of text chunks with metadata.

5. Information Extraction (Optional)

Purpose: Extract structured entities and relationships to build a knowledge graph.

When to Use:

  • You want to build a knowledge graph from unstructured documents
  • Need to identify entities (people, organizations, locations) and their relationships
  • Want to enable graph-based queries alongside vector search

Process:

  • LLM analyzes each chunk to identify entities and relationships
  • Extracts triples in RDF format (subject-predicate-object)
  • Optionally resolves entities within each document

Configuration: Set extract_information: true and configure information_extraction_config.

Entity resolution operates at the document level. Entities are resolved and linked within each document’s context, not across the entire dataset.

Output: RDF triples representing extracted knowledge.

Cost Consideration: This step makes additional LLM API calls per chunk, significantly increasing processing time and cost.

6. Index Chunks

Purpose: Store processed chunks and knowledge graph in Stardog.

Indexing Operations:

  • Vector Indexing: Chunks are embedded and stored in Stardog’s vector store
  • Metadata Indexing: File metadata stored for filtering and source attribution
  • Knowledge Graph Storage: Extracted triples stored in specified graph (if information extraction enabled)

Configuration: Controlled by content_indexer_parallelism and batch_size.

Output: Indexed and searchable content in Stardog.

Pipeline Performance Considerations

Bottlenecks:

  1. Data Source API Limits: Google Drive and OneDrive enforce API rate limits
  2. Network Bandwidth: Large files take time to download
  3. LLM Processing: Content enhancement and information extraction are slow
  4. Stardog Ingestion: High parallelism can overload Stardog

Optimization Tips:

  • Start with conservative parallelism settings
  • Monitor data source rate limits
  • Use content enhancement and information extraction only when necessary
  • Increase batch_size for better ingestion throughput
  • Scale Stardog appropriately for your indexing load

Job Configuration

The job_config parameter controls both scalability and functionality of the indexing pipeline. Most options have sensible defaults — you only need to specify what you want to customize.

Minimal Configuration

{
  "document_store_type": "google_drive"
}

This uses all defaults: PDF and DOCX document types, recursive file listing, no information extraction.

Complete Configuration Reference

Below is a complete configuration showing all available options with their defaults:

{
  "list_file_parallelism": 5,
  "content_reader_parallelism": 10,
  "content_indexer_parallelism": 5,
  "document_store_type": "google_drive",
  "extract_information": false,
  "store_list_file_config": {
    "page_size": 100,
    "recursive": true,
    "document_types": ["document", "pdf"],
    "loader_kwargs": {}
  },
  "store_content_loader_config": {
    "num_retries": 2,
    "store_loader_kwargs": {}
  },
  "document_loader_config": {
    "pdf": {
      "chunk_size": 1000,
      "chunking_enabled": true,
      "chunk_separator": ["\n\n", "\n", ". ", " ", ""],
      "chunk_overlap": 0,
      "loader_type": "py_pdf",
      "loader_kwargs": {}
    },
    "document": {
      "chunk_size": 1000,
      "chunking_enabled": true,
      "chunk_separator": ["\n\n", "\n", ". ", " ", ""],
      "chunk_overlap": 0,
      "loader_type": "DocxLoader",
      "loader_kwargs": {}
    }
  },
  "information_extraction_config": [{
    "task_type": "information_extraction",
    "extractor_type": "llm",
    "kwargs": {},
    "num_retries": 3,
    "query_timeout": 50000,
    "llm_config": {
      "max_tokens": 8192,
      "temperature": 0,
      "context_window": 128000,
      "stop": ["---", "</output_format>"]
    }
  }]
}

Performance Tuning

| Parameter | Default | Description |
|---|---|---|
| list_file_parallelism | 5 | Parallel tasks for discovering files from document stores |
| content_reader_parallelism | 10 | Parallel tasks for reading and parsing documents (most impactful setting) |
| content_indexer_parallelism | 5 | Parallel tasks for indexing into Stardog's vector store |

Higher parallelism increases pressure on data source APIs and Stardog. Monitor for rate limit errors and resource utilization (CPU, memory, disk I/O).

Large document sets (10K+ documents):

{
  "list_file_parallelism": 10,
  "content_reader_parallelism": 50,
  "content_indexer_parallelism": 5
}

Memory-constrained environments:

{
  "list_file_parallelism": 2,
  "content_reader_parallelism": 3,
  "content_indexer_parallelism": 2
}

Document Store Type

| Store | document_store_type Value | Auth Required |
|---|---|---|
| Google Drive | google_drive | Service account JSON (base64) |
| Dropbox | dropbox | OAuth token (base64) |
| OneDrive | onedrive | OAuth credentials (base64) |
| SharePoint | sharepoint | OAuth credentials (base64) |
| Amazon S3 | s3 | AWS credentials (base64) |
| Local | local | None |

extract_information

Set to true to enable LLM-based entity and relationship extraction for building a knowledge graph. Default: false.

When enabled, the system extracts entities and relationships based on the ontology defined in the model specified in the API request, storing them as RDF triples in Stardog. The extraction is schema-driven — it does not create new entity types but extracts instances of types defined in your ontology. This increases processing time by 5-20x and incurs additional LLM API costs per chunk. Requires information_extraction_config to be configured with LLM settings.
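
For example, a minimal job_config with extraction enabled might look like the following (the LLM settings are illustrative; see information_extraction_config below for the full set of options):

{
  "document_store_type": "google_drive",
  "extract_information": true,
  "information_extraction_config": [{
    "task_type": "information_extraction",
    "extractor_type": "llm",
    "llm_config": {
      "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
      "llm_provider": "bedrock",
      "max_tokens": 8192,
      "temperature": 0,
      "context_window": 128000
    }
  }]
}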

store_list_file_config

Controls file discovery behavior.

| Field | Type | Default | Description |
|---|---|---|---|
| page_size | integer | 100 | Files fetched per API call (typical range: 50-200) |
| recursive | boolean | true | Scan subdirectories |
| document_types | array | ["document", "pdf"] | File types to process ("document" = DOCX) |
| loader_kwargs | object | {} | Store-specific options |

store_content_loader_config

Controls file content fetching behavior.

| Field | Type | Default | Description |
|---|---|---|---|
| num_retries | integer | 2 | Retry attempts for failed downloads (2-3 recommended) |
| store_loader_kwargs | object | {} | Store-specific options |

For S3 data sources, you can pass streaming_threshold_mb (default: 20) in store_loader_kwargs to control the file size threshold above which S3 objects are streamed rather than downloaded fully into memory.
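
For example, to raise the streaming threshold to 50 MB:

{
  "store_content_loader_config": {
    "num_retries": 2,
    "store_loader_kwargs": {
      "streaming_threshold_mb": 50
    }
  }
}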

document_loader_config

Defines parsing and chunking strategy per document type. Configure separately for "pdf" and "document" (DOCX):

| Field | Type | Default | Description |
|---|---|---|---|
| chunk_size | integer | 1000 | Maximum characters per chunk (300-500 for precision, 800-1200 balanced, 1500-2000 for speed) |
| chunking_enabled | boolean | true | Enable text chunking (disable only for very small documents) |
| chunk_separator | array | ["\n\n", "\n", ". ", " ", ""] | Priority-ordered separators for splitting text |
| chunk_overlap | integer | 0 | Characters to overlap between chunks (100-200 for general, 300-500 for complex documents) |
| loader_type | string | varies | Parser type: "py_pdf" for PDFs, "DocxLoader" for DOCX |
| loader_kwargs | object | {} | Parser-specific parameters |

You can use different settings for PDFs vs. DOCX:

{
  "document_loader_config": {
    "pdf": {
      "chunk_size": 1200,
      "chunk_overlap": 200
    },
    "document": {
      "chunk_size": 800,
      "chunk_overlap": 100
    }
  }
}

information_extraction_config

Configures entity and relationship extraction. Required when extract_information: true.

Task Configuration Fields

| Field | Type | Default | Description |
|---|---|---|---|
| task_type | string | "information_extraction" | Type of extraction task |
| extractor_type | string | "llm" | Extractor implementation: llm, spacy, or nltk (see below) |
| kwargs | object | {} | Optional. Advanced IE parameters for fine-tuning (see Advanced IE Configuration) |
| llm_config | object | required | LLM model configuration |
| num_retries | integer | 3 | Retry attempts for failed operations |
| query_timeout | integer | 50000 | Timeout in milliseconds for Stardog queries (schema fetching for LLM IE, search queries for SpaCy/NLTK) |

Extractor Types

  • llm: LLM-based extraction using configured LLM provider (most flexible, requires LLM config)
  • spacy: SpaCy NER model detects entities, then LLM maps entity types using an internal job-specific cache to minimize LLM calls
  • nltk: NLTK NER (lightweight, no LLM required)

LLM Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| max_tokens | integer | 8192 | Maximum tokens in LLM response |
| temperature | float | 0.0 | Sampling temperature (0.0 = deterministic) |
| context_window | integer | 128000 | Context window size; match to your LLM's actual window |
| stop | array | ["---", "</output_format>"] | Stop sequences for generation |
| repetition_penalty | float | 1.0 | Penalty for repeated tokens |
| top_p | float | - | Nucleus sampling parameter |
| top_k | integer | - | Top-k sampling parameter |
| llm_name | string | - | Model identifier (provider-specific) |
| llm_provider | string | - | LLM provider identifier |

Contact Stardog support to get the list of currently supported LLM providers and their available models.

Advanced IE Configuration (kwargs)

All kwargs parameters are optional and not required for basic information extraction. The defaults work well for most use cases. Consider tuning these only after running an initial extraction and reviewing the results.

| Key | Type | Default | Description |
|---|---|---|---|
| bites_ie_entity_scope | object | {} | Controls IRI uniqueness scope for extracted entities |
| bites_ie_instructions | string | "" | Custom domain-specific instructions injected into the IE prompt |
| bites_ie_max_chunks_per_llm_call | int | auto | Maximum number of sequential chunks batched into a single LLM call for IE (auto-calculated from the context window) |
| bites_ie_schema_sample_count | int | 8 | Number of chunks sampled from a document to estimate schema token usage when sizing batches for the context window |
| bites_ie_enable_entity_resolution | bool | false | Enable document-level entity resolution |
| bites_ie_er_max_entities_per_call | int | 50 | Maximum entities per entity resolution call |
| bites_ie_er_instructions | string | "" | Custom instructions for entity resolution |

Entity Scope

By default, all entities use GLOBAL_SCOPE — the same entity text produces the same IRI across all documents, which means entities are automatically linked across your entire document set. Use DOCUMENT_SCOPE when an entity type is only meaningful within a single document (e.g., “Agreement” or “Contract” where each document has its own distinct instance).

| Scope | Description |
|---|---|
| GLOBAL_SCOPE | IRIs are globally unique: the same entity text produces the same IRI across all documents (default). Best for entities like people, organizations, and locations that span multiple documents |
| DOCUMENT_SCOPE | IRIs include a document ID suffix: entities are unique per document. Best for document-specific concepts like agreements, contracts, or service terms |

Example:

"bites_ie_entity_scope": {
  "GLOBAL_SCOPE": ["Party", "Company"],
  "DOCUMENT_SCOPE": ["Agreement", "Service"]
}
Custom IE Instructions

Use bites_ie_instructions to inject domain-specific guidance into the extraction prompt. This helps the LLM focus on the most relevant entities and relationships for your use case.

We recommend an iterative approach: first run extraction on a sample document without custom instructions, review the results, then add instructions to address gaps or improve focus. Instructions can be extensive — from a single sentence to multiple paragraphs covering specific extraction rules, entity definitions, or relationship patterns relevant to your domain.

Example: "Focus on extracting financial terms and party relationships. When encountering contract clauses, extract the clause type, parties involved, and any monetary values or dates mentioned."

Entity Resolution

If your documents contain different text representations of the same real-world entity (e.g., “IBM”, “International Business Machines”, “IBM Corp.”), entity resolution can help. When enabled via bites_ie_enable_entity_resolution: true, the system performs document-level entity resolution to merge these duplicate mentions into unified “golden entities,” reducing redundancy in the knowledge graph.

Entity resolution adds one additional LLM call per document. Only enable this when you expect significant entity duplication in your documents and the merging benefit justifies the additional cost.

Tune with bites_ie_er_max_entities_per_call (default 50) to control batch size. Use bites_ie_er_instructions to provide domain-specific merging guidance — as with bites_ie_instructions, we recommend an iterative approach: run with entity resolution enabled but without custom instructions first, review the merging results, then add instructions to correct any gaps (e.g., rules for when abbreviations should or should not be merged). Instructions can be extensive.

Batch Processing

For information extraction, sequential chunks from a document are grouped into batches and sent in a single LLM call. The bites_ie_max_chunks_per_llm_call parameter caps the number of chunks per batch. By default it is auto-calculated from the context_window size to maximize throughput. Override it manually if you need to reduce the number of LLM calls or manage memory usage.

Complete IE Configuration Example

{
  "information_extraction_config": [{
    "task_type": "information_extraction",
    "extractor_type": "llm",
    "kwargs": {
      "bites_ie_entity_scope": {
        "GLOBAL_SCOPE": ["Party", "Company"],
        "DOCUMENT_SCOPE": ["Agreement", "Service"]
      },
      "bites_ie_instructions": "Focus on extracting financial terms and party relationships.",
      "bites_ie_max_chunks_per_llm_call": 5,
      "bites_ie_schema_sample_count": 8,
      "bites_ie_enable_entity_resolution": true,
      "bites_ie_er_max_entities_per_call": 50,
      "bites_ie_er_instructions": "Merge entities that refer to the same real-world entity."
    },
    "num_retries": 3,
    "query_timeout": 50000,
    "llm_config": {
      "max_tokens": 8192,
      "temperature": 0,
      "context_window": 128000,
      "stop": ["---", "</output_format>"],
      "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
      "llm_provider": "bedrock"
    }
  }]
}

Information extraction significantly increases processing time and costs due to LLM API calls per chunk. Budget accordingly for large document sets.

Start with default values, monitor performance, and gradually increase parallelism. Always test with a small subset before indexing your entire dataset.


Querying Indexed Documents

Once documents are indexed, you can query them through the Voicebox UI.

Vector Search Queries

Documents indexed without information extraction can be queried using natural language questions.

Voicebox UI

Example Questions:

  • “What are the main points in the Q4 financial report?”
  • “Summarize the product requirements document for Project Alpha”
  • “What were the action items from the last board meeting?”

Source Attribution

Voicebox provides source attribution for answers derived from indexed documents.

UI Hover

Hover over “Document Extracted” text to see:

  • Source file name
  • Page number (for PDFs) or section
  • Relevance score
  • Direct link to original document (if available)

Knowledge Graph Queries

If information extraction was enabled during indexing, you can ask questions that leverage the knowledge graph.

Voicebox UI KG Example

Example Questions:

  • “Who are the key people mentioned in relation to Project Apollo?”
  • “What organizations are connected to the merger discussion?”
  • “Show me all locations mentioned in the travel policy documents”

Knowledge Graph Benefits:

  • Discover relationships across multiple documents
  • Find entities and their connections
  • Ask graph-based questions (e.g., “What connects X and Y?”)

Source Lineage

Knowledge graph queries provide lineage showing which documents contributed to the answer.

Source Lineage

Lineage Information:

  • Source documents for each extracted entity
  • Document provenance chain

Deployment

This section provides detailed instructions for deploying BITES in your Kubernetes environment.

Deployment Architecture

Deployment Diagram

Components:

  1. Launchpad: Provides APIs and user interface
  2. voicebox-service: Manages job lifecycle, interacts with Spark Operator
  3. Spark Operator: Kubernetes operator for managing Spark applications
  4. voicebox-bites: Docker image containing the indexing application
  5. Spark Cluster: Dynamically created driver and executor pods
  6. Stardog: Database for indexed content and knowledge graphs

Prerequisites

Before deploying BITES, ensure:

  • Kubernetes cluster (version 1.19+)
  • kubectl configured to access your cluster
  • Helm 3.x installed (for Spark Operator installation)
  • Spark Operator installed in the cluster
  • Access to voicebox-bites Docker image
  • Stardog instance deployed and accessible from Kubernetes cluster

Step 1: Install Spark Operator

The Spark Operator manages the lifecycle of Spark applications in Kubernetes.

Installation via Helm:

# Add the Spark Operator Helm repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator

# Update Helm repositories
helm repo update

# Install Spark Operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \
  --set sparkJobNamespace=default

Verify Installation:

kubectl get pods -n spark-operator

You should see the spark-operator pod running.

Alternative Installation Methods:

See the official Spark Operator documentation for other installation options.

Step 2: Configure Docker Image Access

The voicebox-bites image must be accessible from your Kubernetes cluster.

Option 1: Pull from Stardog JFrog Registry

Request access to the Stardog JFrog registry and configure an image pull secret:

kubectl create secret docker-registry stardog-jfrog-secret \
  --docker-server=stardog-stardog-apps.jfrog.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_PASSWORD \
  --docker-email=YOUR_EMAIL \
  --namespace=default

Option 2: Push to Your Private Registry

  1. Pull the image from Stardog:
    docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current
    
  2. Tag and push to your registry:
    docker tag stardog-stardog-apps.jfrog.io/voicebox-bites:current \
      your-registry.com/voicebox-bites:current
    
    docker push your-registry.com/voicebox-bites:current
    
  3. Update image reference in vbx_bites_kube_config.yaml

Step 3: Configure vbx_bites_kube_config.yaml

The vbx_bites_kube_config.yaml file defines the Spark application specification.

Sample Configuration:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: voicebox-bites-job
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "stardog-stardog-apps.jfrog.io/voicebox-bites:current"
  imagePullPolicy: Always
  imagePullSecrets:
    - stardog-jfrog-secret
  mainApplicationFile: local:///app/src/voicebox_bites/etl/bulk_document_extraction.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "4g"
    labels:
      version: 3.5.0
    serviceAccount: spark-operator
  executor:
    cores: 2
    instances: 3
    memory: "4g"
    labels:
      version: 3.5.0

Key Configuration Sections:

Image Configuration:

image: "your-registry.com/voicebox-bites:current"
imagePullPolicy: Always
imagePullSecrets:
  - your-image-pull-secret

Driver Configuration (controls the Spark driver):

driver:
  cores: 2              # CPU cores for driver
  coreLimit: "2000m"    # Maximum CPU (Kubernetes format)
  memory: "4g"          # Memory allocation
  serviceAccount: spark-operator

Executor Configuration (controls the Spark executors):

executor:
  cores: 2              # CPU cores per executor
  instances: 3          # Number of executor pods
  memory: "4g"          # Memory per executor

Sizing Guidelines:

| Dataset Size | Files | Executor Instances | Executor Memory | Executor Cores |
|---|---|---|---|---|
| Small | <100 | 2-3 | 4g | 2 |
| Medium | 100-1000 | 4-6 | 8g | 4 |
| Large | 1000-10000 | 8-12 | 16g | 4 |
| Very Large | >10000 | 15-30 | 16g | 4 |

Important: BITES does not support Kubernetes autoscaling. Configure a fixed number of executor instances and do not scale down while jobs are running.

Step 4: Configure voicebox-service

The voicebox-service needs to know where to find the Spark configuration.

Set Environment Variable:

env:
  - name: VBX_BITES_CONFIG_FILE
    value: "/config/vbx_bites_kube_config.yaml"

Mount Configuration File:

volumes:
  - name: bites-config
    configMap:
      name: vbx-bites-config

volumeMounts:
  - name: bites-config
    mountPath: /config

Create ConfigMap:

kubectl create configmap vbx-bites-config \
  --from-file=vbx_bites_kube_config.yaml \
  --namespace=default

Step 5: Configure RBAC

Ensure the voicebox-service has permissions to manage Spark applications.

Create Service Account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator
  namespace: default

Create Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-operator-role
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "delete", "update", "watch"]
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "delete", "update", "watch"]

Create RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-operator-rolebinding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark-operator
    namespace: default
roleRef:
  kind: Role
  name: spark-operator-role
  apiGroup: rbac.authorization.k8s.io

Apply RBAC Configuration:

kubectl apply -f spark-rbac.yaml

Step 6: Configure Networking

Ensure proper network connectivity between components.

Required Connectivity:

  • Launchpad → voicebox-service (API calls)
  • voicebox-service → Kubernetes API (Spark job management)
  • Spark executors → Data sources (Google Drive, S3, etc.)
  • Spark executors → Stardog (indexing)
  • Spark executors → LLM providers (if using content enhancement or information extraction)

Firewall Rules:

  • Allow outbound HTTPS (443) for data source APIs
  • Allow outbound connections to Stardog endpoint
  • Allow outbound connections to LLM provider APIs
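
To spot-check connectivity from inside the cluster, run a temporary pod and issue a request to the Stardog endpoint (the curl image is one common choice, not a requirement):

# Prints an HTTP status code on success; add -k for self-signed certificates
kubectl run net-check --rm -it --restart=Never --image=curlimages/curl -- \
  -sS -o /dev/null -w "%{http_code}\n" https://your-stardog-instance:5820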

Step 7: Deploy voicebox-service

Deploy the voicebox-service with the configured settings. The voicebox-service is a standard Kubernetes deployment that manages the lifecycle of BITES Spark jobs.

Requirements:

  • Configuration: The VBX_BITES_CONFIG_FILE environment variable must point to the Spark configuration file created in Step 4
  • Spark Operator namespace: The Spark Operator must be configured to watch the namespace where voicebox-service creates SparkApplication resources. If voicebox-service creates jobs in a namespace other than where the Spark Operator was installed, ensure the operator’s sparkJobNamespace is set accordingly (see Step 1)
  • Network access: The service must be able to reach the Kubernetes API server to create and manage SparkApplication resources
  • RBAC: The service account must have the permissions configured in Step 5 (create, get, list, delete SparkApplications and pods)
  • Stardog connectivity: The service needs access to the Stardog endpoint for job status tracking
env:
  - name: VBX_BITES_CONFIG_FILE
    value: /config/vbx_bites_kube_config.yaml

Ensure the voicebox-service pod mounts the same ConfigMap created in Step 4 so that VBX_BITES_CONFIG_FILE resolves correctly.

Cluster Sizing Recommendations

Minimum Cluster Size:

  • 3 nodes
  • 4 CPU cores per node
  • 16 GB RAM per node

Recommended Production Cluster:

  • 5-10 nodes
  • 8 CPU cores per node
  • 32 GB RAM per node
  • 100 GB SSD per node (for temporary storage)

Scaling Considerations:

  • Each executor needs dedicated resources
  • Driver pod requires resources
  • Kubernetes system pods consume resources
  • Leave 20-30% capacity headroom

Do not enable cluster autoscaling for nodes running Spark executors. Scale the cluster before starting large jobs and maintain the size throughout job execution.


Logging

Comprehensive logging is essential for monitoring job execution and troubleshooting issues.

Logging Architecture

BITES provides two layers of logging:

  1. Spark Logging: Framework-level logs (job scheduling, task execution, etc.)
  2. BITES Application Logging: Application-level logs (document processing, API calls, etc.)

Spark Logging Configuration

Spark logging is configured using Log4j properties.

Setup Steps

  1. Create log4j.properties:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Reduce verbosity of some packages
log4j.logger.org.apache.spark.storage=WARN
log4j.logger.org.apache.spark.scheduler=WARN
log4j.logger.org.apache.spark.util.Utils=WARN
log4j.logger.org.apache.spark.executor=INFO

Logging Levels:

  • ERROR: Only errors
  • WARN: Warnings and errors
  • INFO: Informational messages (recommended for production)
  • DEBUG: Detailed debug information (use for troubleshooting only)
  • TRACE: Very verbose (not recommended)

DEBUG and TRACE levels generate extremely large log volumes. Use only for troubleshooting specific issues.

  2. Create ConfigMap:
kubectl create configmap spark-log4j-config \
  --from-file=log4j.properties \
  --namespace=default
  3. Update vbx_bites_kube_config.yaml:

Add the following under spec:

spec:
  sparkConf:
    "spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"
    "spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"

  driver:
    configMaps:
      - name: spark-log4j-config
        path: /opt/spark

  executor:
    configMaps:
      - name: spark-log4j-config
        path: /opt/spark
  4. Apply Configuration:
kubectl apply -f vbx_bites_kube_config.yaml

BITES Application Logging

Application logging provides insights into document processing, API interactions, and business logic.

Setup Steps

  1. Create logging.conf:
[loggers]
keys=root,py4j

[logger_py4j]
level=WARN
handlers=nullHandler
qualname=py4j
propagate=0

[handlers]
keys=consoleHandler,nullHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=consoleHandler

[handler_nullHandler]
class=logging.NullHandler
level=CRITICAL
args=()

[handler_consoleHandler]
class=voicebox_bites.logging_setup.FlushingStreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s %(levelname)s [%(job_id)s] %(name)s - %(message)s

Logging Levels:

  • INFO: Recommended for production
  • DEBUG: Detailed processing information (use for troubleshooting)
  2. Create ConfigMap:
kubectl create configmap voicebox-bites-log-config \
  --from-file=logging.conf \
  --namespace=default
  3. Update vbx_bites_kube_config.yaml:

Add the following to both driver and executor sections:

driver:
  volumeMounts:
    - name: vbx-bites-logging-config-volume
      mountPath: /app/etc/logging.conf
      subPath: logging.conf

executor:
  volumeMounts:
    - name: vbx-bites-logging-config-volume
      mountPath: /app/etc/logging.conf
      subPath: logging.conf

# Add under spec.volumes
volumes:
  - name: vbx-bites-logging-config-volume
    configMap:
      name: voicebox-bites-log-config
  4. Apply Configuration:
kubectl apply -f vbx_bites_kube_config.yaml

Custom Log Path:

If you need to use a different path, set the VOICEBOX_BITES_LOG_CONF environment variable:

env:
  - name: VOICEBOX_BITES_LOG_CONF
    value: "/custom/path/logging.conf"

Accessing Logs

View Driver Logs

# Get driver pod name
DRIVER_POD=$(kubectl get pods -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')

# View logs
kubectl logs $DRIVER_POD

# Follow logs in real-time
kubectl logs -f $DRIVER_POD

# Save logs to file
kubectl logs $DRIVER_POD > driver.log

View Executor Logs

# List executor pods
kubectl get pods -l spark-role=executor

# View specific executor logs
kubectl logs voicebox-bites-job-exec-1

# View all executor logs
kubectl logs -l spark-role=executor

# Follow executor logs
kubectl logs -f voicebox-bites-job-exec-1

Centralized Logging

For production deployments with many executors, use centralized logging via a log aggregation stack such as ELK (Elasticsearch, Logstash, Kibana), Fluentd, Grafana Loki, or your cloud provider’s native logging service.

BITES prefixes every application log message with [vbx-bites][component]. Filter on [vbx-bites] to match all BITES application logs:

*[vbx-bites]* AND level:ERROR
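
Without an aggregation stack, you can approximate the same filter with kubectl and grep:

# Show only BITES application errors from executor pods
kubectl logs -l spark-role=executor | grep "\[vbx-bites\]" | grep ERROR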

Troubleshooting

This section covers common issues and their solutions.

Permission Issues

Symptom

Error: Failed to create SparkApplication: User cannot create resource "sparkapplications"

Diagnosis

Check if the service account has the required permissions:

# Check permissions
kubectl auth can-i create sparkapplications --as=system:serviceaccount:default:spark-operator

# Check current role bindings
kubectl get rolebindings -o wide | grep spark-operator

Solution

Ensure proper RBAC configuration:

# Verify service account exists
kubectl get serviceaccount spark-operator

# Verify role exists and has correct permissions
kubectl describe role spark-operator-role

# Verify role binding
kubectl describe rolebinding spark-operator-rolebinding

# If missing, apply RBAC configuration
kubectl apply -f spark-rbac.yaml

See Step 5: Configure RBAC for complete RBAC configuration.

Data Source Authentication Errors

Google Drive: Authentication Failed

Symptom:

ERROR: Authentication failed: Invalid credentials

Common Causes:

  1. Service account JSON is not properly base64 encoded
  2. Service account doesn’t have access to the folder
  3. API not enabled in Google Cloud project

Solutions:

  1. Verify base64 encoding:
    # Encode correctly
    cat service-account.json | base64 -w 0
    
    # Test decoding
    echo "YOUR_BASE64_STRING" | base64 -d | jq .
    
  2. Share folder with service account:
    • Copy client_email from service account JSON
    • Share the Google Drive folder with this email
    • Grant at least “Viewer” permissions
  3. Enable Google Drive API (a CLI alternative follows this list):
    • Go to Google Cloud Console
    • Navigate to “APIs & Services” → “Library”
    • Search for “Google Drive API”
    • Click “Enable”
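
Alternatively, the same API can be enabled with the gcloud CLI (the project ID is illustrative):

gcloud services enable drive.googleapis.com --project=your-project-id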

OneDrive/SharePoint: Invalid Client

Symptom:

ERROR: AADSTS7000215: Invalid client secret provided

Solutions:

  1. Regenerate client secret:
    • Secrets expire after a configured period
    • Go to Azure Portal → App registrations → Your app → Certificates & secrets
    • Create a new secret
    • Update the credentials JSON and re-encode it (see the sketch after this list)
  2. Verify permissions:
    • Check that all required permissions are granted
    • Ensure admin consent has been provided
    • Wait 5-10 minutes after granting permissions for the changes to propagate
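
A minimal re-encoding sketch, assuming the updated credentials are saved as onedrive-credentials.json (the filename is illustrative):

# Re-encode the rotated credentials for the job request
cat onedrive-credentials.json | base64 -w 0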

S3: Access Denied

Symptom:

ERROR: Access Denied (Service: Amazon S3; Status Code: 403)

Solutions:

  1. Verify IAM permissions:
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
    
  2. Check bucket policy:
    • Ensure the bucket policy doesn’t deny access
    • Verify the IAM role/user is allowed
  3. Verify region:
    • Ensure region_name in credentials matches bucket region
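
The permission and region checks above can be exercised together with the AWS CLI (bucket name and region are illustrative):

# Confirm the credentials can list the bucket in the expected region
aws s3 ls s3://your-bucket --region us-east-1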

Job Execution Issues

Job Stuck in SUBMITTED State

Symptom: Job status remains “SUBMITTED” for an extended period.

Diagnosis:

# Check Spark Operator logs
kubectl logs -n spark-operator -l app=spark-operator

# Check pending pods
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Describe pending pods
kubectl describe pod $DRIVER_POD_NAME

Common Causes:

  1. Insufficient cluster resources
  2. Image pull errors
  3. RBAC issues

Solutions:

  1. Insufficient resources:
    # Check node resources
    kubectl describe nodes
    
    # Solution: Scale cluster or reduce resource requests
    
  2. Image pull errors:
    # Check events
    kubectl get events --sort-by='.lastTimestamp'
    
    # Solution: Verify image pull secret and image URL
    

Job Fails with Out of Memory (OOM) Error

Symptom:

ERROR: Executor lost: OutOfMemoryError: Java heap space

Solutions:

  1. Increase executor memory:
    executor:
      memory: "8g"  # Increase from 4g
    
  2. Reduce parallelism:
    {
      "content_reader_parallelism": 10  // Reduce from 30
    }
    
  3. Reduce batch size:
    {
      "batch_size": 500  // Reduce from 1000
    }
    
  4. Increase the number of executors to distribute the load:
    executor:
      instances: 6  # Increase from 3
      memory: "4g"  # Keep same memory per executor
    
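
When tuning these settings, keep in mind that Spark on Kubernetes requests a memory overhead on top of the executor heap (by default roughly 10% of executor memory with a 384 MiB floor, and more for non-JVM workloads), so each pod needs more than the memory value alone:

# Rough pod-memory estimate with the default ~10% overhead factor
# 6 executors x (4g heap + ~0.4g overhead) ~= 26.4 GB of cluster memory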

Job Fails with Rate Limit Errors

Symptom:

WARN: Rate limit exceeded for Google Drive API

Solutions:

  1. Reduce parallelism:
    {
      "list_file_parallelism": 3,
      "content_reader_parallelism": 5
    }
    

  2. Request a quota increase from the data source provider.

Indexing Issues

Documents Not Appearing in Voicebox

Diagnosis Steps:

  1. Verify job completed successfully:
    curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/$JOB_ID" \
      -H "Authorization: Bearer YOUR_API_KEY"
    
  2. Check driver logs for indexing confirmation:
    kubectl logs $DRIVER_POD | grep "Indexed"
    
  3. Verify Stardog contains data:
    SELECT (COUNT(*) as ?count) WHERE {
      ?s ?p ?o
    }
    

Solutions:

  1. Job failed silently:
    • Check logs for errors
    • Rerun job with DEBUG logging
  2. Wrong database or graph:
    • Verify Stardog connection details
    • Check that Voicebox is querying the correct database/graph
  3. Stardog connectivity issue:
    • Test connectivity from Spark executor to Stardog
    • Check firewall rules

Logging Issues

No Logs Appearing

Diagnosis:

# Check if pods exist
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Check pod status
kubectl describe pod $POD_NAME

# Check if ConfigMaps mounted correctly
kubectl exec $DRIVER_POD -- ls -la /opt/spark
kubectl exec $DRIVER_POD -- ls -la /app/etc

Solutions:

  1. ConfigMap not mounted:
    • Verify ConfigMap exists: kubectl get configmap
    • Check volumeMounts in pod spec
    • Verify path in VOICEBOX_BITES_LOG_CONF
  2. Wrong log level:
    • Check logging configuration
    • Ensure not set to ERROR or CRITICAL only
  3. Logs going to wrong destination:
    • Verify log handler configuration
    • Check stdout/stderr redirection

Logs Too Verbose

Solution:

Change the log level from DEBUG to INFO:

For Spark Logs (log4j.properties):

log4j.rootCategory=INFO, console

For BITES Logs (logging.conf):

[logger_root]
level=INFO

Network Issues

Cannot Access Data Source

Symptom:

ERROR: Connection timeout when accessing data source

Solutions:

  1. Firewall blocking outbound connections:
    • Allow HTTPS (443) egress
    • Add data source domains to allow list
  2. Network policy blocking traffic:
    • Review network policies
    • Add exception for voicebox-bites pods
  3. DNS resolution issues:
    • Check DNS configuration in cluster
    • Verify CoreDNS is functioning
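
To narrow down which of these is the cause, test DNS resolution and HTTPS egress from the driver pod directly (the googleapis.com endpoint is illustrative; substitute your data source's domain):

kubectl exec $DRIVER_POD -- curl -sI https://www.googleapis.com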

Cannot Connect to Stardog

Symptom:

ERROR: Connection refused: Stardog endpoint

Diagnosis:

# Test from driver pod
kubectl exec $DRIVER_POD -- curl -v http://stardog:5820/

# Check Stardog service
kubectl get svc stardog

Solutions:

  1. Stardog not accessible:
    • Verify Stardog is running
    • Check service endpoint
    • Verify network policies
  2. Wrong endpoint:
    • Check Stardog connection configuration
    • Verify port (default 5820)

Getting Help

If you cannot resolve an issue:

  1. Collect diagnostic information (a collection sketch follows this list):
    • Job ID
    • Driver and executor logs
    • Spark Operator logs
    • Job configuration
    • Error messages
  2. Check the documentation: review the configuration and deployment sections above for the component involved.
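
A minimal sketch of that collection, using the label selectors from earlier in this guide:

# Gather the key logs into one directory before requesting help
mkdir -p diagnostics
kubectl logs $DRIVER_POD > diagnostics/driver.log
kubectl logs -l spark-role=executor > diagnostics/executors.log
kubectl logs -n spark-operator -l app=spark-operator > diagnostics/spark-operator.log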

Docker Image Availability

Latest Version

Pull the most recent voicebox-bites image. The :current tag always points to the latest release:

docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current

Specific Versions

Pull a specific version. The current release is v0.3.0:

docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:v0.3.0

Additional Resources