
Using Unstructured Data with Voicebox

A comprehensive guide for developers to integrate and manage unstructured data with Stardog Voicebox using the BITES (Blob Indexing and Text Enrichment with Semantics) system.

Page Contents
  1. Overview
    1. What is BITES?
    2. Supported Data Sources
    3. Supported Document Formats
    4. Key Capabilities
    5. Architecture Overview
  2. Quick Start
    1. Prerequisites Checklist
    2. 5-Minute Setup (Google Drive)
  3. Prerequisites and Requirements
    1. Data Source Configuration
      1. Google Drive
      2. OneDrive
      3. Microsoft SharePoint
      4. Dropbox
      5. Amazon S3
      6. Local Storage
    2. API Key Generation
    3. Stardog Connection Configuration
      1. Option 1: Stardog Username and Password
      2. Option 2: SSO Authentication (Azure/Okta/Ping)
    4. LLM Provider Configuration
      1. Required Environment Variables by Provider
        1. AWS Bedrock
        2. Fireworks AI
        3. OpenAI
        4. Azure OpenAI
      2. Configuring Environment Variables in Kubernetes
    5. Deployment Prerequisites
  4. API Reference
    1. Authentication
    2. Base URL
    3. API Endpoints Overview
    4. Initiate Indexing Job
      1. Endpoint
      2. Request Headers
      3. Request Body Parameters
      4. Credentials Format by Data Source
      5. Minimal Request Example
      6. Complete Request Example with All Options
      7. Response
      8. Response Fields
    5. Get Job Status
      1. Endpoint
      2. Path Parameters
      3. Query Parameters
      4. Request Example
      5. Response
      6. Response Fields
      7. Status Codes
      8. Polling Recommendations
    6. Cancel Job
      1. Endpoint
      2. Path Parameters
      3. Request Body Parameters
      4. Request Example
      5. Response
      6. Response Fields
  5. Document Indexing Pipeline
    1. Pipeline Stages
      1. 1. List Directories
      2. 2. List Supported Files
      3. 3. Fetch Content and Metadata
      4. 4. Parse and Chunk Content
      5. 5. Enhance Content (Optional)
      6. 6. Information Extraction (Optional)
      7. 7. Index Chunks
    2. Pipeline Performance Considerations
  6. Job Configuration
    1. Configuration Structure
    2. Scalability Configuration
      1. list_file_parallelism
      2. content_reader_parallelism
      3. content_indexer_parallelism
    3. Functional Configuration
      1. document_store_type
      2. enhance_content
      3. extract_information
      4. store_list_file_config
      5. store_content_loader_config
      6. document_loader_config
      7. information_extraction_config
    4. Complete Configuration Example
    5. Configuration Best Practices
  7. Querying Indexed Documents
    1. Vector Search Queries
    2. Source Attribution
    3. Knowledge Graph Queries
    4. Source Lineage
  8. Deployment
    1. Deployment Architecture
    2. Prerequisites
    3. Step 1: Install Spark Operator
    4. Step 2: Configure Docker Image Access
    5. Step 3: Configure vbx_bites_kube_config.yaml
    6. Step 4: Configure voicebox-service
    7. Step 5: Configure RBAC
    8. Step 6: Configure Networking
    9. Step 7: Deploy voicebox-service
    10. Cluster Sizing Recommendations
  9. Logging
    1. Logging Architecture
    2. Spark Logging Configuration
      1. Setup Steps
    3. BITES Application Logging
      1. Setup Steps
    4. Accessing Logs
      1. View Driver Logs
      2. View Executor Logs
    5. Centralized Logging
  10. Troubleshooting
    1. Permission Issues
      1. Symptom
      2. Diagnosis
      3. Solution
    2. Data Source Authentication Errors
      1. Google Drive: Authentication Failed
      2. OneDrive/SharePoint: Invalid Client
      3. S3: Access Denied
    3. Job Execution Issues
      1. Job Stuck in SUBMITTED State
      2. Job Fails with Out of Memory (OOM) Error
      3. Job Fails with Rate Limit Errors
    4. Indexing Issues
      1. Documents Not Appearing in Voicebox
    5. Logging Issues
      1. No Logs Appearing
      2. Logs Too Verbose
    6. Network Issues
      1. Cannot Access Data Source
      2. Cannot Connect to Stardog
    7. Getting Help
  11. Docker Image Availability
    1. Latest Version
    2. Specific Versions
  12. Additional Resources

Overview

BITES (Blob Indexing and Text Enrichment with Semantics) is Stardog Voicebox’s unstructured data support system. It enables ingestion of documents from various cloud storage providers and local sources, allowing users to query both structured and unstructured data through Voicebox’s conversational AI interface.

What is BITES?

BITES provides an API-first approach to indexing and querying unstructured documents alongside your structured data in Stardog. The system leverages Apache Spark for distributed processing and integrates with your existing Kubernetes infrastructure.

Supported Data Sources

  • Google Drive - Cloud document storage
  • Microsoft OneDrive - Personal and business cloud storage
  • Microsoft SharePoint - Enterprise document management (Document Library only)
  • Dropbox - Cloud file storage
  • Amazon S3 - Object storage
  • Local Storage - File system accessible from Kubernetes environment

Supported Document Formats

  • Microsoft Word (DOCX)
  • PDF

The system currently supports parsing and indexing of textual and tabular data. Image parsing within documents is planned for a future release.

Key Capabilities

  • Data Ingestion: Automated ingestion from multiple data source types
  • Unified Querying: Query both structured and unstructured data through a single Voicebox interface
  • API-First Design: All functionality accessible through Launchpad’s public APIs
  • Distributed Processing: Spark-based job execution in your Kubernetes environment
  • Job Management: Full lifecycle management including status monitoring and cancellation
  • Vector Indexing: Document chunks indexed in Stardog’s vector store for semantic search
  • Knowledge Graph Creation: Optional extraction of entities and relationships from documents to build knowledge graphs

Beta Features: Information extraction and knowledge graph creation are currently in Beta. These features enable extraction of structured entities and relationships from unstructured text.

Architecture Overview

Architecture Diagram

System Flow:

  1. User initiates data ingestion and indexing via Launchpad’s public APIs
  2. Voicebox service creates and submits a Spark job to the Kubernetes cluster
  3. Spark job processes documents: reads from source, parses content, chunks text
  4. Processed chunks are indexed in Stardog’s vector store
  5. (Optional) If information extraction is enabled, entities and relationships are extracted and stored as a knowledge graph
  6. User queries Voicebox, which retrieves answers from both structured data and indexed documents

Quick Start

This section provides a fast-track setup example for indexing Google Drive documents.

Prerequisites Checklist

  • Kubernetes cluster with Spark Operator installed
  • Voicebox service and voicebox-bites containers running
  • Launchpad API key generated
  • Data source credentials configured
  • Stardog connection configured

5-Minute Setup (Google Drive)

  1. Configure Google Drive (see Google Drive Configuration)
  2. Get API Key from Launchpad → “Manage API Keys”
  3. Base64 encode your Google service account JSON
  4. Call the API (see example below)
  5. Monitor job using the job_id returned
# Example: Initiate indexing job
curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "your-folder-id",
    "credentials": "BASE64_ENCODED_SERVICE_ACCOUNT_JSON",
    "job_name": "my-indexing-job",
    "job_config": {
      "document_store_type": "google_drive",
      "enhance_content": false,
      "extract_information": false
    }
  }'
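The same request can be made programmatically. The sketch below covers steps 3 and 4 in one pass and prints the job_id used in step 5; it assumes the requests package is installed, and the URL, API key, folder ID, and key-file path are placeholders for your own values.

# Minimal sketch: base64-encode the service account key and create the job
import base64
import requests

with open("service-account.json", "rb") as f:
    credentials = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "https://your-launchpad-url/api/v1/voicebox/bites/jobs",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "directory": "your-folder-id",
        "credentials": credentials,
        "job_name": "my-indexing-job",
        "job_config": {
            "document_store_type": "google_drive",
            "enhance_content": False,
            "extract_information": False,
        },
    },
)
print(resp.json())  # contains the job_id used to monitor the job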

Prerequisites and Requirements

Data Source Configuration

Before using BITES APIs, you must configure access credentials for your data sources. Each provider requires specific setup steps.

Google Drive

Setup Steps:

  1. Navigate to Google Cloud Console
  2. Create a new project or select an existing one
  3. Enable the Google Drive API
  4. Go to “APIs & Services” → “OAuth consent screen” and configure
  5. Go to “Credentials” → “Create Credentials” → “Service Account”
  6. Create a Service Account with appropriate permissions
  7. In the Service Account details, go to the “Keys” tab
  8. Click “Add Key” → “Create new key” → “JSON”
  9. Download the JSON key file

Required API Scope:

Google Drive Scope

Add the following scope: https://www.googleapis.com/auth/drive.readonly

IAM Configuration:

Google Drive Key Setup

Service Account JSON Structure:

{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "your-private-key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n",
  "client_email": "your-service-account@your-project.iam.gserviceaccount.com",
  "client_id": "your-client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-service-account",
  "universe_domain": "googleapis.com"
}

Security Note: Store service account credentials securely. Never commit them to version control. Use environment variables or secret management systems.

Sharing Documents:

To allow the service account to access specific folders:

  1. Copy the client_email from the service account JSON
  2. Share the Google Drive folder with this email address
  3. Grant appropriate permissions (Viewer for read-only access); the sketch below can be used to verify access
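To confirm the share worked before launching a job, a small verification sketch can list the folder's contents. This is an optional check, not part of BITES; it assumes the google-api-python-client and google-auth packages, and the key path and folder ID are placeholders.

# Optional access check: list files the service account can see in the folder
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=creds)
files = drive.files().list(
    q="'your-folder-id' in parents",
    fields="files(id, name, mimeType)",
    pageSize=10,
).execute()
print(files.get("files", []))  # an empty list usually means the folder was not shared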

OneDrive

Setup Steps:

  1. Go to Azure Portal
  2. Navigate to “Azure Active Directory” → “App registrations”
  3. Click “New registration”
  4. Provide a name and configure redirect URIs if needed
  5. After registration, note the Application (client) ID and Directory (tenant) ID from the overview page
  6. Go to “Certificates & Secrets” → “Client secrets” → “New client secret”
  7. Copy the secret value (not the secret ID)

Required Microsoft Graph Permissions:

  • Files.Read
  • Files.Read.All (Delegated)
  • Files.Read.All (Application)
  • offline_access (Delegated)
  • openid (Delegated)
  • Sites.Read.All (Delegated)
  • User.Read (Delegated)

Azure Permissions

Grant Admin Consent:

After adding permissions, click “Grant admin consent” to approve them for your organization.

Secret Configuration:

Azure Secret

Credentials JSON Structure:

{
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret"
}
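A quick way to validate these values before indexing is to request an app-only token from the Microsoft identity platform. This optional sketch assumes the requests package; the tenant ID, client ID, and secret are placeholders.

# Optional credential check: a successful response returns an app-only Graph token
import requests

tenant_id = "your-tenant-id"
resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
        "grant_type": "client_credentials",
        "scope": "https://graph.microsoft.com/.default",
    },
)
resp.raise_for_status()
print(resp.json()["token_type"], "token acquired")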

Microsoft SharePoint

Only Document Library is currently supported for SharePoint.

Setup Steps:

  1. In the Azure Portal, open Microsoft Entra ID (Azure Active Directory)
  2. Create and register an application in Entra ID
  3. Note the Application (client) ID from the overview page
  4. Go to “Certificates & Secrets” and create a new client secret
  5. Copy the secret value

Required Microsoft Graph Permissions:

  • Files.Read
  • Files.Read.All (Delegated)
  • Sites.Read.All (Delegated)
  • Sites.Read.All (Application)
  • User.Read (Delegated)

Required SharePoint Permissions:

  • AllSites.Read (Delegated)
  • MyFiles.Read (Delegated)
  • Sites.Read.All (Application)
  • Sites.Select.All (Application)
  • User.Read.All (Application)

SharePoint Permissions

Some permissions require M365 administrator approval. Contact your administrator to grant these permissions.

Credentials JSON Structure:

{
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret"
}

Additional Required Information:

When calling the indexing API for SharePoint, you must also provide:

  • host_name: Your SharePoint host (e.g., “yourcompany.sharepoint.com”)
  • site_id: The SharePoint site ID
  • library_name: The document library name

Dropbox

Setup Steps:

  1. Go to Dropbox App Console
  2. Click “Create app”
  3. Choose “Scoped access” and “Full Dropbox” or “App folder”
  4. Provide an app name
  5. Go to the “Permissions” tab and enable required scopes:
    • files.metadata.read
    • files.content.read
  6. Note your App key and App secret

OAuth Authorization Flow:

  1. In your browser, visit:
    https://www.dropbox.com/oauth2/authorize?client_id=<APP_KEY>&token_access_type=offline&response_type=code
    

    Replace <APP_KEY> with your actual app key.

  2. Log in to Dropbox and approve the app

  3. Copy the authorization code from the redirect URL

  4. Exchange the authorization code for tokens:
    curl -X POST https://api.dropboxapi.com/oauth2/token \
      -d code=<AUTHORIZATION_CODE> \
      -d grant_type=authorization_code \
      -d client_id=<APP_KEY> \
      -d client_secret=<APP_SECRET>
    
  5. The response contains:
    • access_token: Current access token
    • refresh_token: Long-lived token for obtaining new access tokens

Credentials JSON Structure:

{
  "access_token": "your-current-access-token",
  "refresh_token": "your-refresh-token",
  "client_id": "your-app-key",
  "client_secret": "your-app-secret"
}

The BITES connector automatically handles token refresh. If the access token expires, it uses the refresh token to obtain a new one.
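As an optional sanity check of the access token, the sketch below lists the root folder through the Dropbox HTTP API. It assumes the requests package; the token is a placeholder.

# Optional access check: list the first few entries in the root folder
import requests

resp = requests.post(
    "https://api.dropboxapi.com/2/files/list_folder",
    headers={"Authorization": "Bearer your-current-access-token"},
    json={"path": "", "recursive": False, "limit": 10},
)
resp.raise_for_status()
print([entry["name"] for entry in resp.json()["entries"]])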

Amazon S3

BITES supports two authentication options for S3: IAM roles (recommended) and access keys.

Option 1: IAM Roles (Recommended for AWS-hosted applications)

  1. Go to AWS Management Console → IAM
  2. Click “Roles” → “Create role”
  3. Select the service that will assume this role (e.g., EC2, EKS)
  4. Attach permissions policies:
    • AmazonS3ReadOnlyAccess (for read-only access)
    • Or create a custom policy with minimal required permissions
  5. Review and create the role
  6. Attach this role to your Kubernetes nodes or pods

Option 2: Access Key and Secret Key

  1. Go to AWS Management Console → IAM
  2. Click “Users” → “Add user”
  3. Provide a username and select “Programmatic access”
  4. Attach permissions policies (e.g., AmazonS3ReadOnlyAccess)
  5. Complete user creation
  6. Download the CSV file containing the Access Key ID and Secret Access Key

Security Best Practice: Use IAM roles when possible. If using access keys, rotate them regularly and store them securely.

Required S3 Permissions:

Your IAM role or user must have the following permissions on the target bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucketVersions"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Bucket Configuration:

  1. Open AWS Console and navigate to S3
  2. Select your bucket
  3. Go to “Permissions” tab
  4. Ensure the IAM role/user has the required permissions listed above

Credentials JSON Structure:

{
  "aws_access_key_id": "your-access-key",
  "aws_secret_access_key": "your-secret-key",
  "region_name": "us-east-1",
  "use_iam_role": false
}

For IAM Role Authentication:

{
  "region_name": "us-east-1",
  "use_iam_role": true
}

Additional Required Information:

When calling the indexing API for S3, you must also provide:

  • bucket: The S3 bucket name in the extra_args parameter
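Before launching a job, you can optionally confirm that the credentials can list the bucket. The sketch below assumes the boto3 package; the bucket name, keys, and region are placeholders, and the key arguments can be omitted when relying on an IAM role.

# Optional permission check: list a few objects from the target bucket
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    region_name="us-east-1",
)
resp = s3.list_objects_v2(Bucket="your-bucket-name", MaxKeys=10)
print([obj["Key"] for obj in resp.get("Contents", [])])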

Local Storage

Local storage allows indexing of files directly accessible from the Kubernetes environment.

Requirements:

  • The directory must be accessible from the Spark executors
  • Use Kubernetes volumes (PersistentVolumes, ConfigMaps, or mounted storage)
  • Ensure appropriate read permissions

Credentials:

No credentials are required for local storage. Pass an empty JSON object (base64 encoded):

{}

Directory Path:

Provide the absolute path to the directory within the container filesystem (e.g., /mnt/data/documents).

API Key Generation

All BITES APIs require authentication using a Launchpad API key.

Steps to Generate API Key:

  1. Log in to Launchpad
  2. Navigate to “Manage API Keys”
  3. Click “Create API Key”
  4. Provide a name and select the database
  5. Copy and securely store the API key

API keys provide full access to your Voicebox instance. Store them securely and never expose them in client-side code or public repositories.

Using the API Key:

Include the API key in the Authorization header of all API requests:

Authorization: Bearer YOUR_API_KEY

Stardog Connection Configuration

BITES jobs need to connect to Stardog to index documents. Two authentication options are supported.

Option 1: Stardog Username and Password

Prerequisites:

  • Configure Stardog to generate authentication tokens
  • Ensure token generation is enabled in Stardog configuration

How it Works:

When you initiate a job, Stardog generates a token using the provided credentials and passes it to the Spark job.

Configuration:

Provide Stardog credentials when creating the connection in Launchpad.

Option 2: SSO Authentication (Azure/Okta/Ping)

Prerequisites:

  • SSO provider configured and integrated with Stardog
  • Valid refresh token from SSO provider
  • SSO provider client ID

How it Works:

When you initiate a job, the system calls the SSO provider to fetch an access token using the refresh token. This access token is then passed to the Spark job.

Configuration:

When calling the indexing API, provide:

  • sso_provider_client_id: Your SSO provider’s client ID
  • refresh_token: A valid refresh token from your SSO provider

Token Expiry Recommendation: Set token expiry based on the expected job duration. For large indexing jobs, we recommend setting the token expiry to 30 days to ensure the job completes without authentication failures.

LLM Provider Configuration

If you plan to use content enhancement (enhance_content: true) or information extraction (extract_information: true), you must configure environment variables for your LLM provider. These variables must be available on both the driver and executor nodes.

Required Environment Variables by Provider

AWS Bedrock
env:
  - name: AWS_ACCESS_KEY_ID
    value: "your-aws-access-key-id"
  - name: AWS_SECRET_ACCESS_KEY
    value: "your-aws-secret-access-key"
  - name: AWS_REGION
    value: "us-east-1"

For AWS Bedrock, ensure your IAM user/role has bedrock:InvokeModel permission for the models you plan to use.

Fireworks AI
env:
  - name: FIREWORKS_API_KEY
    value: "your-fireworks-api-key"
OpenAI
env:
  - name: OPENAI_API_KEY
    value: "your-openai-api-key"
Azure OpenAI
env:
  - name: AZURE_OPENAI_API_KEY
    value: "your-azure-openai-key"
  - name: AZURE_OPENAI_ENDPOINT
    value: "https://your-resource.openai.azure.com/"

Configuring Environment Variables in Kubernetes

Option 1: Kubernetes Secrets (Recommended). Store the provider keys in a Kubernetes Secret and reference it from the driver and executor env sections of vbx_bites_kube_config.yaml (for example, via secretKeyRef or envFrom).

Option 2: Direct Environment Variables (Not Recommended for Production). Set the values directly in the env sections, as shown in the provider examples above.

Deployment Prerequisites

Before running indexing jobs, ensure your Kubernetes environment is properly configured.

Required Components:

  • Kubernetes cluster (version 1.19+)
  • Spark Operator installed in the cluster
  • voicebox-bites Docker image accessible
  • voicebox-service running and configured
  • Network connectivity between Launchpad, voicebox-service, and voicebox-bites

See the Deployment section for detailed setup instructions.


API Reference

All BITES functionality is accessed through RESTful APIs. This section provides complete API documentation with examples.

Authentication

All API requests must include an Authorization header with your Launchpad API key:

Authorization: Bearer YOUR_API_KEY

Base URL

https://your-launchpad-url/api/v1/voicebox/bites

Replace your-launchpad-url with your actual Launchpad instance URL.

API Endpoints Overview

| Endpoint | Method | Description |
| --- | --- | --- |
| /jobs | POST | Initiate a new indexing job |
| /jobs/{job_id} | GET | Get the status of a job |
| /jobs/{job_id}/cancel | POST | Cancel a running job |

Initiate Indexing Job

Creates and starts a new indexing job in the Spark environment.

Endpoint

POST /api/v1/voicebox/bites/jobs

Request Headers

Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

Request Body Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| directory | string | Yes | Directory location or ID. Google Drive: folder ID; OneDrive: folder path; Local: absolute path |
| credentials | string | Yes | Base64-encoded JSON containing data source credentials. See Data Source Configuration for the format |
| job_name | string | Yes | Unique name for the job (used for tracking and management) |
| job_namespace | string | No | Kubernetes namespace for the job. Defaults to the namespace in vbx_bites_kube_config.yaml |
| batch_size | integer | No | Number of chunks to commit at once. Default: 1000. Increase for better performance, decrease if memory constrained |
| job_config | object | Yes | Configuration controlling scalability and functionality. See Job Configuration |
| sso_provider_client_id | string | Conditional | Required for SSO authentication. SSO provider's client ID |
| refresh_token | string | Conditional | Required for SSO authentication. Valid refresh token from the SSO provider |
| extra_args | object | No | Additional arguments specific to the data source type. See below |

Extra Args by Data Source:

| Data Source | Extra Args Required | Example |
| --- | --- | --- |
| OneDrive | one_drive_id | {"one_drive_id": "b!drive_id"} |
| SharePoint | host_name, site_id, library_name | {"host_name": "company.sharepoint.com", "site_id": "site-id", "library_name": "Documents"} |
| S3 | bucket, prefix | {"bucket": "my-documents-bucket", "prefix": "S3 path"} |
| Google Drive | None | - |
| Dropbox | None | - |
| Local | None | - |

Credentials Format by Data Source

Before passing to the API, you must base64-encode the JSON.

Google Drive:

{
  "type": "service_account",
  "project_id": "your-project",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account@project.iam.gserviceaccount.com",
  "client_id": "client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/...",
  "universe_domain": "googleapis.com"
}

OneDrive / SharePoint:

{
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret"
}

Dropbox:

{
  "access_token": "current-access-token",
  "refresh_token": "refresh-token",
  "client_id": "app-key",
  "client_secret": "app-secret"
}

S3:

{
  "aws_access_key_id": "your-access-key",
  "aws_secret_access_key": "your-secret-key",
  "region_name": "us-east-1",
  "use_iam_role": false
}

Local Storage:

{}
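Whichever source you use, the JSON above must be base64-encoded before it is placed in the credentials field. A minimal sketch using only the Python standard library (the dict contents are placeholders):

# Encode a credentials dict for the "credentials" request field
import base64
import json

creds = {"tenant_id": "your-tenant-id", "client_id": "your-client-id",
         "client_secret": "your-client-secret"}
encoded = base64.b64encode(json.dumps(creds).encode("utf-8")).decode("ascii")
print(encoded)  # pass this string as the "credentials" field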

Minimal Request Example

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "1A2B3C4D5E6F7G8H9I",
    "credentials": "eyJ0eXBlIjoic2VydmljZV9hY2NvdW50IiwicHJvamVjdF9pZCI6InlvdXItcHJvamVjdCJ9",
    "job_name": "index-google-drive-docs",
    "job_config": {
      "document_store_type": "google_drive",
      "enhance_content": false,
      "extract_information": false
    }
  }'

Complete Request Example with All Options

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directory": "b!AbCdEf123456",
    "credentials": "eyJ0ZW5hbnRfaWQiOiJ5b3VyLXRlbmFudC1pZCIsImNsaWVudF9pZCI6InlvdXItY2xpZW50LWlkIiwiY2xpZW50X3NlY3JldCI6InlvdXItc2VjcmV0In0=",
    "job_name": "onedrive-quarterly-reports",
    "job_namespace": "voicebox-production",
    "batch_size": 2000,
    "sso_provider_client_id": "your-sso-client-id",
    "refresh_token": "your-refresh-token",
    "extra_args": {
      "one_drive_id": "b!AbCdEf123456"
    },
    "job_config": {
      "list_file_parallelism": 10,
      "content_reader_parallelism": 20,
      "content_indexer_parallelism": 10,
      "document_store_type": "onedrive",
      "enhance_content": true,
      "extract_information": true,
      "store_list_file_config": {
        "page_size": 100,
        "recursive": true,
        "document_types": ["document", "pdf"]
      },
      "store_content_loader_config": {
        "num_retries": 3,
        "store_loader_kwargs": {}
      },
      "document_loader_config": {
        "pdf": {
          "chunk_size": 1000,
          "chunking_enabled": true,
          "chunk_separator": ["\\n\\n", "\\n", " ", ""],
          "chunk_overlap": 200
        },
        "document": {
          "chunk_size": 1000,
          "chunking_enabled": true,
          "chunk_separator": ["\\n\\n", "\\n", " ", ""],
          "chunk_overlap": 200
        }
      },
      "information_extraction_config": [
        {
          "task_type": "information_extraction",
          "kwargs": {},
          "llm_config": {
            "max_tokens": 8192,
            "temperature": 0.0,
            "repetition_penalty": 1.0,
            "top_p": 0.7,
            "top_k": 50,
            "stop": ["---", "</output_format>"],
            "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
            "llm_provider": "bedrock",
            "context_window": 4000
          },
          "num_retries": 3,
          "prompt_version": "v4",
          "extractor_type": "llm",
          "query_timeout": 50000
        }
      ]
    }
  }'

Response

Success Response (HTTP 200):

{
  "job_id": "spark-app-1234567890-abcdef",
  "error": null
}

Error Response (HTTP 400/500):

{
  "job_id": null,
  "error": "Failed to create job: Invalid credentials format"
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| job_id | string or null | Unique identifier for the created job. Use this to check status or cancel the job |
| error | string or null | Error message if job creation failed, null otherwise |

Get Job Status

Retrieves the current status of an indexing job.

Endpoint

GET /api/v1/voicebox/bites/jobs/{job_id}

Path Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_id | string | Yes | Job ID returned when the job was created |

Query Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_namespace | string | No | Kubernetes namespace of the job. Defaults to the namespace in vbx_bites_kube_config.yaml |

Request Example

curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/spark-app-1234567890-abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

Success Response (HTTP 200):

{
  "status_code": "RUNNING",
  "status": "Job is processing documents. Completed 45 of 100 files."
}

Job Not Found (HTTP 404):

{
  "status_code": "UNKNOWN",
  "status": "Job not found"
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| status_code | string | Current state of the job. See status codes below |
| status | string | Human-readable status message with additional details |

Status Codes

| Status Code | Description |
| --- | --- |
| NEW | Job created but not yet submitted to Spark |
| SUBMITTED | Job submitted to the Spark cluster, waiting for resources |
| RUNNING | Job actively processing documents |
| PENDING_RERUN | Job failed and is waiting to be retried |
| INVALIDATING | Job is being invalidated |
| SUCCEEDING | Job is in the process of completing successfully |
| COMPLETED | Job finished successfully |
| ERROR | Job encountered a non-recoverable error |
| FAILING | Job is in the process of failing |
| FAILED | Job failed |
| UNKNOWN | Job status cannot be determined (the job may not exist) |

Polling Recommendations

  • Poll every 10-30 seconds for jobs expected to complete quickly
  • Poll every 1-5 minutes for long-running jobs
  • Stop polling when status is COMPLETED, FAILED, or ERROR (see the sketch below)
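The following sketch implements this polling pattern. It assumes the requests package; the URL, API key, and job ID are placeholders.

# Poll the status endpoint until the job reaches a terminal state
import time
import requests

TERMINAL_STATES = {"COMPLETED", "FAILED", "ERROR"}
job_id = "spark-app-1234567890-abcdef"

while True:
    resp = requests.get(
        f"https://your-launchpad-url/api/v1/voicebox/bites/jobs/{job_id}",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
    body = resp.json()
    print(body["status_code"], "-", body["status"])
    if body["status_code"] in TERMINAL_STATES:
        break
    time.sleep(30)  # 10-30 seconds for short jobs; minutes for long-running jobs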

Cancel Job

Cancels a running or pending indexing job.

Endpoint

POST /api/v1/voicebox/bites/jobs/{job_id}/cancel

Path Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_id | string | Yes | Job ID of the job to cancel |

Request Body Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_name | string | Yes | Name of the job to cancel (must match the name used when creating the job) |
| job_namespace | string | No | Kubernetes namespace of the job. Defaults to the namespace in vbx_bites_kube_config.yaml |

Request Example

curl -X POST "https://your-launchpad-url/api/v1/voicebox/bites/jobs/spark-app-1234567890-abcdef/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "job_name": "index-google-drive-docs"
  }'

Response

Success Response (HTTP 200):

{
  "success": true,
  "error": null
}

Error Response (HTTP 400/500):

{
  "success": false,
  "error": "Job not found or already completed"
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| success | boolean | True if the job was successfully canceled, false otherwise |
| error | string or null | Error message if cancellation failed, null otherwise |

Canceling a job may take a few moments. The Spark operator will gracefully terminate the running executors. Already indexed documents will remain in Stardog.


Document Indexing Pipeline

Understanding the indexing pipeline helps you configure jobs effectively and troubleshoot issues.

Pipeline Diagram

Pipeline Stages

1. List Directories

Purpose: Enumerate all directories within the specified location.

Configuration: Controlled by list_file_parallelism and recursive settings.

Output: List of directories to scan for files.

2. List Supported Files

Purpose: Identify all supported document types (PDF, DOCX) within the directories.

Configuration: Filtered by document_types in store_list_file_config.

Output: List of file paths/IDs with metadata (size, modification date, etc.).

3. Fetch Content and Metadata

Purpose: Download file content from the data source and extract metadata.

Configuration: Controlled by content_reader_parallelism. Retries configured via num_retries.

Metadata Captured:

  • File name and path
  • Creation and modification dates
  • Author information (if available)
  • File size
  • MIME type

Output: Raw file content and associated metadata.

4. Parse and Chunk Content

Purpose: Extract text from documents and split into manageable chunks.

Parsing:

  • PDF: Text and table extraction from PDF files
  • DOCX: Text and table extraction from Word documents

Chunking:

  • Split text based on chunk_size and chunk_separator
  • Apply chunk_overlap to preserve context between chunks
  • Maintain metadata association with each chunk

Configuration: Controlled by document_loader_config.

Output: Array of text chunks with metadata.

5. Enhance Content (Optional)

Purpose: Enrich document content using LLM processing.

When to Use:

  • Documents contain complex tables that need better semantic representation
  • Specialized domain content that benefits from LLM enhancement

Process:

  • Tables are processed by LLM to generate descriptive text
  • Enhanced descriptions improve search relevance

Configuration: Set enhance_content: true in job_config.

Cost Consideration: This step makes additional LLM API calls, increasing processing time and cost.

Output: Enhanced chunks with improved semantic content.

6. Information Extraction (Optional)

Purpose: Extract structured entities and relationships to build a knowledge graph.

When to Use:

  • You want to build a knowledge graph from unstructured documents
  • Need to identify entities (people, organizations, locations) and their relationships
  • Want to enable graph-based queries alongside vector search

Process:

  • LLM analyzes each chunk to identify entities and relationships
  • Extracts triples in RDF format (subject-predicate-object)
  • Links entities across documents

Configuration: Set extract_information: true and configure information_extraction_config.

Output: RDF triples representing extracted knowledge.

Cost Consideration: This step makes additional LLM API calls per chunk, significantly increasing processing time and cost.

7. Index Chunks

Purpose: Store processed chunks and knowledge graph in Stardog.

Indexing Operations:

  • Vector Indexing: Chunks are embedded and stored in Stardog’s vector store
  • Metadata Indexing: File metadata stored for filtering and source attribution
  • Knowledge Graph Storage: Extracted triples stored in specified graph (if information extraction enabled)

Configuration: Controlled by content_indexer_parallelism and batch_size.

Output: Indexed and searchable content in Stardog.

Pipeline Performance Considerations

Bottlenecks:

  1. Data Source API Limits: Providers such as Google Drive and OneDrive enforce API rate limits
  2. Network Bandwidth: Large files take time to download
  3. LLM Processing: Content enhancement and information extraction are slow
  4. Stardog Ingestion: High parallelism can overload Stardog

Optimization Tips:

  • Start with conservative parallelism settings
  • Monitor data source rate limits
  • Use content enhancement and information extraction only when necessary
  • Increase batch_size for better ingestion throughput
  • Scale Stardog appropriately for your indexing load

Job Configuration

The job_config parameter controls both scalability and functionality of the indexing pipeline. This section provides detailed guidance on each configuration option.

Configuration Structure

{
  "list_file_parallelism": <integer>,
  "content_reader_parallelism": <integer>,
  "content_indexer_parallelism": <integer>,
  "document_store_type": "<string>",
  "enhance_content": <boolean>,
  "extract_information": <boolean>,
  "store_list_file_config": { ... },
  "store_content_loader_config": { ... },
  "document_loader_config": { ... },
  "information_extraction_config": [ ... ]
}

Scalability Configuration

These settings control the degree of parallelism in different pipeline stages.

list_file_parallelism

Purpose: Controls parallel fetching of file listings from subdirectories.

Type: Integer

Default: 5

When to Adjust:

  • Increase if you have many subdirectories and fast data source API
  • Decrease if hitting data source rate limits
  • No effect if recursive: false

Recommended Values:

  • Small dataset (<100 files): 5
  • Medium dataset (100-1000 files): 10
  • Large dataset (>1000 files): 15-20

Example:

{
  "list_file_parallelism": 10
}

Higher values increase pressure on data source APIs (e.g., Google Drive API). Monitor for rate limit errors.

content_reader_parallelism

Purpose: Controls parallel downloading and parsing of document content.

Type: Integer

Default: 10

Impact: This is typically the most impactful scalability setting.

When to Adjust:

  • Increase for large datasets with available compute resources
  • Decrease if running out of memory or hitting data source rate limits

Recommended Values by Dataset Size:

  • Small (<100 files): 10
  • Medium (100-1000 files): 20-30
  • Large (>1000 files): 30-50

Memory Considerations:

  • Each parallel reader consumes memory for document content
  • Large PDFs can consume significant memory
  • Ensure Spark executors have sufficient memory: roughly (average file size) × (parallelism) × 2. For example, with 10 MB average files and a parallelism of 30, plan for about 600 MB of executor memory for document content alone

Example:

{
  "content_reader_parallelism": 30
}

content_indexer_parallelism

Purpose: Controls parallel writing of chunks to Stardog vector store.

Type: Integer

Default: 5

When to Adjust:

  • Increase if Stardog can handle higher load and you want faster indexing
  • Decrease if Stardog shows performance degradation or errors

Recommended Values:

  • Default Stardog instance: 5
  • Scaled Stardog instance: 10

Warning: Setting this too high can overwhelm Stardog, causing:

  • Slow response times
  • Connection timeouts
  • Out of memory errors

Example:

{
  "content_indexer_parallelism": 10
}

Monitor Stardog resource utilization (CPU, memory, disk I/O) when increasing this value. Add resources or reduce parallelism if needed.

Functional Configuration

These settings control the features and behavior of the indexing pipeline.

document_store_type

Purpose: Specifies the data source type.

Type: String

Required: Yes

Valid Values:

  • google_drive
  • onedrive
  • sharepoint
  • dropbox
  • s3
  • local

Example:

{
  "document_store_type": "google_drive"
}

enhance_content

Purpose: Enables LLM-based content enhancement for better semantic understanding.

Type: Boolean

Default: false

When to Enable:

  • Documents contain complex tables
  • Search quality is more important than indexing speed

When to Keep Disabled (the default):

  • Simple text documents
  • Cost or speed is a concern
  • Sufficient search quality without enhancement

Impact:

  • Processing Time: Increases by 3-10x depending on content
  • Cost: Additional LLM API calls per table/complex structure
  • Search Quality: Improved for table-heavy documents

Example:

{
  "enhance_content": true
}

extract_information

Purpose: Enables LLM-based entity and relationship extraction to build a knowledge graph.

Type: Boolean

Default: false

When to Enable:

  • Need to build a knowledge graph from documents
  • Want to identify entities and their relationships
  • Require graph-based queries and reasoning
  • Need to link information across multiple documents

When to Keep Disabled (the default):

  • Only need vector search and RAG capabilities in Voicebox
  • Cost or speed is a critical concern
  • Documents don’t contain entity-rich content

Impact:

  • Processing Time: Increases by 5-20x depending on content
  • Cost: Additional LLM API calls per chunk
  • Capabilities: Enables knowledge graph queries and relationship discovery

Output:

  • RDF triples stored in Stardog
  • Entities: People, organizations, locations, concepts
  • Relationships: Connections between entities

Example:

{
  "extract_information": true
}

Information extraction requires information_extraction_config to be properly configured with LLM settings.

store_list_file_config

Purpose: Controls file discovery behavior.

Type: Object

Fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| page_size | integer | 100 | Number of files to fetch per API call (pagination) |
| recursive | boolean | true | Search subdirectories recursively |
| document_types | array | ["document", "pdf"] | File types to include |
| loader_kwargs | object | {} | Additional data-source-specific parameters |

page_size:

  • Increase for faster initial file discovery (if data source supports it)
  • Decrease if API requests timeout
  • Typical range: 50-200

recursive:

  • true: Index all files in subdirectories (most common)
  • false: Only index files in the specified directory

document_types:

  • "document": Microsoft Word (DOCX) files
  • "pdf": PDF files
  • Include both for comprehensive indexing

Example:

{
  "store_list_file_config": {
    "page_size": 150,
    "recursive": true,
    "document_types": ["document", "pdf"],
    "loader_kwargs": {}
  }
}

store_content_loader_config

Purpose: Controls file content fetching behavior.

Type: Object

Fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| num_retries | integer | 2 | Number of retry attempts for failed downloads |
| store_loader_kwargs | object | {} | Additional data-source-specific parameters |

num_retries:

  • Recommended: 2-3 for reliability
  • Network issues and transient failures are common with cloud storage
  • Higher values increase job duration if files are genuinely inaccessible

Example:

{
  "store_content_loader_config": {
    "num_retries": 3,
    "store_loader_kwargs": {}
  }
}

document_loader_config

Purpose: Defines parsing and chunking strategy per document type.

Type: Object with nested configuration per document type

Structure:

{
  "document_loader_config": {
    "pdf": { <pdf-specific config> },
    "document": { <docx-specific config> }
  }
}

Per-Document-Type Configuration:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Maximum characters per chunk |
| chunking_enabled | boolean | true | Enable/disable chunking |
| chunk_separator | array | ["\n\n", "\n", " ", ""] | Separators for splitting text, in priority order |
| chunk_overlap | integer | 0 | Overlapping characters between consecutive chunks |
| loader_kwargs | object | {} | Document-parser-specific parameters |
| loader_type | string | varies | Parser type (e.g., "py_pdf", "DocxLoader") |

chunk_size:

Determines the size of text segments for indexing.

How to Choose:

  • Small (300-500): Better precision, more chunks, slower indexing
  • Medium (800-1200): Balanced approach (recommended; the default is 1000)
  • Large (1500-2000): Faster indexing, may lose precision

Considerations:

  • LLM context window for embeddings
  • Average paragraph/section length in your documents
  • Trade-off between precision and recall

chunking_enabled:

  • true: Documents split into chunks (recommended)
  • false: Entire document treated as single chunk (only for very small documents)

chunk_separator:

Priority-ordered list of separators for splitting text.

Default: ["\n\n", "\n", " ", ""]

  • First tries to split on double newlines (paragraphs)
  • Then single newlines (lines)
  • Then spaces (words)
  • Finally characters (as last resort)

Custom Separators:

  • Add domain-specific separators (e.g., "===Section===")
  • Maintain hierarchical order (largest semantic units first)

chunk_overlap:

Number of characters to overlap between consecutive chunks.

Purpose: Preserve context across chunk boundaries.

When to Use:

  • 0 (the default): No overlap; faster indexing and fully distinct chunks
  • 100-200: Moderate overlap for general documents
  • 300-500: High overlap for complex documents where context is critical

Example:

{
  "document_loader_config": {
    "pdf": {
      "chunk_size": 1000,
      "chunking_enabled": true,
      "chunk_separator": ["\n\n", "\n", " ", ""],
      "chunk_overlap": 0,
      "loader_kwargs": {},
      "loader_type": "py_pdf"
    },
    "document": {
      "chunk_size": 1000,
      "chunking_enabled": true,
      "chunk_separator": ["\n\n", "\n", " ", ""],
      "chunk_overlap": 0,
      "loader_kwargs": {},
      "loader_type": "DocxLoader"
    }
  }
}
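To build intuition for how chunk_size, chunk_overlap, and priority-ordered separators interact, here is a rough, self-contained illustration. It is not the BITES parser and the splitting logic is deliberately simplified; it only mimics the behavior described above for a single in-memory string.

# Rough illustration of priority-ordered separators and chunk overlap
def simple_chunks(text, chunk_size=1000, chunk_overlap=0,
                  separators=("\n\n", "\n", " ", "")):
    # Use the first separator that actually occurs in the text.
    sep = next((s for s in separators if s and s in text), "")
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            # Carry over the tail of the previous chunk to preserve context.
            tail = current[-chunk_overlap:] if chunk_overlap else ""
            current = f"{tail}{sep}{piece}" if tail else piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(simple_chunks(text, chunk_size=40, chunk_overlap=10))

Running this on three short paragraphs with chunk_size=40 and chunk_overlap=10 yields two chunks, the second of which begins with the last ten characters of the first, illustrating how overlap preserves context across chunk boundaries.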

Different Configurations by Type:

You can use different settings for PDFs vs. DOCX:

{
  "document_loader_config": {
    "pdf": {
      "chunk_size": 1200,
      "chunk_overlap": 200
    },
    "document": {
      "chunk_size": 800,
      "chunk_overlap": 100
    }
  }
}

information_extraction_config

Purpose: Configures LLM-based entity and relationship extraction.

Type: Array of extraction task configurations

Required when: extract_information: true

Structure:

{
  "information_extraction_config": [
    {
      "task_type": "information_extraction",
      "kwargs": {},
      "llm_config": { <LLM configuration> },
      "num_retries": 3,
      "prompt_version": "v4",
      "extractor_type": "llm",
      "query_timeout": 50000
    }
  ]
}

Task Configuration Fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| task_type | string | "information_extraction" | Type of extraction task |
| kwargs | object | {} | Additional task-specific parameters |
| llm_config | object | required | LLM model configuration |
| num_retries | integer | 3 | Retry attempts for failed LLM calls |
| prompt_version | string | "v4" | Version of the extraction prompt template |
| extractor_type | string | "llm" | Extractor implementation type |
| query_timeout | integer | 50000 | Timeout for LLM queries in milliseconds |

LLM Configuration Fields:

| Field | Type | Description |
| --- | --- | --- |
| max_tokens | integer | Maximum tokens in the LLM response |
| temperature | float | Sampling temperature (0.0 = deterministic, 1.0 = creative) |
| repetition_penalty | float | Penalty for repeated tokens |
| top_p | float | Nucleus sampling parameter |
| top_k | integer | Top-k sampling parameter |
| stop | array | Stop sequences for generation |
| llm_name | string | Model identifier (provider-specific) |
| llm_provider | string | LLM provider (e.g., "bedrock", "fireworks", "openai") |
| context_window | integer | Context window size for the model |

LLM Configuration Recommendations:

For Accuracy (recommended for information extraction):

{
  "temperature": 0.0,
  "repetition_penalty": 1.0,
  "top_p": 0.7,
  "top_k": 50,
  "max_tokens": 8192
}

Supported LLM Providers:

  • AWS Bedrock: llm_provider: "bedrock"
    • Models: Llama, Claude, etc.
    • Example: "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0"
  • Fireworks AI: llm_provider: "fireworks"
  • OpenAI: llm_provider: "openai"
    • Models: GPT-4, GPT-3.5

Complete Example:

{
  "information_extraction_config": [
    {
      "task_type": "information_extraction",
      "kwargs": {},
      "llm_config": {
        "max_tokens": 8192,
        "temperature": 0.0,
        "repetition_penalty": 1.0,
        "top_p": 0.7,
        "top_k": 50,
        "stop": ["---", "</output_format>", "</output_format>\n", "</output_format>\r\n"],
        "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
        "llm_provider": "bedrock",
        "context_window": 4000
      },
      "num_retries": 3,
      "prompt_version": "v4",
      "extractor_type": "llm",
      "query_timeout": 50000
    }
  ]
}

Performance Tuning:

  • query_timeout: Increase for complex documents or slow LLM endpoints
  • num_retries: Increase for unreliable network connections
  • context_window: Match to your LLM’s actual context window to avoid truncation

Information extraction significantly increases processing time and costs due to LLM API calls per chunk. Budget accordingly for large document sets.

Complete Configuration Example

This example demonstrates a production-ready configuration for a large dataset:

{
  "list_file_parallelism": 10,
  "content_reader_parallelism": 30,
  "content_indexer_parallelism": 10,
  "document_store_type": "google_drive",
  "enhance_content": false,
  "extract_information": true,
  "store_list_file_config": {
    "page_size": 100,
    "recursive": true,
    "document_types": ["document", "pdf"],
    "loader_kwargs": {}
  },
  "store_content_loader_config": {
    "num_retries": 3,
    "store_loader_kwargs": {}
  },
  "document_loader_config": {
    "pdf": {
      "chunk_size": 1000,
      "chunking_enabled": true,
      "chunk_separator": ["\n\n", "\n", " ", ""],
      "chunk_overlap": 200,
      "loader_kwargs": {},
      "loader_type": "py_pdf"
    },
    "document": {
      "chunk_size": 1000,
      "chunking_enabled": true,
      "chunk_separator": ["\n\n", "\n", " ", ""],
      "chunk_overlap": 200,
      "loader_kwargs": {},
      "loader_type": "DocxLoader"
    }
  },
  "information_extraction_config": [
    {
      "task_type": "information_extraction",
      "kwargs": {},
      "llm_config": {
        "max_tokens": 8192,
        "temperature": 0.0,
        "repetition_penalty": 1.0,
        "top_p": 0.7,
        "top_k": 50,
        "stop": ["---", "</output_format>", "</output_format>\n", "</output_format>\r\n"],
        "llm_name": "us.meta.llama4-maverick-17b-instruct-v1:0",
        "llm_provider": "bedrock",
        "context_window": 4000
      },
      "num_retries": 3,
      "prompt_version": "v4",
      "extractor_type": "llm",
      "query_timeout": 50000
    }
  ]
}

Configuration Best Practices

  1. Start Conservative: Use default values for initial jobs
  2. Monitor Performance: Watch for bottlenecks (data source APIs, memory, Stardog load)
  3. Iterate: Gradually increase parallelism while monitoring
  4. Test Small: Run a small subset before indexing your entire dataset
  5. Consider Costs: LLM-based features (enhance_content, extract_information) significantly increase costs
  6. Match Resources: Ensure Spark cluster and Stardog instance are sized appropriately for your parallelism settings

Querying Indexed Documents

Once documents are indexed, you can query them through the Voicebox UI.

Vector Search Queries

Documents indexed without information extraction can be queried using natural language questions.

Voicebox UI

Example Questions:

  • “What are the main points in the Q4 financial report?”
  • “Summarize the product requirements document for Project Alpha”
  • “What were the action items from the last board meeting?”

Source Attribution

Voicebox provides source attribution for answers derived from indexed documents.

UI Hover

Hover over “Document Extracted” text to see:

  • Source file name
  • Page number (for PDFs) or section
  • Relevance score
  • Direct link to original document (if available)

Knowledge Graph Queries

If information extraction was enabled during indexing, you can ask questions that leverage the knowledge graph.

Voicebox UI KG Example

Example Questions:

  • “Who are the key people mentioned in relation to Project Apollo?”
  • “What organizations are connected to the merger discussion?”
  • “Show me all locations mentioned in the travel policy documents”

Knowledge Graph Benefits:

  • Discover relationships across multiple documents
  • Find entities and their connections
  • Ask graph-based questions (e.g., “What connects X and Y?”)

Source Lineage

Knowledge graph queries provide lineage showing which documents contributed to the answer.

Source Lineage

Lineage Information:

  • Source documents for each extracted entity
  • Document provenance chain

Deployment

This section provides detailed instructions for deploying BITES in your Kubernetes environment.

Deployment Architecture

Deployment Diagram

Components:

  1. Launchpad: Provides APIs and user interface
  2. voicebox-service: Manages job lifecycle, interacts with Spark Operator
  3. Spark Operator: Kubernetes operator for managing Spark applications
  4. voicebox-bites: Docker image containing the indexing application
  5. Spark Cluster: Dynamically created driver and executor pods
  6. Stardog: Database for indexed content and knowledge graphs

Prerequisites

Before deploying BITES, ensure:

  • Kubernetes cluster (version 1.19+)
  • kubectl configured to access your cluster
  • Helm 3.x installed (for Spark Operator installation)
  • Spark Operator installed in the cluster
  • Access to voicebox-bites Docker image
  • Stardog instance deployed and accessible from Kubernetes cluster

Step 1: Install Spark Operator

The Spark Operator manages the lifecycle of Spark applications in Kubernetes.

Installation via Helm:

# Add the Spark Operator Helm repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator

# Update Helm repositories
helm repo update

# Install Spark Operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \
  --set sparkJobNamespace=default

Verify Installation:

kubectl get pods -n spark-operator

You should see the spark-operator pod running.

Alternative Installation Methods:

See the official Spark Operator documentation for other installation options.

Step 2: Configure Docker Image Access

The voicebox-bites image must be accessible from your Kubernetes cluster.

Option 1: Pull from Stardog JFrog Registry

Request access to the Stardog JFrog registry and configure an image pull secret:

kubectl create secret docker-registry stardog-jfrog-secret \
  --docker-server=stardog-stardog-apps.jfrog.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_PASSWORD \
  --docker-email=YOUR_EMAIL \
  --namespace=default

Option 2: Push to Your Private Registry

  1. Pull the image from Stardog:
    docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current
    
  2. Tag and push to your registry:
    docker tag stardog-stardog-apps.jfrog.io/voicebox-bites:current \
      your-registry.com/voicebox-bites:current
    
    docker push your-registry.com/voicebox-bites:current
    
  3. Update image reference in vbx_bites_kube_config.yaml

Step 3: Configure vbx_bites_kube_config.yaml

The vbx_bites_kube_config.yaml file defines the Spark application specification.

Sample Configuration:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: voicebox-bites-job
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "stardog-stardog-apps.jfrog.io/voicebox-bites:current"
  imagePullPolicy: Always
  imagePullSecrets:
    - stardog-jfrog-secret
  mainApplicationFile: local:///app/src/voicebox_bites/etl/bulk_document_extraction.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "4g"
    labels:
      version: 3.5.0
    serviceAccount: spark-operator
  executor:
    cores: 2
    instances: 3
    memory: "4g"
    labels:
      version: 3.5.0

Key Configuration Sections:

Image Configuration:

image: "your-registry.com/voicebox-bites:current"
imagePullPolicy: Always
imagePullSecrets:
  - your-image-pull-secret

Driver Configuration (controls the Spark driver):

driver:
  cores: 2              # CPU cores for driver
  coreLimit: "2000m"    # Maximum CPU (Kubernetes format)
  memory: "4g"          # Memory allocation
  serviceAccount: spark-operator

Executor Configuration (controls the Spark executors):

executor:
  cores: 2              # CPU cores per executor
  instances: 3          # Number of executor pods
  memory: "4g"          # Memory per executor

Sizing Guidelines:

| Dataset Size | Files | Executor Instances | Executor Memory | Executor Cores |
| --- | --- | --- | --- | --- |
| Small | <100 | 2-3 | 4g | 2 |
| Medium | 100-1000 | 4-6 | 8g | 4 |
| Large | 1000-10000 | 8-12 | 16g | 4 |
| Very Large | >10000 | 15-30 | 16g | 4 |

Important: BITES does not support Kubernetes autoscaling. Configure a fixed number of executor instances and do not scale down while jobs are running.

Step 4: Configure voicebox-service

The voicebox-service needs to know where to find the Spark configuration.

Set Environment Variable:

env:
  - name: VBX_BITES_CONFIG_FILE
    value: "/config/vbx_bites_kube_config.yaml"

Mount Configuration File:

volumes:
  - name: bites-config
    configMap:
      name: vbx-bites-config

volumeMounts:
  - name: bites-config
    mountPath: /config

Create ConfigMap:

kubectl create configmap vbx-bites-config \
  --from-file=vbx_bites_kube_config.yaml \
  --namespace=default

Step 5: Configure RBAC

Ensure the voicebox-service has permissions to manage Spark applications.

Create Service Account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator
  namespace: default

Create Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-operator-role
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "delete", "update", "watch"]
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "delete", "update", "watch"]

Create RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-operator-rolebinding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark-operator
    namespace: default
roleRef:
  kind: Role
  name: spark-operator-role
  apiGroup: rbac.authorization.k8s.io

Apply RBAC Configuration:

kubectl apply -f spark-rbac.yaml

Step 6: Configure Networking

Ensure proper network connectivity between components.

Required Connectivity:

  • Launchpad → voicebox-service (API calls)
  • voicebox-service → Kubernetes API (Spark job management)
  • Spark executors → Data sources (Google Drive, S3, etc.)
  • Spark executors → Stardog (indexing)
  • Spark executors → LLM providers (if using content enhancement or information extraction)

Firewall Rules:

  • Allow outbound HTTPS (443) for data source APIs
  • Allow outbound connections to Stardog endpoint
  • Allow outbound connections to LLM provider APIs

Step 7: Deploy voicebox-service

Deploy the voicebox-service with the configured settings.

Cluster Sizing Recommendations

Minimum Cluster Size:

  • 3 nodes
  • 4 CPU cores per node
  • 16 GB RAM per node

Recommended Production Cluster:

  • 5-10 nodes
  • 8 CPU cores per node
  • 32 GB RAM per node
  • 100 GB SSD per node (for temporary storage)

Scaling Considerations:

  • Each executor needs dedicated resources
  • Driver pod requires resources
  • Kubernetes system pods consume resources
  • Leave 20-30% capacity headroom

Do not enable cluster autoscaling for nodes running Spark executors. Scale the cluster before starting large jobs and maintain the size throughout job execution.


Logging

Comprehensive logging is essential for monitoring job execution and troubleshooting issues.

Logging Architecture

BITES provides two layers of logging:

  1. Spark Logging: Framework-level logs (job scheduling, task execution, etc.)
  2. BITES Application Logging: Application-level logs (document processing, API calls, etc.)

Spark Logging Configuration

Spark logging is configured using Log4j properties.

Setup Steps

  1. Create log4j.properties:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Reduce verbosity of some packages
log4j.logger.org.apache.spark.storage=WARN
log4j.logger.org.apache.spark.scheduler=WARN
log4j.logger.org.apache.spark.util.Utils=WARN
log4j.logger.org.apache.spark.executor=INFO

Logging Levels:

  • ERROR: Only errors
  • WARN: Warnings and errors
  • INFO: Informational messages (recommended for production)
  • DEBUG: Detailed debug information (use for troubleshooting only)
  • TRACE: Very verbose (not recommended)

DEBUG and TRACE levels generate extremely large log volumes. Use only for troubleshooting specific issues.

  2. Create ConfigMap:
kubectl create configmap spark-log4j-config \
  --from-file=log4j.properties \
  --namespace=default
  3. Update vbx_bites_kube_config.yaml:

Add the following under spec:

spec:
  sparkConf:
    "spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"
    "spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"

  driver:
    configMaps:
      - name: spark-log4j-config
        path: /opt/spark

  executor:
    configMaps:
      - name: spark-log4j-config
        path: /opt/spark
  4. Apply Configuration:
kubectl apply -f vbx_bites_kube_config.yaml
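
Once a job is running, you can verify the properties file is mounted in the driver pod (pod lookup as shown under Accessing Logs):

DRIVER_POD=$(kubectl get pods -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $DRIVER_POD -- cat /opt/spark/log4j.properties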

BITES Application Logging

Application logging provides insights into document processing, API interactions, and business logic.

Setup Steps

  1. Create logging.conf:
[loggers]
keys=root,py4j

[logger_py4j]
level=WARN
handlers=nullHandler
qualname=py4j
propagate=0

[handlers]
keys=consoleHandler,nullHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=consoleHandler

[handler_nullHandler]
class=logging.NullHandler
level=CRITICAL
args=()

[handler_consoleHandler]
class=voicebox_bites.logging_setup.FlushingStreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s %(levelname)s [%(job_id)s] %(name)s - %(message)s

Logging Levels:

  • INFO: Recommended for production
  • DEBUG: Detailed processing information (use for troubleshooting)
  2. Create ConfigMap:
kubectl create configmap voicebox-bites-log-config \
  --from-file=logging.conf \
  --namespace=default
  3. Update vbx_bites_kube_config.yaml:

Add the following to both driver and executor sections:

driver:
  volumeMounts:
    - name: vbx-bites-logging-config-volume
      mountPath: /app/etc/logging.conf
      subPath: logging.conf

executor:
  volumeMounts:
    - name: vbx-bites-logging-config-volume
      mountPath: /app/etc/logging.conf
      subPath: logging.conf

# Add under spec.volumes
volumes:
  - name: vbx-bites-logging-config-volume
    configMap:
      name: voicebox-bites-log-config
  4. Apply Configuration:
kubectl apply -f vbx_bites_kube_config.yaml

Custom Log Path:

If you need to use a different path, set the VOICEBOX_BITES_LOG_CONF environment variable:

env:
  - name: VOICEBOX_BITES_LOG_CONF
    value: "/custom/path/logging.conf"

Accessing Logs

View Driver Logs

# Get driver pod name
DRIVER_POD=$(kubectl get pods -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')

# View logs
kubectl logs $DRIVER_POD

# Follow logs in real-time
kubectl logs -f $DRIVER_POD

# Save logs to file
kubectl logs $DRIVER_POD > driver.log

View Executor Logs

# List executor pods
kubectl get pods -l spark-role=executor

# View specific executor logs
kubectl logs voicebox-bites-job-exec-1

# View all executor logs
kubectl logs -l spark-role=executor

# Follow executor logs
kubectl logs -f voicebox-bites-job-exec-1
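
When scanning for problems, filtering by log level is usually faster than reading the full output:

# Surface warnings and errors from the driver
kubectl logs $DRIVER_POD | grep -E "ERROR|WARN"

# Surface recent errors across executors
kubectl logs -l spark-role=executor --tail=200 | grep -i error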

Centralized Logging

For production deployments with many executors, ship pod logs to a centralized logging system (for example, a Fluent Bit/Fluentd-based stack or your cloud provider's logging service) so that driver and executor logs remain searchable after pods terminate.

Troubleshooting

This section covers common issues and their solutions.

Permission Issues

Symptom

Error: Failed to create SparkApplication: User cannot create resource "sparkapplications"

Diagnosis

Check if the service account has the required permissions:

# Check permissions
kubectl auth can-i create sparkapplications --as=system:serviceaccount:default:spark-operator

# Check current role bindings
kubectl get rolebindings -o wide | grep spark-operator

Solution

Ensure proper RBAC configuration:

# Verify service account exists
kubectl get serviceaccount spark-operator

# Verify role exists and has correct permissions
kubectl describe role spark-operator-role

# Verify role binding
kubectl describe rolebinding spark-operator-rolebinding

# If missing, apply RBAC configuration
kubectl apply -f spark-rbac.yaml

See Step 5: Configure RBAC for complete RBAC configuration.

Data Source Authentication Errors

Google Drive: Authentication Failed

Symptom:

ERROR: Authentication failed: Invalid credentials

Common Causes:

  1. Service account JSON is not properly base64 encoded
  2. Service account doesn’t have access to the folder
  3. API not enabled in Google Cloud project

Solutions:

  1. Verify base64 encoding:
    # Encode correctly
    cat service-account.json | base64 -w 0
    
    # Test decoding
    echo "YOUR_BASE64_STRING" | base64 -d | jq .
    
  2. Share folder with service account:
    • Copy client_email from service account JSON
    • Share the Google Drive folder with this email
    • Grant at least “Viewer” permissions
  3. Enable Google Drive API:
    • Go to Google Cloud Console
    • Navigate to “APIs & Services” → “Library”
    • Search for “Google Drive API”
    • Click “Enable”

OneDrive/SharePoint: Invalid Client

Symptom:

ERROR: AADSTS7000215: Invalid client secret provided

Solutions:

  1. Regenerate client secret:
    • Secrets expire after a configured period
    • Go to Azure Portal → App registrations → Your app → Certificates & secrets
    • Create a new secret
    • Update credentials JSON and re-encode
  2. Verify permissions:
    • Check that all required permissions are granted
    • Ensure admin consent has been provided
    • Wait 5-10 minutes after granting permissions for the changes to propagate

S3: Access Denied

Symptom:

ERROR: Access Denied (Service: Amazon S3; Status Code: 403)

Solutions:

  1. Verify IAM permissions:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket",
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:s3:::your-bucket",
            "arn:aws:s3:::your-bucket/*"
          ]
        }
      ]
    }
    
  2. Check bucket policy:
    • Ensure the bucket policy doesn’t deny access
    • Verify the IAM role/user is allowed
  3. Verify region:
    • Ensure region_name in credentials matches bucket region

Job Execution Issues

Job Stuck in SUBMITTED State

Symptom: Job status remains “SUBMITTED” for an extended period.

Diagnosis:

# Check Spark Operator logs
kubectl logs -n spark-operator -l app=spark-operator

# Check pending pods
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Describe pending pods
kubectl describe pod $DRIVER_POD_NAME

Common Causes:

  1. Insufficient cluster resources
  2. Image pull errors
  3. RBAC issues

Solutions:

  1. Insufficient resources:
    # Check node resources
    kubectl describe nodes
    
    # Solution: Scale cluster or reduce resource requests
    
  2. Image pull errors:
    # Check events
    kubectl get events --sort-by='.lastTimestamp'
    
    # Solution: Verify image pull secret and image URL
    
  3. RBAC issues:
    • Verify the service account and role binding (see Permission Issues above and Step 5: Configure RBAC)

Job Fails with Out of Memory (OOM) Error

Symptom:

ERROR: Executor lost: OutOfMemoryError: Java heap space

Solutions:

  1. Increase executor memory:
    executor:
      memory: "8g"  # Increase from 4g
    
  2. Reduce parallelism:
    {
      "content_reader_parallelism": 10  // Reduce from 30
    }
    
  3. Reduce batch size:
    {
      "batch_size": 500  // Reduce from 1000
    }
    
  4. Increase number of executors (distribute load):
    executor:
      instances: 6  # Increase from 3
      memory: "4g"  # Keep same memory per executor
    

Job Fails with Rate Limit Errors

Symptom:

WARN: Rate limit exceeded for Google Drive API

Solutions:

  1. Reduce parallelism:
    {
      "list_file_parallelism": 3,
      "content_reader_parallelism": 5
    }
    

  2. Request a quota increase from the data source provider.

Indexing Issues

Documents Not Appearing in Voicebox

Diagnosis Steps:

  1. Verify job completed successfully:
    curl -X GET "https://your-launchpad-url/api/v1/voicebox/bites/jobs/$JOB_ID" \
      -H "Authorization: Bearer YOUR_API_KEY"
    
  2. Check driver logs for indexing confirmation:
    kubectl logs $DRIVER_POD | grep "Indexed"
    
  3. Verify Stardog contains data:
    SELECT (COUNT(*) as ?count) WHERE {
      ?s ?p ?o
    }
    

Solutions:

  1. Job failed silently:
    • Check logs for errors
    • Rerun job with DEBUG logging
  2. Wrong database or graph:
    • Verify Stardog connection details
    • Check that Voicebox is querying the correct database/graph
  3. Stardog connectivity issue:
    • Test connectivity from Spark executor to Stardog
    • Check firewall rules

Logging Issues

No Logs Appearing

Diagnosis:

# Check if pods exist
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Check pod status
kubectl describe pod $POD_NAME

# Check if ConfigMaps mounted correctly
kubectl exec $DRIVER_POD -- ls -la /opt/spark
kubectl exec $DRIVER_POD -- ls -la /app/etc

Solutions:

  1. ConfigMap not mounted:
    • Verify ConfigMap exists: kubectl get configmap
    • Check volumeMounts in pod spec
    • Verify path in VOICEBOX_BITES_LOG_CONF
  2. Wrong log level:
    • Check logging configuration
    • Ensure the log level is not set to ERROR or CRITICAL, which would suppress INFO messages
  3. Logs going to wrong destination:
    • Verify log handler configuration
    • Check stdout/stderr redirection

Logs Too Verbose

Solution:

Change log level from DEBUG to INFO:

For Spark Logs (log4j.properties):

log4j.rootCategory=INFO, console

For BITES Logs (logging.conf):

[logger_root]
level=INFO

Network Issues

Cannot Access Data Source

Symptom:

ERROR: Connection timeout when accessing data source

Solutions:

  1. Firewall blocking outbound connections:
    • Allow HTTPS (443) egress
    • Add data source domains to allow list
  2. Network policy blocking traffic:
    • Review network policies
    • Add exception for voicebox-bites pods
  3. DNS resolution issues:
    • Check DNS configuration in cluster
    • Verify CoreDNS is functioning

Cannot Connect to Stardog

Symptom:

ERROR: Connection refused: Stardog endpoint

Diagnosis:

# Test from driver pod
kubectl exec $DRIVER_POD -- curl -v http://stardog:5820/

# Check Stardog service
kubectl get svc stardog

Solutions:

  1. Stardog not accessible:
    • Verify Stardog is running
    • Check service endpoint
    • Verify network policies
  2. Wrong endpoint:
    • Check Stardog connection configuration
    • Verify port (default 5820)

Getting Help

If you cannot resolve an issue:

  1. Collect diagnostic information (a collection sketch follows this list):
    • Job ID
    • Driver and executor logs
    • Spark Operator logs
    • Job configuration
    • Error messages
  2. Check the documentation, including the deployment, configuration, and logging sections above and the Additional Resources below.

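A minimal sketch for collecting the diagnostic information listed in step 1 (the Launchpad URL, API key, and job ID are placeholders; adjust namespaces and labels to your deployment):

JOB_ID="your-job-id"
OUT="bites-diagnostics-$JOB_ID"
mkdir -p "$OUT"

# Job status from the BITES API
curl -s "https://your-launchpad-url/api/v1/voicebox/bites/jobs/$JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY" > "$OUT/job-status.json"

# Driver, executor, and Spark Operator logs
DRIVER_POD=$(kubectl get pods -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
kubectl logs $DRIVER_POD > "$OUT/driver.log"
kubectl logs -l spark-role=executor > "$OUT/executors.log"
kubectl logs -n spark-operator -l app=spark-operator > "$OUT/spark-operator.log"
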
Docker Image Availability

Latest Version

Pull the most recent voicebox-bites image:

docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current

Specific Versions

Pull a specific version (e.g., v0.2.0):

docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:v0.2.0

Additional Resources