Using Unstructured Data with Voicebox
This guide explains how developers can integrate and manage unstructured data effectively with Stardog Voicebox.
Overview
BITES (Blob Indexing and Text Enrichment with Semantics), or unstructured data support, enables users to ingest data from sources like Google Drive and Microsoft OneDrive into Voicebox. This allows users to query both structured and unstructured data through Voicebox’s conversational AI chat interface.
Supported data sources:
- Google Drive
- Microsoft OneDrive
- Local storage
Supported document formats:
- Microsoft Word (DOCX)
The system currently parses and indexes textual and tabular data. Image parsing within these documents is planned for a future release.
Key Functionality
- Data Ingestion: Data from Google Drive, Microsoft OneDrive, and local files can be ingested.
- Unified Querying: Users can query both structured and unstructured data through the Voicebox chat interface.
- API Access: Currently, BITES functionality is accessible exclusively through Launchpad’s public APIs. Obtain an API key from the ‘Manage API Keys’ page before proceeding.
- Job Management: APIs interact with the Voicebox service’s Job Management interfaces, which trigger jobs in the customer’s Kubernetes environment using the Spark operator.
- Job Control: Users can manage jobs via APIs to get job status or cancel jobs.
- Data Processing: Spark jobs read data, parse it, optionally enhance it with an LLM, and index it in Stardog’s vector store.
Architecture Diagram
- User initiates data ingestion and indexing via Launchpad’s public APIs.
- Voicebox service triggers a Spark job in the Kubernetes environment.
- Spark job processes data and indexes it in Stardog.
- User queries Voicebox, which retrieves answers from both structured and unstructured data.
Requirements
Voicebox requires the user to perform the steps below so the BITES indexing APIs can read the data sources.
Configuring Data Sources
Google Drive
- Go to Google Auth Platform and create a client.
- Go to Data Access and add the required scopes.
- Go to IAM & Admin.
- Create a Service Account. Go to the Keys tab.
- Create a key using the Add Key option. Download the JSON file.
Sample service account JSON File:
{
"type": "service_account",
"project_id": "bites-2",
"private_key_id": "4980a35b12af8ff38c0191f76c8f3",
"private_key": "---BEGIN PRIVATE KEY---\nMIIEvgIBADANBgk+kM\n-----END PRIVATE KEY---\n",
"client_email": "bites-service-account@bites-2.iam.gserviceaccount.com",
"client_id": "111297684465611305582",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bites-service-account%40bites-2.iam.gserviceaccount.com",
"universe_domain": "googleapis.com"
}
Microsoft OneDrive (Azure)
The Voicebox BITES indexing job needs the following information to read from Microsoft OneDrive.
- Application (client) ID / Client Id - Available on the overview page of the registered app in the Azure Portal.
- Directory (tenant) ID / Tenant Id - Available on the overview page of the registered app in the Azure Portal.
- Secret:
  - Go to the “Certificates & Secrets” tab and create a secret.
  - Copy the secret value; this is the value you provide for Secret.
The following permissions need to be set for Microsoft Graph:
- Files.Read
- Files.Read.All (Delegated)
- Files.Read.All (Application)
- offline_access (Delegated)
- openid (Delegated)
- Sites.Read.All (Delegated)
- User.Read (Delegated)
Sample JSON:
{
"tenant_id":"ff34ca66-bbaa-4def-8acf-445ada42",
"client_id":"3d4c893c-4d87-984e-ffa7c53d0454",
"client_secret":"l.28Q~D4h.Q46Kc0_pZ5kupW_wTZ0iagl"
}
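Whichever source you use, the indexing APIs described below expect this credentials JSON as a base64-encoded string (the credentials request parameter). A minimal Python sketch of producing that string (the filename is illustrative):

import base64

# Read the downloaded credentials JSON and base64-encode it
with open("service_account.json", "rb") as f:
    print(base64.b64encode(f.read()).decode("ascii"))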
Setup
This section covers the steps you need to perform before running BITES.
Get the API Key
To perform indexing, the user must first generate an API Key for the required database via the “Manage API Keys” page.
Running the voicebox-bites image
Within a Kubernetes environment, ensure that both the voicebox-bites and voicebox-service containers are running concurrently.
Connections within Launchpad
There are two ways you can create the Stardog connection.
- Directly using a Stardog username and password:
  - Configure Stardog to generate tokens.
  - When you initiate the job, Stardog will generate a token and pass it to the job.
- Configuring and using SSO (Azure/Okta/Ping):
  - Pass the refresh token and the SSO provider client ID in the API.
  - When you initiate the job, Stardog will call the SSO provider to fetch an access token. This token is then passed to the job to connect to Stardog.
The token expiry must be set based on the number of files to be processed, and thus the expected job execution duration. We recommend setting it to 30 days.
Indexing APIs
This section covers public APIs to manage the indexing job from Launchpad. Remember to include your Launchpad API Key with each request.
- initiate_indexing_job: Takes arguments such as the data store service account credentials, Stardog details (e.g., endpoint, database, graph), and the indexing job configuration.
- get_job_status: Returns the current status of a job.
- cancel_indexing_job: Cancels a running job.
Currently, there is no UI for triggering or managing these jobs. You must use the provided APIs. There is also no Job History yet, so you must manage this yourself.
Initiate Indexing Job
Path
/api/v1/voicebox/bites/jobs
Request structure
- directory (str): The directory location. It should be the directory ID (Google Drive), the folder path (OneDrive), or the path to the directory (local storage). In the last case, the directory must be accessible from the Kubernetes environment.
- credentials (str): The base64-encoded string that will be used to access the folder/directory from the previous bullet. The JSON parameters needed to connect to the data source must be base64 encoded and passed through this argument.
  - Sample JSON for Google Drive:

    {
      "type": "service_account",
      "project_id": "bites-2",
      "private_key_id": "4980a35b12af8ff38c0191f76c8f3",
      "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgk+kM\n-----END PRIVATE KEY-----\n",
      "client_email": "bites-service-account@bites-2.iam.gserviceaccount.com",
      "client_id": "111297684465611305582",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bites-service-account%40bites-2.iam.gserviceaccount.com",
      "universe_domain": "googleapis.com"
    }

  - Sample JSON for Microsoft OneDrive:

    {
      "tenant_id": "ff24ca66-dbaa-4def-8acf-43f2635ada42",
      "client_id": "3d4c-a24c-4d87-984e-ffa7c554",
      "client_secret": "l.28SSSEcsfI-D4h.Q46Kc0_pZ5kupW_wTZ0iagl"
    }
- batch_size (int): Batch size for the job. This value sets the number of chunks committed at a time. Each document is divided into multiple smaller chunks. Modifying this value may impact performance. The default is 1000.
- sso_provider_client_id (str | None): Client ID of the SSO provider. This is mandatory for SSO connections. The default is None.
- refresh_token (str | None): Refresh token from the SSO provider, used to get an access token. This is mandatory for SSO connections. The default is None.
- job_config (dict | None): This configuration controls both the scalability and functionality of the indexing job; the individual settings are described under Job Configuration below.
  - Sample configuration for this parameter:

    {
      "list_file_parallelism": 5,
      "content_reader_parallelism": 10,
      "content_indexer_parallelism": 5,
      "document_store_type": "google_drive",
      "enhance_content": false,
      "store_list_file_config": {
        "page_size": 100,
        "recursive": true,
        "document_types": ["document", "pdf"]
      },
      "document_loader_config": {
        "pdf": {
          "chunk_size": 1000,
          "chunking_enabled": true,
          "chunk_separator": ["\n\n", "\n", " ", ""],
          "chunk_overlap": 0
        },
        "document": {
          "chunk_size": 500,
          "chunking_enabled": true,
          "chunk_separator": ["\n\n", "\n", " ", ""],
          "chunk_overlap": 0
        }
      },
      "content_enhancer_config": {
        "llm_config": {
          "max_tokens": 10000,
          "temperature": 0.0,
          "repetition_penalty": 1.0,
          "top_p": 0.7,
          "top_k": 50,
          "stop": ["<|eot_id|>", "<|end_of_text|>"],
          "llm_name": "accounts/fireworks/models/llama-v3p1-70b-instruct",
          "llm_provider": "fireworks",
          "context_window": 4000,
          "instruct_format": "llama3-instruct"
        }
      }
    }
- job_name (str): Name of the job to be created in the Spark environment.
- job_namespace (str): Namespace of the job to be created. The default value is an empty string. By default, the Spark job uses the namespace provided in vbx_bites_kube_config.yaml unless a value is provided here.
- extra_args (dict | None): Additional JSON arguments to be passed to the job. This is optional. Supported arguments:
  - one_drive_id: Drive ID when Microsoft OneDrive is the data source. Example: {"one_drive_id": "drive id here"}
Response structure
- job_id (str | None): Job ID of the created job.
- error (str | None): Error message (if any) if the job fails to be created in the Spark environment.
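As an illustration, here is a minimal sketch of initiating a job with Python’s requests library. The base URL, the Bearer-style authorization header, and the use of POST are assumptions; adapt them to how your Launchpad deployment expects the API key to be supplied.

import base64
import requests

BASE_URL = "https://launchpad.example.com"  # hypothetical Launchpad URL
API_KEY = "<your Launchpad API key>"

# Base64-encode the data source credentials (see "Configuring Data Sources")
with open("service_account.json", "rb") as f:
    credentials = base64.b64encode(f.read()).decode("ascii")

payload = {
    "directory": "<Google Drive directory id>",
    "credentials": credentials,
    "batch_size": 1000,
    "job_name": "bites-demo-job",
    "job_config": {
        "document_store_type": "google_drive",
        "enhance_content": False,
    },
    # For SSO connections, also pass "sso_provider_client_id" and "refresh_token".
}

resp = requests.post(
    f"{BASE_URL}/api/v1/voicebox/bites/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumption: adjust to your deployment
    json=payload,
)
resp.raise_for_status()
body = resp.json()
print(body.get("job_id") or body.get("error"))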
Get Status
Returns the current status of the job.
Path
/api/v1/voicebox/bites/jobs/{job_id}
Request Structure
- job_id (str): Job ID of the created job (path parameter).
- job_namespace (str): Namespace of the created job. The default value is an empty string. By default, the Spark job uses the namespace provided in vbx_bites_kube_config.yaml unless a value is provided here.
Response Structure
- status_code: Status code of the job. Possible values of status_code:
  - NEW
  - SUBMITTED
  - RUNNING
  - PENDING_RERUN
  - INVALIDATING
  - SUCCEEDING
  - COMPLETED
  - ERROR
  - FAILING
  - FAILED
  - UNKNOWN
- status (str): Status message of the job.
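A minimal polling sketch under the same assumptions as the earlier example (hypothetical base URL, Bearer-style auth header, GET on the status path):

import time
import requests

BASE_URL = "https://launchpad.example.com"  # hypothetical
API_KEY = "<your Launchpad API key>"
TERMINAL = {"COMPLETED", "ERROR", "FAILED"}

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    # Poll the status endpoint until the job reaches a terminal state
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/v1/voicebox/bites/jobs/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        resp.raise_for_status()
        status_code = resp.json()["status_code"]
        if status_code in TERMINAL:
            return status_code
        time.sleep(poll_seconds)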
Cancel Job
Cancels the running job.
Path
/api/v1/voicebox/bites/jobs/{job_id}/cancel
Request Structure
- job_name (str): Name of the job to be canceled.
- job_namespace (str): Namespace of the job to be canceled. The default value is an empty string. By default, the Spark job uses the namespace provided in vbx_bites_kube_config.yaml unless a value is provided here.
Response Structure
- success (boolean): Whether the job was successfully canceled.
- error (str | None): Error message (if any) if the job fails to be deleted from the Spark environment.
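And a matching cancellation sketch (again assuming POST, the same hypothetical base URL, and the same auth header):

import requests

BASE_URL = "https://launchpad.example.com"  # hypothetical
API_KEY = "<your Launchpad API key>"

def cancel_job(job_id: str, job_name: str, job_namespace: str = "") -> bool:
    # Request cancellation of a running indexing job
    resp = requests.post(
        f"{BASE_URL}/api/v1/voicebox/bites/jobs/{job_id}/cancel",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"job_name": job_name, "job_namespace": job_namespace},
    )
    resp.raise_for_status()
    return bool(resp.json().get("success"))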
Document Indexing Pipeline Overview
The indexing pipeline for documents involves a series of critical tasks and jobs designed to ingest, process, and prepare content for efficient retrieval and use. These tasks ensure that a wide range of documents are properly indexed and made searchable.
Key stages in the indexing pipeline:
- List directories of a store: The initial step involves identifying and enumerating all the directories within a designated store. This ensures the indexing process covers the entirety of the storage location.
- List supported files: Once the directories are listed, the next task is to identify and filter for supported file types. This step is crucial to ensure only relevant and processable files are included in the indexing process, while excluding any irrelevant or incompatible file formats.
- Fetch content and metadata: After supported files are identified, their content and associated metadata must be fetched. This includes retrieving the actual text or data within the file, as well as capturing important metadata such as file names, creation dates, modification dates, author information, and other relevant attributes.
- Parsing and chunking content: The fetched content often needs to be parsed and broken down into smaller, more manageable chunks. Chunking allows for more granular indexing and improves the precision of search results.
- Enhance content (optional): Content enhancement involves enriching the extracted tables within the text with additional information or transformations. Enhancement improves search relevancy and provides deeper insights into the content. This step involves additional LLM calls.
- Index chunks: Finally, the processed and enhanced chunks are indexed, making them searchable. This step typically involves storing the chunks and their associated metadata in Stardog so they are optimized for quick and efficient retrieval. The index is the foundation for search functionality, allowing users to find relevant information within the processed documents.
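To make the flow concrete, here is a rough, illustrative Python sketch of these stages for the local-storage case. Every function here is a hypothetical stand-in for the job’s internals, not the actual implementation:

import os

SUPPORTED = (".docx",)  # see "Supported document formats"

def list_supported_files(root: str, recursive: bool = True):
    # Stages 1-2: enumerate directories and keep only supported file types
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.lower().endswith(SUPPORTED):
                yield os.path.join(dirpath, name)
        if not recursive:
            break

def chunk_text(text: str, size: int = 1000):
    # Stage 4 (simplified): fixed-size chunking; the real knobs live in
    # document_loader_config (see "Job Configuration" below)
    return [text[i:i + size] for i in range(0, len(text), size)]

def index_directory(root: str):
    for path in list_supported_files(root):
        # Stage 3: fetch content and metadata
        metadata = {"name": os.path.basename(path), "modified": os.path.getmtime(path)}
        with open(path, "rb") as f:
            text = f.read().decode(errors="ignore")  # stand-in for real DOCX parsing
        chunks = chunk_text(text)  # stage 4: parse and chunk
        # Stage 5 (optional) would enhance chunks with an LLM here
        print(f"indexing {len(chunks)} chunks for {metadata['name']}")  # stage 6 stand-in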
Job Configuration
Various settings are exposed to control the scalability and functionality of the different steps in the indexing job. All of these settings are combined in a final JSON file that has to be base64 encoded and passed as an argument to the initiate_indexing_job API.
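For example, a minimal Python sketch of producing such an encoded configuration (the settings shown are illustrative values, not recommendations):

import base64
import json

job_config = {
    "document_store_type": "google_drive",
    "enhance_content": False,
}
# Serialize the configuration to JSON, then base64-encode it
encoded = base64.b64encode(json.dumps(job_config).encode("utf-8")).decode("ascii")
print(encoded)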
Scalability Configurations
- list_file_parallelism: Use this if you configure your indexing job to fetch documents recursively (i.e., all documents exist in sub-directories of the initially provided directory). Based on this setting, the indexing job will fetch the file names and their metadata in parallel from the different directories. The default value is 5. This setting has no effect when the job is not configured to fetch documents recursively.
- content_reader_parallelism: This setting controls how the indexing job will be scaled while crawling the content of the files from the document store and parsing them. This setting can significantly impact the performance of the indexing job. We recommend striking a balance between the number of files to index and the number of cores available for the indexing cluster. The default value is 10.
- Suppose you are indexing a huge dataset with thousands of files. To achieve optimal performance, we recommend setting this value to a number close to the maximum number of cores available for the indexing cluster.
- Keep in mind this setting will also have memory implications, so nodes in the indexing cluster should have sufficient memory to handle it.
- content_indexer_parallelism: The indexing phase can be scaled by modifying this value. The indexing phase ingests the chunks into the Stardog vector store.
We make use of document store APIs to fetch the file metadata and content of the files (for example, Google APIs in the case of Google Drive). Keep the rate limits in mind while setting the values of list_file_parallelism and content_reader_parallelism.
Functional Configuration
- document_store_type: Set this value to one of the supported document stores (google_drive / onedrive / local).
- enhance_content: Set this value to true if you want to enhance the content using an LLM. This includes the chunking of pages and parsing of tables.
- store_list_file_config: Controls how files are discovered before loading and chunking.
  - page_size: Maximum number of files’ metadata to read per iteration (pagination) from the document store. For the Google store, for example, it controls the batch size while listing the files. The default value is 100.
  - recursive: If true, files are searched for in all subdirectories recursively.
  - document_types: Specifies the file types (e.g., “document”, “pdf”) that should be included for processing.
  - Sample configuration:

    {
      "store_list_file_config": {
        "page_size": 100,
        "recursive": true,
        "document_types": ["document", "pdf"]
      }
    }
- document_loader_config: This section defines how to load and chunk documents based on their type. For each document type (e.g., “pdf” and “document”):
  - chunk_size: Number of characters per chunk to be created from the document content.
  - chunking_enabled: Enables or disables chunking; if false, the full document is treated as a single chunk.
  - chunk_separator: Ordered list of separators used to split text into logical chunks ("\n\n", "\n", " ", or "").
  - chunk_overlap: Number of overlapping characters between consecutive chunks; useful for preserving context.
  - Sample configuration:

    {
      "document_loader_config": {
        "pdf": {
          "chunk_size": 1000,
          "chunking_enabled": true,
          "chunk_separator": ["\n\n", "\n", " ", ""],
          "chunk_overlap": 0
        },
        "document": {
          "chunk_size": 500,
          "chunking_enabled": true,
          "chunk_separator": ["\n\n", "\n", " ", ""],
          "chunk_overlap": 0
        }
      }
    }
Submitting questions
Questions can be asked in the Voicebox UI. Users can hover over the Document Extracted text to see additional information provided with the response (such as the file name and page number).
Deployment
Kubernetes Environment Setup
Spark Operator Installation
The voicebox-service includes a YAML configuration file (vbx_bites_kube_config.yaml) which requires your customization. This file adheres to the YAML structure defined by the Spark Operator (see the Spark Operator documentation). The voicebox-service processes this configuration and uses the Spark Operator to execute ETL jobs within a Kubernetes environment. These jobs are implemented using Spark within the voicebox-bites image.
Cluster Recommendations
The current indexing implementation does not support automatic cluster scaling. Consequently, you must manually configure the cluster with a fixed number of nodes. This node count must remain constant throughout the duration of the Spark job and should not be reduced once the job has commenced.
Note: the Spark cluster is not supported in auto-scale mode.
Logging
Logging can be enabled using ConfigMaps.
Create a log4j.properties file with the content below.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
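Then create the ConfigMap referenced below (named spark-log4j-config) in the namespace where the Spark jobs run; with kubectl this would look like:
kubectl create configmap spark-log4j-config --from-file=log4j.properties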
For log4j.rootCategory, you can replace INFO with DEBUG to get more detailed logging when troubleshooting. Be cautious, as this increases log volume considerably.
Per the spark-operator specs, add the content below to the config file provided with voicebox-service (vbx_bites_kube_config.yaml).
Under the spec section:
sparkConf:
"spark.driver.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"
"spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:/opt/spark/log4j.properties"
Under the driver and executor sections:
configMaps:
- name: spark-log4j-config
path: /opt/spark
Availability
The latest Voicebox BITES image is available at:
docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:current
To pull a specific version, e.g., v0.1.1:
docker pull stardog-stardog-apps.jfrog.io/voicebox-bites:v0.1.1
Troubleshooting
If you encounter permission issues with spark-operator, ensure roles are assigned appropriately to a service account such that it can manage (create, get, and delete) SparkApplications.
You can check role permissions with:
kubectl auth can-i create sparkapplications
Replace create with the permission you want to check (e.g., get or delete).