Image & Audio Byte Operations

Generate images, convert text to speech, and transcribe audio within your structured data workflows. ObjectWeaver's byte operations support advanced features like verbose transcription metadata with word-level timestamps and segment analysis.

Byte operations handle binary data types (type: "byte") for image generation, text-to-speech, and audio transcription. All operations return base64-encoded data with metadata.


Image Generation

Generate images using provider models. You can use any model supported by your provider by specifying its model card name (e.g., dall-e-3).

Basic Usage

{
  "type": "byte",
  "instruction": "A beautiful sunset over mountains",
  "image": {
    "model": "dall-e-3",
    "size": "1024x1024"
  }
}

Response Format

Returns base64-encoded image data:

{
  "value": "<base64-encoded-image>",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.04,
    "modelUsed": "dall-e-3"
  }
}
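The base64 `value` can be decoded back into raw image bytes on the client. A minimal sketch in Python, assuming a response dict of the shape shown above (the placeholder bytes stand in for a real image payload):

```python
import base64

def decode_byte_value(response: dict) -> bytes:
    """Decode the base64-encoded `value` field of a byte-operation response."""
    return base64.b64decode(response["value"])

# Placeholder payload: just the PNG magic bytes, not a real image.
fake_png = b"\x89PNG\r\n\x1a\n"
response = {
    "value": base64.b64encode(fake_png).decode("ascii"),
    "metadata": {"tokensUsed": 0, "cost": 0.04, "modelUsed": "dall-e-3"},
}

image_bytes = decode_byte_value(response)
# image_bytes can now be written to disk, e.g.:
#   open("artwork.png", "wb").write(image_bytes)
```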

Example Request

curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Generate an image",
    "definition": {
      "type": "object",
      "properties": {
        "artwork": {
          "type": "byte",
          "instruction": "A serene landscape with mountains and a lake at sunset",
          "image": {
            "model": "dall-e-3",
            "size": "1024x1024"
          }
        }
      }
    }
  }'

Text-to-Speech (TTS)

Convert text to speech. Any model and voice supported by your provider can be used by specifying their names.

Basic Usage

{
  "type": "byte",
  "textToSpeech": {
    "stringToAudio": "Hello, welcome to ObjectWeaver!",
    "voice": "alloy",
    "model": "tts-1",
    "speed": 1.0
  }
}

Configuration

| Parameter | Type | Description | Default |
|---|---|---|---|
| stringToAudio | string | Text to convert to speech (required) | - |
| voice | string | Voice name (must match provider options) | - |
| model | string | Model name | tts-1 |
| speed | number | Speaking rate (0.25 to 4.0) | 1.0 |
| responseFormat | string | Output format (e.g., mp3, opus) | mp3 |
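The bounds in the table can be enforced client-side before a request is sent. A small sketch; the helper name `build_tts_field` is illustrative, not part of ObjectWeaver:

```python
def build_tts_field(text: str, voice: str, model: str = "tts-1",
                    speed: float = 1.0, response_format: str = "mp3") -> dict:
    """Build a byte-field definition for text-to-speech, enforcing the
    documented speed range of 0.25 to 4.0."""
    if not text:
        raise ValueError("stringToAudio is required")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "type": "byte",
        "textToSpeech": {
            "stringToAudio": text,
            "voice": voice,
            "model": model,
            "speed": speed,
            "responseFormat": response_format,
        },
    }

field = build_tts_field("Hello, welcome to ObjectWeaver!", voice="alloy")
```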

Response Format

Returns base64-encoded audio data:

{
  "value": "<base64-encoded-audio>",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.015,
    "modelUsed": "tts-1"
  }
}

Example Request

curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Create an audio greeting",
    "definition": {
      "type": "object",
      "properties": {
        "greeting": {
          "type": "byte",
          "textToSpeech": {
            "stringToAudio": "Welcome to our service!",
            "voice": "nova",
            "model": "tts-1-hd",
            "speed": 1.1
          }
        }
      }
    }
  }'

Speech-to-Text (STT)

Transcribe audio using provider models. Supports various output formats including verbose JSON with timestamps.

Basic Transcription

{
  "type": "byte",
  "speechToText": {
    "audioToTranscribe": "<base64-audio-data>",
    "language": "en",
    "model": "whisper-1",
    "toString": true
  }
}

Configuration

| Parameter | Type | Description | Default |
|---|---|---|---|
| audioToTranscribe | string | Base64-encoded audio data (required) | - |
| language | string | ISO 639-1 language code | en |
| model | string | Model name | whisper-1 |
| responseFormat | string | Output format (e.g., text, verbose_json) | text |
| toString | boolean | Return plain text format | true |
| toCaptions | boolean | Return SRT caption format | false |
| prompt | string | Context to improve accuracy | - |
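Because `audioToTranscribe` expects base64, a local audio file must be encoded before it goes into the definition. A minimal sketch (the in-memory placeholder bytes stand in for a real recording):

```python
import base64

def encode_audio(audio_bytes: bytes) -> str:
    """Base64-encode raw audio bytes for the audioToTranscribe field."""
    return base64.b64encode(audio_bytes).decode("ascii")

# In practice the bytes would come from disk, e.g.:
#   audio_bytes = open("recording.mp3", "rb").read()
audio_bytes = b"ID3\x04\x00"  # placeholder standing in for real MP3 data

speech_field = {
    "type": "byte",
    "speechToText": {
        "audioToTranscribe": encode_audio(audio_bytes),
        "language": "en",
        "model": "whisper-1",
        "toString": True,
    },
}
```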

Response Formats

ObjectWeaver supports standard provider response formats such as text, json, srt, vtt, and verbose_json. Refer to your provider's documentation for details on each format's structure and capabilities.

Basic Response Format

For simple formats (text, json):

{
  "value": "This is the transcribed text.",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "modelUsed": "whisper-1"
  }
}

Verbose JSON Output

When using verbose_json or diarized_json response formats, the transcription includes detailed metadata with segment-level timing, word-level timestamps, and quality metrics.

Enabling Verbose Output

{
  "type": "byte",
  "speechToText": {
    "audioToTranscribe": "<base64-audio-data>",
    "language": "en",
    "model": "whisper-1",
    "responseFormat": "verbose_json"
  }
}

Response Structure

{
  "value": "Complete transcribed text of the audio.",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "modelUsed": "whisper-1",
    "verboseData": {
      "language": "en",
      "duration": 15.5,
      "segments": [...],
      "words": [...]
    }
  }
}

VerboseData Structure

The verboseData object contains rich metadata about the transcription:

Top-Level Fields

| Field | Type | Description |
|---|---|---|
| language | string | Detected or specified language code (ISO 639-1) |
| duration | number | Total audio duration in seconds |
| segments | array | Array of transcription segments with timing and metadata |
| words | array | Array of individual word timestamps |

Segment Structure

Each segment includes detailed timing and quality metrics:

{
  "id": 0,
  "seek": 0,
  "start": 0.0,
  "end": 3.5,
  "text": "This is the first segment.",
  "tokens": [1234, 5678, 9012],
  "temperature": 0.0,
  "avg_logprob": -0.25,
  "compression_ratio": 1.2,
  "no_speech_prob": 0.01,
  "transient": false
}
| Field | Type | Description |
|---|---|---|
| id | integer | Segment identifier (sequential) |
| seek | integer | Seek position in the audio |
| start | number | Segment start time in seconds |
| end | number | Segment end time in seconds |
| text | string | Transcribed text for this segment |
| tokens | array | Token IDs used in transcription |
| temperature | number | Temperature used for generation |
| avg_logprob | number | Average log probability (quality indicator) |
| compression_ratio | number | Text compression ratio |
| no_speech_prob | number | Probability segment contains no speech |
| transient | boolean | Whether segment is transient/temporary |
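The avg_logprob and no_speech_prob fields can drive a simple client-side quality filter. A sketch with illustrative thresholds (the cutoff values below are assumptions for demonstration, not ObjectWeaver defaults):

```python
def filter_reliable_segments(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.5):
    """Keep only segments that look like confident, actual speech."""
    return [
        s for s in segments
        if s["avg_logprob"] >= min_avg_logprob
        and s["no_speech_prob"] <= max_no_speech_prob
    ]

segments = [
    {"id": 0, "text": "This is the first segment.",
     "avg_logprob": -0.25, "no_speech_prob": 0.01},
    {"id": 1, "text": "[inaudible]",
     "avg_logprob": -1.8, "no_speech_prob": 0.85},
]
reliable = filter_reliable_segments(segments)  # keeps only segment 0
```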

Word Structure

Each word includes precise timing information:

{
  "word": "transcribed",
  "start": 0.0,
  "end": 0.5
}
| Field | Type | Description |
|---|---|---|
| word | string | The individual word text |
| start | number | Word start time in seconds |
| end | number | Word end time in seconds |
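Word-level timestamps make it straightforward to find which word is being spoken at a given moment, for example when syncing visuals to audio playback. A minimal sketch:

```python
def word_at(words, t):
    """Return the word whose [start, end) interval contains time t, or None."""
    for w in words:
        if w["start"] <= t < w["end"]:
            return w["word"]
    return None

words = [
    {"word": "transcribed", "start": 0.0, "end": 0.5},
    {"word": "text", "start": 0.5, "end": 0.9},
]
word_at(words, 0.6)  # "text"
```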

Complete Example

curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Transcribe with detailed metadata",
    "definition": {
      "type": "object",
      "properties": {
        "detailedTranscription": {
          "type": "byte",
          "speechToText": {
            "audioToTranscribe": "<base64-encoded-audio>",
            "language": "en",
            "model": "whisper-1",
            "responseFormat": "verbose_json"
          }
        }
      }
    }
  }'

Use Cases for Verbose Data

The verbose metadata enables advanced use cases:

  1. Subtitle Generation: Use segment timing to create accurate subtitles with precise timestamps
  2. Word-Level Synchronization: Sync visual elements or animations with specific words in the audio
  3. Quality Assessment: Analyze avg_logprob and no_speech_prob to assess transcription confidence and reliability
  4. Speaker Analysis: With diarized_json, identify and separate different speakers in conversations
  5. Audio Editing: Use timestamps to programmatically edit or segment audio files based on content
  6. Accessibility: Create enhanced captions with precise timing for improved accessibility
  7. Content Analysis: Analyze speech patterns, pauses, and segment structure for content insights
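The subtitle-generation use case above can be sketched by mapping segment timings onto the SRT format (this is a minimal client-side example, not an ObjectWeaver feature; the toCaptions option can produce SRT server-side):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build an SRT document from verbose_json segments."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"

segments = [{"start": 0.0, "end": 3.5, "text": "This is the first segment."}]
srt = segments_to_srt(segments)
# 1
# 00:00:00,000 --> 00:00:03,500
# This is the first segment.
```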

Error Handling

Common errors include missing API keys, invalid base64 data, or provider rate limits.

Error Response

{
  "error": "descriptive error message",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0,
    "modelUsed": "model-name"
  }
}
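A caller can distinguish success from failure by checking for the error key before touching value. A minimal sketch (the exception type is illustrative):

```python
class ByteOperationError(Exception):
    """Raised when a byte-operation response contains an error."""

def unwrap(response: dict) -> str:
    """Return the base64 value, or raise if the response reports an error."""
    if "error" in response:
        raise ByteOperationError(response["error"])
    return response["value"]

ok = unwrap({"value": "aGVsbG8=", "metadata": {"cost": 0}})  # "aGVsbG8="
```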

Best Practices

  • Image Generation: Use descriptive prompts and select appropriate aspect ratios.
  • Text-to-Speech: Select voices that match your application's tone.
  • Speech-to-Text: Use verbose_json for timing data and provide a prompt for context to improve accuracy.
  • Performance: Process byte operations in parallel with other fields and cache results where possible.