Image & Audio Byte Operations
Generate images, convert text to speech, and transcribe audio within your structured data workflows. ObjectWeaver's byte operations support advanced features like verbose transcription metadata with word-level timestamps and segment analysis.
Byte operations handle binary data types (type: "byte") for image generation, text-to-speech, and audio transcription. All operations return base64-encoded data with metadata.
Image Generation
Generate images using provider models. You can use any model supported by your provider by specifying its model card name (e.g., dall-e-3).
Basic Usage
{
  "type": "byte",
  "instruction": "A beautiful sunset over mountains",
  "image": {
    "model": "dall-e-3",
    "size": "1024x1024"
  }
}
Response Format
Returns base64-encoded image data:
{
  "value": "<base64-encoded-image>",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.04,
    "modelUsed": "dall-e-3"
  }
}
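Since the value field is base64-encoded, the first client-side step is usually decoding it back to raw bytes. A minimal Python sketch (the save_image helper and the response dict are illustrative, not part of the SDK; the field name value follows the response shape above):

```python
import base64


def save_image(response: dict, path: str) -> None:
    """Decode the base64 `value` field and write the raw image bytes to disk."""
    raw = base64.b64decode(response["value"])
    with open(path, "wb") as f:
        f.write(raw)
```

The same decoding step applies to audio returned by text-to-speech, since both operations return base64 in value.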
Example Request
- cURL
- Go SDK
- Python SDK
curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Generate an image",
    "definition": {
      "type": "object",
      "properties": {
        "artwork": {
          "type": "byte",
          "instruction": "A serene landscape with mountains and a lake at sunset",
          "image": {
            "model": "dall-e-3",
            "size": "1024x1024"
          }
        }
      }
    }
  }'
definition := jsonSchema.Definition{
    Type: jsonSchema.Object,
    Properties: map[string]jsonSchema.Definition{
        "artwork": {
            Type:        jsonSchema.Byte,
            Instruction: "A serene landscape with mountains and a lake at sunset",
            Image: &jsonSchema.ImageGen{
                Model: "dall-e-3",
                Size:  "1024x1024",
            },
        },
    },
}
definition = Definition(
    definition_type="object",
    properties={
        "artwork": Definition(
            definition_type="byte",
            instruction="A serene landscape with mountains and a lake at sunset",
            image=ImageGen(
                model="dall-e-3",
                size="1024x1024"
            )
        )
    }
)
Text-to-Speech (TTS)
Convert text to speech. Any provider-supported model and voice can be used by specifying their names.
Basic Usage
{
  "type": "byte",
  "textToSpeech": {
    "stringToAudio": "Hello, welcome to ObjectWeaver!",
    "voice": "alloy",
    "model": "tts-1",
    "speed": 1.0
  }
}
Configuration
| Parameter | Type | Description | Default |
|---|---|---|---|
| stringToAudio | string | Text to convert to speech (required) | - |
| voice | string | Voice name (must match provider options) | - |
| model | string | Model name | tts-1 |
| speed | number | Speaking rate (0.25 to 4.0) | 1.0 |
| responseFormat | string | Output format (e.g., mp3, opus) | mp3 |
Response Format
Returns base64-encoded audio data:
{
  "value": "<base64-encoded-audio>",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.015,
    "modelUsed": "tts-1"
  }
}
Example Request
- cURL
- Go SDK
- Python SDK
curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Create an audio greeting",
    "definition": {
      "type": "object",
      "properties": {
        "greeting": {
          "type": "byte",
          "textToSpeech": {
            "stringToAudio": "Welcome to our service!",
            "voice": "nova",
            "model": "tts-1-hd",
            "speed": 1.1
          }
        }
      }
    }
  }'
definition := jsonSchema.Definition{
    Type: jsonSchema.Object,
    Properties: map[string]jsonSchema.Definition{
        "greeting": {
            Type: jsonSchema.Byte,
            TextToSpeech: &jsonSchema.TextToSpeech{
                StringToAudio:  "Welcome to our service!",
                Voice:          "nova",
                Model:          "tts-1-hd",
                Speed:          1.1,
                ResponseFormat: "mp3",
            },
        },
    },
}
definition = Definition(
    definition_type="object",
    properties={
        "greeting": Definition(
            definition_type="byte",
            text_to_speech=TextToSpeech(
                string_to_audio="Welcome to our service!",
                voice="nova",
                model="tts-1-hd",
                speed=1.1,
                response_format="mp3"
            )
        )
    }
)
Speech-to-Text (STT)
Transcribe audio using provider models. Supports various output formats including verbose JSON with timestamps.
Basic Transcription
{
  "type": "byte",
  "speechToText": {
    "audioToTranscribe": "<base64-audio-data>",
    "language": "en",
    "model": "whisper-1",
    "toString": true
  }
}
Configuration
| Parameter | Type | Description | Default |
|---|---|---|---|
| audioToTranscribe | string | Base64-encoded audio data (required) | - |
| language | string | ISO 639-1 language code | en |
| model | string | Model name | whisper-1 |
| responseFormat | string | Output format (e.g., text, verbose_json) | text |
| toString | boolean | Return plain text format | true |
| toCaptions | boolean | Return SRT caption format | false |
| prompt | string | Context to improve accuracy | - |
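Note that audioToTranscribe expects base64 text, not raw bytes. A minimal Python sketch for preparing that value from an audio file on disk (encode_audio is an illustrative helper, not an SDK function):

```python
import base64


def encode_audio(path: str) -> str:
    """Read an audio file and return the base64 string expected by audioToTranscribe."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

The resulting string can be placed directly into the speechToText block shown above.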
Response Formats
ObjectWeaver supports standard provider response formats such as text, json, srt, vtt, and verbose_json. Refer to your provider's documentation for details on each format's structure and capabilities.
Basic Response Format
For simple formats (text, json):
{
  "value": "This is the transcribed text.",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "modelUsed": "whisper-1"
  }
}
Verbose JSON Output
When using verbose_json or diarized_json response formats, the transcription includes detailed metadata with segment-level timing, word-level timestamps, and quality metrics.
Enabling Verbose Output
{
  "type": "byte",
  "speechToText": {
    "audioToTranscribe": "<base64-audio-data>",
    "language": "en",
    "model": "whisper-1",
    "responseFormat": "verbose_json"
  }
}
Response Structure
{
  "value": "Complete transcribed text of the audio.",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "modelUsed": "whisper-1",
    "verboseData": {
      "language": "en",
      "duration": 15.5,
      "segments": [...],
      "words": [...]
    }
  }
}
VerboseData Structure
The verboseData object contains rich metadata about the transcription:
Top-Level Fields
| Field | Type | Description |
|---|---|---|
| language | string | Detected or specified language code (ISO 639-1) |
| duration | number | Total audio duration in seconds |
| segments | array | Array of transcription segments with timing and metadata |
| words | array | Array of individual word timestamps |
Segment Structure
Each segment includes detailed timing and quality metrics:
{
  "id": 0,
  "seek": 0,
  "start": 0.0,
  "end": 3.5,
  "text": "This is the first segment.",
  "tokens": [1234, 5678, 9012],
  "temperature": 0.0,
  "avg_logprob": -0.25,
  "compression_ratio": 1.2,
  "no_speech_prob": 0.01,
  "transient": false
}
| Field | Type | Description |
|---|---|---|
| id | integer | Segment identifier (sequential) |
| seek | integer | Seek position in the audio |
| start | number | Segment start time in seconds |
| end | number | Segment end time in seconds |
| text | string | Transcribed text for this segment |
| tokens | array | Token IDs used in transcription |
| temperature | number | Temperature used for generation |
| avg_logprob | number | Average log probability (quality indicator) |
| compression_ratio | number | Text compression ratio |
| no_speech_prob | number | Probability segment contains no speech |
| transient | boolean | Whether segment is transient/temporary |
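The per-segment quality metrics can be used to flag unreliable spans before downstream processing. A rough Python sketch (the helper name and thresholds are illustrative; tune them for your audio and model):

```python
def low_confidence_segments(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.5):
    """Return segments whose metrics suggest unreliable transcription.

    A very low avg_logprob or a high no_speech_prob are both warning signs.
    """
    return [
        s for s in segments
        if s["avg_logprob"] < min_avg_logprob
        or s["no_speech_prob"] > max_no_speech_prob
    ]
```

Flagged segments can then be re-transcribed, reviewed manually, or dropped, depending on the workflow.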
Word Structure
Each word includes precise timing information:
{
  "word": "transcribed",
  "start": 0.0,
  "end": 0.5
}
| Field | Type | Description |
|---|---|---|
| word | string | The individual word text |
| start | number | Word start time in seconds |
| end | number | Word end time in seconds |
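Word timestamps make it straightforward to look up what is being spoken at a given playback position, which is the basis of word-level synchronization. A small illustrative Python helper (not part of the SDK):

```python
def word_at(words, t):
    """Return the word being spoken at playback time t (seconds), or None.

    Each entry in `words` is expected to have `word`, `start`, and `end` keys,
    matching the word structure above.
    """
    for w in words:
        if w["start"] <= t < w["end"]:
            return w["word"]
    return None
```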
Complete Example
- cURL
- Response
curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Transcribe with detailed metadata",
    "definition": {
      "type": "object",
      "properties": {
        "detailedTranscription": {
          "type": "byte",
          "speechToText": {
            "audioToTranscribe": "<base64-encoded-audio>",
            "language": "en",
            "model": "whisper-1",
            "responseFormat": "verbose_json"
          }
        }
      }
    }
  }'
{
  "data": {
    "detailedTranscription": "Complete transcription text here."
  },
  "detailedData": {
    "detailedTranscription": {
      "value": "Complete transcription text here.",
      "metadata": {
        "tokensUsed": 0,
        "cost": 0.006,
        "modelUsed": "whisper-1",
        "verboseData": {
          "language": "en",
          "duration": 15.5,
          "segments": [
            {
              "id": 0,
              "start": 0.0,
              "end": 3.5,
              "text": "First segment of speech.",
              "tokens": [1234, 5678],
              "temperature": 0.0,
              "avg_logprob": -0.25,
              "compression_ratio": 1.2,
              "no_speech_prob": 0.01
            },
            {
              "id": 1,
              "start": 3.5,
              "end": 7.2,
              "text": "Second segment of speech.",
              "tokens": [9012, 3456],
              "temperature": 0.0,
              "avg_logprob": -0.22,
              "compression_ratio": 1.3,
              "no_speech_prob": 0.02
            }
          ],
          "words": [
            {"word": "First", "start": 0.0, "end": 0.5},
            {"word": "segment", "start": 0.5, "end": 1.0},
            {"word": "of", "start": 1.0, "end": 1.2},
            {"word": "speech", "start": 1.2, "end": 1.8}
          ]
        }
      }
    }
  },
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "duration": 1234567890,
    "fieldCount": 1
  }
}
Use Cases for Verbose Data
The verbose metadata enables advanced use cases:
- Subtitle Generation: Use segment timing to create accurate subtitles with precise timestamps
- Word-Level Synchronization: Sync visual elements or animations with specific words in the audio
- Quality Assessment: Analyze avg_logprob and no_speech_prob to assess transcription confidence and reliability
- Speaker Analysis: With diarized_json, identify and separate different speakers in conversations
- Audio Editing: Use timestamps to programmatically edit or segment audio files based on content
- Accessibility: Create enhanced captions with precise timing for improved accessibility
- Content Analysis: Analyze speech patterns, pauses, and segment structure for content insights
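As a concrete example, subtitle generation reduces to formatting each segment's timing as an SRT block. A minimal Python sketch (the to_srt helper is illustrative; it assumes segments shaped like the verbose_json examples above):

```python
def to_srt(segments):
    """Render verbose_json segments as an SRT subtitle string."""

    def ts(sec):
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = int(round(sec * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

Alternatively, setting toCaptions to true asks the service itself for SRT output, which avoids client-side formatting entirely.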
Error Handling
Common errors include missing API keys, invalid base64 data, or provider rate limits.
Error Response
{
  "error": "descriptive error message",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0,
    "modelUsed": "model-name"
  }
}
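On the client side, a simple guard can distinguish error payloads from successful results. An illustrative Python check based on the error shape above (the helper name is hypothetical):

```python
def check_byte_result(result: dict) -> dict:
    """Raise if a byte-operation result carries an error field; otherwise return it."""
    if "error" in result:
        raise RuntimeError(f"byte operation failed: {result['error']}")
    return result
```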
Best Practices
- Image Generation: Use descriptive prompts and select appropriate aspect ratios.
- Text-to-Speech: Select voices that match your application's tone.
- Speech-to-Text: Use verbose_json for timing data and provide a prompt for context to improve accuracy.
- Performance: Process byte operations in parallel with other fields and cache results where possible.