Image & Audio Byte Operations

Generate images, convert text to speech, and transcribe audio within your structured data workflows. ObjectWeaver's byte operations support advanced features like verbose transcription metadata with word-level timestamps and segment analysis.

Byte operations handle binary data types (type: "byte") for image generation, text-to-speech, and audio transcription. All operations return base64-encoded data with metadata.


Image Generation

Generate images using provider models. You can use any model supported by your provider by specifying its model card name (e.g., dall-e-3).

Basic Usage

{
  "type": "byte",
  "instruction": "A beautiful sunset over mountains",
  "image": {
    "model": "dall-e-3",
    "size": "1024x1024"
  }
}

Response Format

Returns base64-encoded image data:

{
  "value": "<base64-encoded-image>",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.04,
    "modelUsed": "dall-e-3"
  }
}
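The base64 `value` can be decoded back into raw image bytes on the client. A minimal sketch in Python, assuming a response dict of the shape shown above (the placeholder bytes stand in for a real image payload):

```python
import base64

def decode_byte_value(response: dict) -> bytes:
    """Decode the base64-encoded `value` field of a byte-operation response."""
    return base64.b64decode(response["value"])

# Placeholder payload: just the PNG magic bytes, not a real image.
fake_png = b"\x89PNG\r\n\x1a\n"
response = {
    "value": base64.b64encode(fake_png).decode("ascii"),
    "metadata": {"tokensUsed": 0, "cost": 0.04, "modelUsed": "dall-e-3"},
}

image_bytes = decode_byte_value(response)
# image_bytes can now be written to disk, e.g.:
#   open("artwork.png", "wb").write(image_bytes)
```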

Example Request

curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Generate an image",
    "definition": {
      "type": "object",
      "properties": {
        "artwork": {
          "type": "byte",
          "instruction": "A serene landscape with mountains and a lake at sunset",
          "image": {
            "model": "dall-e-3",
            "size": "1024x1024"
          }
        }
      }
    }
  }'

Text-to-Speech (TTS)

Convert text to speech. Any model and voice supported by your provider can be used by specifying their names.

Basic Usage

{
  "type": "byte",
  "textToSpeech": {
    "stringToAudio": "Hello, welcome to ObjectWeaver!",
    "voice": "alloy",
    "model": "tts-1",
    "speed": 1.0
  }
}

Configuration

| Parameter | Type | Description | Default |
|---|---|---|---|
| stringToAudio | string | Text to convert to speech (required) | - |
| voice | string | Voice name (must match provider options) | - |
| model | string | Model name | tts-1 |
| speed | number | Speaking rate (0.25 to 4.0) | 1.0 |
| responseFormat | string | Output format (e.g., mp3, opus) | mp3 |
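The bounds in the table can be enforced client-side before a request is sent. A small sketch; the helper name `build_tts_field` is illustrative, not part of ObjectWeaver:

```python
def build_tts_field(text: str, voice: str, model: str = "tts-1",
                    speed: float = 1.0, response_format: str = "mp3") -> dict:
    """Build a byte-field definition for text-to-speech, enforcing the
    documented speed range of 0.25 to 4.0."""
    if not text:
        raise ValueError("stringToAudio is required")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "type": "byte",
        "textToSpeech": {
            "stringToAudio": text,
            "voice": voice,
            "model": model,
            "speed": speed,
            "responseFormat": response_format,
        },
    }

field = build_tts_field("Hello, welcome to ObjectWeaver!", voice="alloy")
```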

Response Format

Returns base64-encoded audio data:

{
  "value": "<base64-encoded-audio>",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.015,
    "modelUsed": "tts-1"
  }
}

Example Request

curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Create an audio greeting",
    "definition": {
      "type": "object",
      "properties": {
        "greeting": {
          "type": "byte",
          "textToSpeech": {
            "stringToAudio": "Welcome to our service!",
            "voice": "nova",
            "model": "tts-1-hd",
            "speed": 1.1
          }
        }
      }
    }
  }'

Speech-to-Text (STT)

Transcribe audio using provider models. Supports various output formats including verbose JSON with timestamps.

Basic Transcription

{
  "type": "byte",
  "speechToText": {
    "audioToTranscribe": "<base64-audio-data>",
    "language": "en",
    "model": "whisper-1",
    "toString": true
  }
}

Configuration

| Parameter | Type | Description | Default |
|---|---|---|---|
| audioToTranscribe | string | Base64-encoded audio data (required) | - |
| language | string | ISO 639-1 language code | en |
| model | string | Model name | whisper-1 |
| responseFormat | string | Output format (e.g., text, verbose_json) | text |
| toString | boolean | Return plain text format | true |
| toCaptions | boolean | Return SRT caption format | false |
| prompt | string | Context to improve accuracy | - |
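Because `audioToTranscribe` expects base64, a local audio file must be encoded before it goes into the definition. A minimal sketch (the in-memory placeholder bytes stand in for a real recording):

```python
import base64

def encode_audio(audio_bytes: bytes) -> str:
    """Base64-encode raw audio bytes for the audioToTranscribe field."""
    return base64.b64encode(audio_bytes).decode("ascii")

# In practice the bytes would come from disk, e.g.:
#   audio_bytes = open("recording.mp3", "rb").read()
audio_bytes = b"ID3\x04\x00"  # placeholder standing in for real MP3 data

speech_field = {
    "type": "byte",
    "speechToText": {
        "audioToTranscribe": encode_audio(audio_bytes),
        "language": "en",
        "model": "whisper-1",
        "toString": True,
    },
}
```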

Response Formats

ObjectWeaver supports standard provider response formats such as text, json, srt, vtt, and verbose_json. Refer to your provider's documentation for details on each format's structure and capabilities.

Basic Response Format

For simple formats (text, json):

{
  "value": "This is the transcribed text.",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "modelUsed": "whisper-1"
  }
}

Verbose JSON Output

When using verbose_json or diarized_json response formats, the transcription includes detailed metadata with segment-level timing, word-level timestamps, and quality metrics.

Enabling Verbose Output

{
  "type": "byte",
  "speechToText": {
    "audioToTranscribe": "<base64-audio-data>",
    "language": "en",
    "model": "whisper-1",
    "responseFormat": "verbose_json"
  }
}

Response Structure

{
  "value": "Complete transcribed text of the audio.",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0.006,
    "modelUsed": "whisper-1",
    "verboseData": {
      "language": "en",
      "duration": 15.5,
      "segments": [...],
      "words": [...]
    }
  }
}

VerboseData Structure

The verboseData object contains rich metadata about the transcription:

Top-Level Fields

| Field | Type | Description |
|---|---|---|
| language | string | Detected or specified language code (ISO 639-1) |
| duration | number | Total audio duration in seconds |
| segments | array | Array of transcription segments with timing and metadata |
| words | array | Array of individual word timestamps |

Segment Structure

Each segment includes detailed timing and quality metrics:

{
  "id": 0,
  "seek": 0,
  "start": 0.0,
  "end": 3.5,
  "text": "This is the first segment.",
  "tokens": [1234, 5678, 9012],
  "temperature": 0.0,
  "avg_logprob": -0.25,
  "compression_ratio": 1.2,
  "no_speech_prob": 0.01,
  "transient": false
}
| Field | Type | Description |
|---|---|---|
| id | integer | Segment identifier (sequential) |
| seek | integer | Seek position in the audio |
| start | number | Segment start time in seconds |
| end | number | Segment end time in seconds |
| text | string | Transcribed text for this segment |
| tokens | array | Token IDs used in transcription |
| temperature | number | Temperature used for generation |
| avg_logprob | number | Average log probability (quality indicator) |
| compression_ratio | number | Text compression ratio |
| no_speech_prob | number | Probability segment contains no speech |
| transient | boolean | Whether segment is transient/temporary |
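The avg_logprob and no_speech_prob fields can drive a simple client-side quality filter. A sketch with illustrative thresholds (the cutoff values below are assumptions for demonstration, not ObjectWeaver defaults):

```python
def filter_reliable_segments(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.5):
    """Keep only segments that look like confident, actual speech."""
    return [
        s for s in segments
        if s["avg_logprob"] >= min_avg_logprob
        and s["no_speech_prob"] <= max_no_speech_prob
    ]

segments = [
    {"id": 0, "text": "This is the first segment.",
     "avg_logprob": -0.25, "no_speech_prob": 0.01},
    {"id": 1, "text": "[inaudible]",
     "avg_logprob": -1.8, "no_speech_prob": 0.85},
]
reliable = filter_reliable_segments(segments)  # keeps only segment 0
```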

Word Structure

Each word includes precise timing information:

{
  "word": "transcribed",
  "start": 0.0,
  "end": 0.5
}
| Field | Type | Description |
|---|---|---|
| word | string | The individual word text |
| start | number | Word start time in seconds |
| end | number | Word end time in seconds |
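Word-level timestamps make it straightforward to find which word is being spoken at a given moment, for example when syncing visuals to audio playback. A minimal sketch:

```python
def word_at(words, t):
    """Return the word whose [start, end) interval contains time t, or None."""
    for w in words:
        if w["start"] <= t < w["end"]:
            return w["word"]
    return None

words = [
    {"word": "transcribed", "start": 0.0, "end": 0.5},
    {"word": "text", "start": 0.5, "end": 0.9},
]
word_at(words, 0.6)  # "text"
```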

Complete Example

curl -X POST http://localhost:2008/api/objectGen \
  -H "Authorization: Bearer your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Transcribe with detailed metadata",
    "definition": {
      "type": "object",
      "properties": {
        "detailedTranscription": {
          "type": "byte",
          "speechToText": {
            "audioToTranscribe": "<base64-encoded-audio>",
            "language": "en",
            "model": "whisper-1",
            "responseFormat": "verbose_json"
          }
        }
      }
    }
  }'

Use Cases for Verbose Data

The verbose metadata enables advanced use cases:

  1. Subtitle Generation: Use segment timing to create accurate subtitles with precise timestamps
  2. Word-Level Synchronization: Sync visual elements or animations with specific words in the audio
  3. Quality Assessment: Analyze avg_logprob and no_speech_prob to assess transcription confidence and reliability
  4. Speaker Analysis: With diarized_json, identify and separate different speakers in conversations
  5. Audio Editing: Use timestamps to programmatically edit or segment audio files based on content
  6. Accessibility: Create enhanced captions with precise timing for improved accessibility
  7. Content Analysis: Analyze speech patterns, pauses, and segment structure for content insights
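The subtitle-generation use case above can be sketched by mapping segment timings onto the SRT format (this is a minimal client-side example, not an ObjectWeaver feature; the toCaptions option can produce SRT server-side):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build an SRT document from verbose_json segments."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"

segments = [{"start": 0.0, "end": 3.5, "text": "This is the first segment."}]
srt = segments_to_srt(segments)
# 1
# 00:00:00,000 --> 00:00:03,500
# This is the first segment.
```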

Error Handling

Common errors include missing API keys, invalid base64 data, or provider rate limits.

Error Response

{
  "error": "descriptive error message",
  "metadata": {
    "tokensUsed": 0,
    "cost": 0,
    "modelUsed": "model-name"
  }
}
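A caller can distinguish success from failure by checking for the error key before touching value. A minimal sketch (the exception type is illustrative):

```python
class ByteOperationError(Exception):
    """Raised when a byte-operation response contains an error."""

def unwrap(response: dict) -> str:
    """Return the base64 value, or raise if the response reports an error."""
    if "error" in response:
        raise ByteOperationError(response["error"])
    return response["value"]

ok = unwrap({"value": "aGVsbG8=", "metadata": {"cost": 0}})  # "aGVsbG8="
```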

Best Practices

  • Image Generation: Use descriptive prompts and select appropriate aspect ratios.
  • Text-to-Speech: Select voices that match your application's tone.
  • Speech-to-Text: Use verbose_json for timing data and provide a prompt for context to improve accuracy.
  • Performance: Process byte operations in parallel with other fields and cache results where possible.