Overview

The Voice API manages voice samples and generates voice embeddings for text-to-speech. Upload a voice recording, and the service automatically splits it into samples and creates embeddings via a processing pipeline.

Typical Flow

  1. Upload a voice recording to create a new voice
  2. The recording is automatically split into Opus-encoded samples
  3. An embedding is generated from the samples via RunPod
  4. Retrieve test samples and quality scores to verify the result
  5. Download the embedding zip or individual samples as needed
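
Because steps 2 and 3 run asynchronously after the upload, a client usually has to wait before it can fetch test samples or scores. The following is a minimal polling sketch, not the API's documented client. The status names ("splitting", "embedding", "ready") and the get_status callable are assumptions made for illustration; substitute whatever status field and request the API actually exposes.

```python
import time
from typing import Callable

# Hypothetical status names; the real API may report different values.
IN_PROGRESS = {"splitting", "embedding"}
READY = "ready"


def wait_until_ready(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until the voice leaves its processing states.

    get_status is supplied by the caller and returns the voice's current
    status as a string (for example, by issuing a GET request for the voice).
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == READY:
            return status
        if status not in IN_PROGRESS:
            # Any unknown or terminal state other than READY is treated as a failure.
            raise RuntimeError(f"voice processing ended in state {status!r}")
        if clock() >= deadline:
            raise TimeoutError("voice was still processing when the timeout expired")
        sleep(poll_interval)
```

Passing sleep and clock as parameters keeps the helper easy to test without real delays; in normal use the defaults are enough.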

Features

  • Automatic processing: Upload an audio file, and splitting and embedding run automatically
  • Manual sample upload: Create an empty voice and upload individual samples
  • Re-splitting: Re-process the original recording with custom parameters (sample count, duration)
  • Quality scoring: Each sample receives a similarity score after embedding
  • Test samples: TTS-generated test audio to preview voice quality
  • Sample bundles: Download all samples as a zip archive with metadata

Important Considerations

  • Audio conversion: All uploads are automatically converted to mono Opus at 96 kbps
  • Processing locks: Voices cannot be modified while splitting or embedding is in progress
  • Rate limits: Requests are rate-limited per API key
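
Processing locks and rate limits both mean that a valid request can be rejected temporarily, so clients may want to retry with a growing delay. The sketch below shows one way to do that under stated assumptions: 429 is the standard HTTP status for rate limiting, while treating 409 as "voice is locked" is a guess and not documented behaviour of this API. The send callable is a hypothetical stand-in for a single request.

```python
import time
from typing import Callable, Tuple

# Assumed status codes: 429 is the standard HTTP code for rate limiting;
# 409 for a locked voice is a guess, not documented behaviour.
RETRYABLE = {409, 429}


def call_with_retry(
    send: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send performs one request and returns (status_code, body).
    The delay doubles after each retryable response: base, 2*base, 4*base, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        # Sleep only if another attempt will follow.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # All attempts were rejected; hand the last response back to the caller.
    return status, body
```

If the API returns a Retry-After header on rate-limited responses, honouring that value would be preferable to a fixed doubling schedule.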