93 changes: 84 additions & 9 deletions docs/docs/natural-language-processing/useSpeechToText.md


21 changes: 14 additions & 7 deletions docs/docs/typescript-api/SpeechToTextModule.md
@@ -37,20 +37,27 @@ const transcribedText = await SpeechToTextModule.transcribe(waveform);

### Methods

| Method                | Type                                                                                                                                                                                                                                                                                                       | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `load`                | <code>(modelName: 'whisper' &#124; 'moonshine' &#124; 'whisperMultilingual', transcribeCallback?: (sequence: string) => void, modelDownloadProgressCallback?: (downloadProgress: number) => void, encoderSource?: ResourceSource, decoderSource?: ResourceSource, tokenizerSource?: ResourceSource)</code> | Loads the model specified by `modelName`. `encoderSource`, `decoderSource`, and `tokenizerSource` are strings specifying the locations of the model binaries. `modelDownloadProgressCallback` lets you monitor the progress of the model download, while `transcribeCallback` is invoked with each generated token.                                                                                                                                                                                                                                                                                                                                                              |
| `transcribe`          | `(waveform: number[], audioLanguage?: SpeechToTextLanguage): Promise<string>`                                                                                                                                                                                                                              | Starts a transcription process for the given input array, which should be a waveform sampled at 16 kHz. Resolves with the output transcription when the model is finished. For multilingual models you have to pass `audioLanguage`, the language spoken in the audio.                                                                                                                                                                                                                                                                                                                                                                                                           |
| `streamingTranscribe` | `(streamingAction: STREAMING_ACTION, waveform?: number[], audioLanguage?: SpeechToTextLanguage) => Promise<string>`                                                                                                                                                                                        | Runs the transcription process online, i.e. when the whole audio is not known beforehand, such as when transcribing a live microphone feed. `streamingAction` defines the type of packet sent to the model: <ul><li>`START` - initializes the process and allows optional `waveform` data,</li><li>`DATA` - should contain consecutive audio chunks sampled at 16 kHz,</li><li>`STOP` - the last data chunk for this transcription; ends the process and flushes internal buffers.</li></ul> Each call returns the most recent transcription. Returns an error when called while the module is busy (e.g. processing a `transcribe` call). See the sketch after the type definitions below. |
| `encode`              | `(waveform: number[]) => Promise<number[]>`                                                                                                                                                                                                                                                                | Runs the encoder part of the model. Returns a float array representing the encoder output.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `decode`              | `(tokens: number[], encodings?: number[]) => Promise<number[]>`                                                                                                                                                                                                                                            | Runs the decoder of the model. Returns a single token, the next token in the output sequence. If `encodings` are provided they are used for the decoding process; otherwise the cached encodings from the most recent `encode` call are used. The cached option is much faster due to the large overhead of communication between the native and React layers.                                                                                                                                                                                                                                                                                                                   |
| `configureStreaming`  | <code>(overlapSeconds?: number, windowSize?: number, streamingConfig?: 'fast' &#124; 'balanced' &#124; 'quality') => void</code>                                                                                                                                                                           | Configures the streaming algorithm: <ul><li>`overlapSeconds` determines how much adjacent audio chunks overlap (increasing it slows down transcription and decreases the probability of odd wording at chunk intersections; values larger than 3 seconds are generally discouraged),</li><li>`windowSize` sets the size of the audio chunks (increasing it speeds up end-to-end transcription but increases latency until the first token is returned),</li><li>`streamingConfig` selects predefined values for `windowSize` and `overlapSeconds`.</li></ul> Keep `windowSize + 2 * overlapSeconds <= 30`.                                                                        |

<details>
<summary>Type definitions</summary>

```typescript
type ResourceSource = string | number | object;

enum STREAMING_ACTION {
  START,
  DATA,
  STOP,
}

enum SpeechToTextLanguage {
  Afrikaans = 'af',
  Albanian = 'sq',
  // ... (remaining languages truncated in this diff)
}
```

</details>
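
To make the `START`/`DATA`/`STOP` protocol concrete, here is a minimal sketch of an online transcription loop. It assumes `SpeechToTextModule` and `STREAMING_ACTION` are exported by `react-native-executorch`, and it uses a hypothetical `getNextMicChunk` helper (not part of the library) that resolves with the next 16 kHz audio chunk, or `null` once recording stops.

```typescript
import {
  SpeechToTextModule,
  STREAMING_ACTION,
} from 'react-native-executorch';

// Hypothetical helper: yields consecutive 16 kHz PCM chunks from the
// microphone, or null when the user stops recording.
declare function getNextMicChunk(): Promise<number[] | null>;

async function transcribeLiveFeed(): Promise<string> {
  await SpeechToTextModule.load('moonshine');

  // Optional: 1 s overlap, 7 s window (7 + 2 * 1 <= 30 holds).
  SpeechToTextModule.configureStreaming(1, 7);

  // START initializes the streaming session.
  await SpeechToTextModule.streamingTranscribe(STREAMING_ACTION.START);

  let chunk = await getNextMicChunk();
  while (chunk !== null) {
    // Each DATA call feeds one consecutive chunk and resolves with the
    // most recent transcription so far.
    const partial = await SpeechToTextModule.streamingTranscribe(
      STREAMING_ACTION.DATA,
      chunk
    );
    console.log('partial transcription:', partial);
    chunk = await getNextMicChunk();
  }

  // STOP flushes internal buffers, ends the session, and resolves with
  // the final transcription.
  return SpeechToTextModule.streamingTranscribe(STREAMING_ACTION.STOP);
}
```

Since `streamingTranscribe` returns an error while the module is busy, the loop awaits each call before feeding the next chunk.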
7 changes: 6 additions & 1 deletion examples/llm/App.tsx
@@ -14,9 +14,11 @@ import {
} from 'react-native';
import LLMScreen from './screens/LLMScreen';
import LLMToolCallingScreen from './screens/LLMToolCallingScreen';
import VoiceChatScreen from './screens/VoiceChatScreen';

enum Mode {
LLM,
LLM_VOICE_CHAT,
LLM_TOOL_CALLING,
}

@@ -39,6 +41,9 @@ export default function App() {
case Mode.LLM:
return <LLMScreen setIsGenerating={setIsGenerating} />;

case Mode.LLM_VOICE_CHAT:
return <VoiceChatScreen setIsGenerating={setIsGenerating} />;

case Mode.LLM_TOOL_CALLING:
return <LLMToolCallingScreen setIsGenerating={setIsGenerating} />;

@@ -61,7 +66,7 @@
{!isGenerating ? (
<View style={styles.wheelPickerContainer}>
<ScrollPicker
dataSource={['Chat with LLM', 'Tool calling']}
dataSource={['Chat with LLM', 'Talk to LLM', 'Tool calling']}
onValueChange={(_, selectedIndex) => {
handleModeChange(selectedIndex);
}}
4 changes: 4 additions & 0 deletions examples/llm/assets/icons/mic_icon.svg
4 changes: 4 additions & 0 deletions examples/llm/assets/icons/stop_icon.svg