TRVFFmpegSpeechToTextProperty.VADModelFileName, VADThreshold, VADMinSpeechDuration, VADMinSilenceDuration

Properties that control VAD (voice activity detection).

property VADModelFileName: TFileName;
property VADThreshold: Cardinal;
property VADMinSpeechDuration: Cardinal;
property VADMinSilenceDuration: Cardinal;

VADModelFileName is the path to the VAD model file. If set, an additional voice activity detection module will be used.

VAD models can be downloaded from https://huggingface.co/ggml-org/whisper-vad/tree/main. More info: https://github.com/snakers4/silero-vad.

VAD models are used to detect segments of audio that contain speech and run speech recognition only on those segments. As a result, they provide two main benefits. First, they reduce CPU/GPU workload. Second, they help prevent speech recognition model hallucinations, where the model may generate phrases that are not actually present in the audio when the input consists mostly of noise.

On the other hand, using a VAD model requires accumulating significantly more audio before processing, which means setting the BufferDuration property to a larger value (such as 20000). This increases the latency before recognized text becomes available.

VADThreshold is the voice activity detection threshold, in the range from 0 to 100.

VADMinSpeechDuration is the minimum speech duration for VAD, in milliseconds (minimum value: 20).

VADMinSilenceDuration is the minimum silence duration for VAD, in milliseconds (minimum value: 0).

If the values of these properties are changed during a speech recognition session, the new values are not used in that session. They will be used the next time speech recognition is run. See Active.

Default values:

VADModelFileName: '' (empty string)

VADThreshold: 50

VADMinSpeechDuration: 100

VADMinSilenceDuration: 500
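
Example:

The following is a minimal sketch of how these properties might be assigned before speech recognition is started. ConfigureVAD is a hypothetical helper, the model path is only an example, and the code assumes that BufferDuration belongs to the same object. The unit that declares TRVFFmpegSpeechToTextProperty is not named on this page, so the uses clause is omitted.

```pascal
// Sketch only: add the unit that declares TRVFFmpegSpeechToTextProperty
// to the uses clause of your own unit.
procedure ConfigureVAD(STT: TRVFFmpegSpeechToTextProperty);
begin
  // Hypothetical path; point it at a VAD model file you have downloaded.
  STT.VADModelFileName := 'C:\Models\vad-model.bin';
  STT.VADThreshold := 50;            // range 0..100; 50 is the default
  STT.VADMinSpeechDuration := 100;   // milliseconds, minimum 20
  STT.VADMinSilenceDuration := 500;  // milliseconds, minimum 0
  // Assumption: BufferDuration is a property of the same object.
  // VAD needs a larger buffer, at the cost of higher latency.
  STT.BufferDuration := 20000;
end;
```

Because changed values are not applied to a running session, a helper like this would need to be called before speech recognition is activated.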