RVMedia 12 – FFmpeg 8, speech to text (Whisper)

RVMedia 12.0 has been released.

The trial version can be downloaded from https://www.trichview.com/download/

The full version can be found in the protected section of the forum. This update is free for customers who ordered or renewed RVMedia in 2024-2026, and for customers with an RVMedia subscription.

Main new features after the last released version (v11.2):

support for FFmpeg 8
speech recognition using the Whisper AI model, included in FFmpeg 8
more efficient decoding of video frames from local cameras on Windows and Linux
tweaks and fixes (described in the protected section of this forum)

A reminder of other changes made after version 11.0:

UDP streaming with FFmpeg
remuxing (remultiplexing – saving video without changing formats of audio and video streams) or streaming to two destinations at the same time
advanced error handling
support for Delphi and C++Builder 13
support for 64-bit Delphi IDE

See the RVMedia version history: https://www.trichview.com/help-media/version_history.htm

FFmpeg 8

This release adds support for FFmpeg 8.
RVMedia now supports FFmpeg versions 1 through 8.

The RVMedia installation now includes an FFmpeg 8.1.1 build for Windows 64-bit with Whisper support (see below). This build is compatible with the LGPL license.
Options that require the GPL license have been removed, since they can only be used in open-source applications.
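
If you use your own FFmpeg build instead of the bundled one, it can be useful to check how it was configured. The sketch below parses the "configuration:" line printed by `ffmpeg -version` and reports whether the Whisper filter and GPL-only options appear to be enabled. It assumes the build records its configure flags in that line and that the relevant flags are `--enable-whisper` and `--enable-gpl`; the function names are illustrative and are not part of RVMedia.

```python
import subprocess


def parse_ffmpeg_build_flags(version_output: str) -> set[str]:
    """Return the configure flags listed in `ffmpeg -version` output."""
    for line in version_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("configuration:"):
            # Everything after the first colon is a space-separated flag list.
            return set(stripped.split(":", 1)[1].split())
    return set()


def describe_build(version_output: str) -> dict[str, bool]:
    """Report whether the build looks Whisper-enabled and/or GPL-licensed."""
    flags = parse_ffmpeg_build_flags(version_output)
    return {
        "whisper": "--enable-whisper" in flags,  # assumed configure flag name
        "gpl": "--enable-gpl" in flags,
    }


def read_ffmpeg_version(executable: str = "ffmpeg") -> str:
    """Run `ffmpeg -version` and return its text output (not called on import)."""
    result = subprocess.run([executable, "-version"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A typical call would be `describe_build(read_ffmpeg_version())`. A build suitable for closed-source applications would be expected to report `gpl` as `False`.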

Speech to text

Whisper: AI speech recognition

Whisper is a free, open-source speech recognition and transcription AI model developed by OpenAI. It is designed to convert spoken language into text.
RVMedia can use the Whisper implementation integrated into FFmpeg 8 and later.
The Whisper code is included in FFmpeg. However, a model file is also required.

Speech-to-text conversion is performed entirely on the user’s computer and does not require any online services or API keys. All that is needed is a speech recognition model file.
The RVMedia installation includes the smallest available English-only model. While it is not very suitable for real-world use, it allows you to test speech recognition functionality and can run even on relatively low-end computers.

Additional models can be downloaded here: https://huggingface.co/ggerganov/whisper.cpp/tree/main

Larger model files provide better recognition accuracy, but they also require more powerful hardware. Ideally, the user should have a modern high-performance GPU. However, even without a GPU, the smaller models can be used on the CPU.

The available model files are divided into:

English-only models (their filenames contain “en”),
multilingual models, which support many languages.
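
As an illustration of the naming rule above, here is a minimal sketch that sorts downloaded model files into English-only and multilingual groups. It assumes that "en" appears as a separate token in the filename, delimited by dots or hyphens (for example `ggml-tiny.en.bin`). The helper names are hypothetical, not part of RVMedia or FFmpeg.

```python
import re
from pathlib import Path


def is_english_only(model_path: str) -> bool:
    """True if the filename contains 'en' as a dot- or hyphen-separated token."""
    name = Path(model_path).name.lower()
    tokens = re.split(r"[.-]", name)
    return "en" in tokens


def group_models(paths: list[str]) -> dict[str, list[str]]:
    """Split model file paths into English-only and multilingual groups."""
    groups: dict[str, list[str]] = {"english_only": [], "multilingual": []}
    for path in paths:
        key = "english_only" if is_english_only(path) else "multilingual"
        groups[key].append(path)
    return groups
```

Matching on a whole token rather than on the substring "en" avoids misclassifying a multilingual file whose name merely happens to contain those two letters.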

Voice Activity Detection (Optional)

In addition to the main models that perform speech recognition, FFmpeg can optionally use VAD (Voice Activity Detection) AI models.

These models detect when speech starts and ends in the audio stream, allowing the main recognition model to run only when necessary. This provides two important benefits:

more efficient use of CPU/GPU resources;
reduced risk of recognizing noise as speech (so-called hallucinations of the speech recognition model). Unfortunately, Whisper is prone to this problem, especially when using multilingual models.

The drawback of this approach is that it requires significantly more audio to be buffered before recognition can begin. As a result, recognized text becomes available with greater latency.
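
To make the trade-off concrete, the sketch below builds an argument string for FFmpeg's `whisper` audio filter, with and without a VAD model. It targets the FFmpeg command line directly, not the RVMedia API. The option names used here (`model`, `language`, `queue`, `vad_model`, `destination`, `format`) are taken from the FFmpeg filter and should be checked against the documentation of your FFmpeg build. The queue length of 20 seconds in the usage note is only an illustrative value.

```python
# Characters with special meaning in FFmpeg filter syntax. This sketch rejects
# them instead of attempting to escape them.
UNSAFE_CHARS = set(":'\\,;[]")


def build_whisper_filter(model, language="en", queue_seconds=3,
                         vad_model=None, destination=None, fmt="srt"):
    """Return a `whisper=...` filter string for use with `ffmpeg -af`."""
    options = [
        ("model", model),
        ("language", language),
        ("queue", queue_seconds),  # seconds of audio buffered before recognition
    ]
    if vad_model is not None:
        options.append(("vad_model", vad_model))
    if destination is not None:
        options.append(("destination", destination))
        options.append(("format", fmt))
    for key, value in options:
        if UNSAFE_CHARS & set(str(value)):
            raise ValueError(f"option {key!r} contains characters that need escaping")
    return "whisper=" + ":".join(f"{key}={value}" for key, value in options)
```

For example, `build_whisper_filter("ggml-base.en.bin", queue_seconds=20, vad_model="vad.bin", destination="out.srt")` produces a string that could be passed as `ffmpeg -i input.mp4 -vn -af "<filter>" -f null -`. The longer queue gives the VAD model enough audio to find speech boundaries, at the cost of the extra latency described above.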

Speech-to-text conversion in RVMedia

Speech recognition is integrated into RVMedia in two places.

First, the TRVCamera component can perform speech recognition when it receives video with audio using FFmpeg (note that this requires FFmpeg 8 or later built with Whisper support). In this case, speech recognition runs simultaneously with receiving video. It can be enabled or disabled at any time while the video is being received.
Speech recognition settings are available in TRVCamera.FFmpegProperty.SpeechToText: TRVFFmpegSpeechToTextProperty. Recognized text is returned through the TRVCamera.OnSpeechRecognized event.

Second, speech recognition is available in the TRVAudioPlayer component, in addition to its audio playback and recording capabilities. In this case, the audio data may come from any RVMedia audio source, including:

TRVMicrophone (a microphone or other audio input device),
TRVCamera + TRVCamSound (video with audio),
TRVCamReceiver (audio received over the network).

In TRVAudioPlayer, speech recognition works independently of audio playback, but is tied to the recording functionality. If TRVAudioPlayer is recording audio to a file, speech recognition can be enabled in addition to recording. However, the component can also perform speech recognition without recording audio to a file.
Speech recognition settings are available in TRVAudioPlayer.SpeechToTextProperty: TRVFFmpegSpeechToTextProperty. Recognized text is returned through the TRVAudioPlayer.OnSpeechRecognized event.

Local (USB) cameras

This update significantly optimizes the decoding of frames received from local webcams. This applies to Windows and Linux, where RVMedia performs frame decoding itself. (On macOS, RVMedia uses the operating system’s built-in decoding facilities.) As a result, CPU usage is significantly reduced, and in some cases a higher frame rate can be achieved.

Support for MJPEG modes of local cameras has also been added on Linux. These modes are typically more efficient than other camera formats.

Demo projects

Demos\Recording\SpeechToText\

A new speech recognition demo has been added in three versions: VCL, Lazarus, and FireMonkey.

List of sample cameras

The list of public cameras used in many demo projects has been updated. Non-working cameras have been removed, and new cameras have been added.

Compiled RVMedia demo projects (VCL for Windows) can be downloaded from https://www.trichview.com/download/mediademo.html

Sergey Tkachenko

Published by

Sergey Tkachenko

2 months ago
