Phonix: AI-powered video captions

08 May 2023

[Image: a robot transcribing a human]

If you create videos for social media (or just consume them), you may have noticed that captions make the content more engaging and help hold the audience’s attention. Even though I’m a fairly advanced English speaker myself, I still prefer to watch subtitled movies and TV shows. When OpenAI announced Whisper, I had to give it a try and incorporate it into a project. My first idea was to build an automatic translator, but that was too complicated, so I went for the next best thing, one I’d have more frequent use for: captions for my videos.

Phonix is a Python program that uses OpenAI’s API to generate captions for videos. It uses Whisper for the speech-to-text part and comes with a CLI as well as an easy-to-use GUI. For the GUI, I gave PySimpleGUI a try and was impressed: it was rather easy to get going, even though some parts of its API weren’t very intuitive.
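
To give a sense of how little PySimpleGUI code a basic window takes, here is a stripped-down sketch; the layout and keys below are illustrative, not Phonix’s actual window:

```python
import PySimpleGUI as sg

# Minimal window: pick a video, paste an API key, choose a language.
# The real Phonix GUI has more options (caption format, prompt,
# transcribe vs. translate), but the pattern is the same.
layout = [
    [sg.Text("Video file"), sg.Input(key="video"), sg.FileBrowse()],
    [sg.Text("OpenAI API key"), sg.Input(key="api_key", password_char="*")],
    [sg.Text("Language"), sg.Combo(["en", "el", "de"], default_value="en", key="language")],
    [sg.Button("Transcribe"), sg.Button("Exit")],
]

window = sg.Window("Phonix", layout)
while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, "Exit"):
        break
    if event == "Transcribe":
        # Hand the selections off to the captioning code.
        print(values["video"], values["language"])
window.close()
```
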
The idea is that you select a video file, provide your OpenAI API key, pick the language and the caption format, optionally provide a prompt to give Whisper some context, and finally choose whether you want the video transcribed or translated.
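
The call to Whisper itself boils down to something like the sketch below, written against the openai Python package as it stands at the time of writing (the 0.27-style API); the file name and prompt are placeholders, not Phonix’s actual code:

```python
import openai

openai.api_key = "sk-..."  # your OpenAI API key

# "talk_audio.mp3" is the audio track extracted from the video (see below).
with open("talk_audio.mp3", "rb") as audio_file:
    transcript = openai.Audio.transcribe(
        model="whisper-1",
        file=audio_file,
        language="en",  # language spoken in the video
        prompt="A short tech talk about Python tooling and OpenAI's Whisper API.",
        # response_format="srt" or "vtt" asks for timestamped captions directly
    )

print(transcript["text"])

# Translation to English uses openai.Audio.translate("whisper-1", audio_file, ...)
# instead of transcribe.
```
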
Under the hood, pydub and ffmpeg (which you may need to install separately) extract the audio from the video and downsample it if it’s over 25 MB, the maximum file size the Whisper API currently accepts.
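
That extraction step looks roughly like this; the paths, sample rate and bitrate are illustrative choices, not necessarily the ones Phonix makes:

```python
import os
from pydub import AudioSegment

MAX_BYTES = 25 * 1024 * 1024  # the Whisper API's file-size limit

# pydub shells out to ffmpeg, so it can read the video container
# directly and keep only the audio track.
audio = AudioSegment.from_file("my_video.mp4")
audio.export("talk_audio.mp3", format="mp3")

# If the exported audio is still too big, downsample to mono at a
# lower sample rate and bitrate and try again.
if os.path.getsize("talk_audio.mp3") > MAX_BYTES:
    smaller = audio.set_channels(1).set_frame_rate(16000)
    smaller.export("talk_audio.mp3", format="mp3", bitrate="64k")
```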

Compared to YouTube’s and LinkedIn’s automatic captions, my very subjective opinion is that Whisper is more accurate, especially with jargon and technical terms. That is, if you provide a prompt for some context: a simple sentence that describes the video’s topic and perhaps mentions a few key terms is enough to make a difference.
To try Phonix out, you will need to get an OpenAI API key and install the various dependencies. I’ve tried it on both Windows and Ubuntu.

Check out a demo: