If you create videos for social media (or just consume them), you may have noticed that captions make the content more engaging and help the audience stay focused. Even though I am a fairly advanced English speaker, I still prefer to watch movies and TV shows with subtitles. When OpenAI announced Whisper, I had to give it a try and incorporate it into a project. My first idea was to create an automatic translator, but that was too complicated, so I went for the next best thing that I would use more often: captions for my videos.
Phonix is a Python program that uses OpenAI’s API
to generate captions for videos.
It uses Whisper for the speech-to-text part and comes with a CLI as well as an easy-to-use GUI.
For the GUI, I gave PySimpleGUI a try and was
impressed since it was rather easy to get going, even though some parts of its API weren’t
The idea is that you select a video file, provide your OpenAI API key, select the language, the format of the captions, optionally provide a prompt to give Whisper some context and finally, choose whether you want to transcribe or translate the video.
Phonix uses ffmpeg (which you may need to install separately) to
extract the audio from the video and downsample it if it’s over 25 MB, which is the
maximum size the Whisper API currently allows.
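
To make the audio step more concrete, here is a minimal sketch of how the extraction and the size check could look. It is not Phonix's actual code: the function names, the `audio.mp3` output name, the 32k bitrate, the 16 kHz mono settings, and reading 25 MB as 25 * 1024 * 1024 bytes are all assumptions made for illustration. Only the ffmpeg flags themselves (`-y`, `-i`, `-vn`, `-ar`, `-ac`, `-b:a`) are standard.

```python
import os
import subprocess

# Assumption: the 25 MB limit is interpreted as 25 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def build_extract_command(video_path, audio_path, bitrate=None, sample_rate=None):
    """Return an ffmpeg argument list that writes only the audio track."""
    cmd = ["ffmpeg", "-y", "-i", video_path, "-vn"]  # -vn drops the video stream
    if sample_rate is not None:
        # Resample and mix down to mono to shrink the file.
        cmd += ["-ar", str(sample_rate), "-ac", "1"]
    if bitrate is not None:
        cmd += ["-b:a", bitrate]  # target audio bitrate, e.g. "32k"
    cmd.append(audio_path)
    return cmd


def needs_downsampling(size_bytes, limit=MAX_UPLOAD_BYTES):
    """True when the file is larger than the upload limit."""
    return size_bytes > limit


def extract_audio(video_path, audio_path="audio.mp3"):
    """Extract the audio and re-encode it at lower quality if it is too large."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
    if needs_downsampling(os.path.getsize(audio_path)):
        # Hypothetical fallback settings; very long videos may still exceed the limit.
        smaller = build_extract_command(
            video_path, audio_path, bitrate="32k", sample_rate=16000
        )
        subprocess.run(smaller, check=True)
    return audio_path
```

Calling `extract_audio("talk.mp4")` would require ffmpeg on your `PATH`. Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.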
In my very subjective opinion, Whisper is more accurate than YouTube’s and LinkedIn’s
automatic captions, especially with jargon and technical terms.
That is, if you provide a prompt for some context. A simple sentence that
describes the video’s topic and potentially mentions key terms is enough to
make a difference.
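
As a rough illustration of how the prompt and the other options could be passed along, here is a sketch that assumes the official `openai` Python package (version 1 or later) and the `whisper-1` model. The helper names `build_request` and `generate_captions` are hypothetical and not taken from Phonix itself.

```python
def build_request(language, caption_format, prompt=None, translate=False):
    """Return (endpoint name, keyword arguments) for a Whisper API call."""
    kwargs = {"model": "whisper-1", "response_format": caption_format}
    if prompt:
        # A short sentence naming the topic and key terms of the video.
        kwargs["prompt"] = prompt
    if translate:
        # The translations endpoint always outputs English and
        # takes no language parameter.
        return "translations", kwargs
    kwargs["language"] = language  # ISO-639-1 code, e.g. "en"
    return "transcriptions", kwargs


def generate_captions(audio_path, api_key, **options):
    """Send the audio file to the API and return the response."""
    # Imported here so build_request can be used without the package installed.
    from openai import OpenAI

    client = OpenAI(api_key=api_key)
    endpoint, kwargs = build_request(**options)
    if endpoint == "translations":
        resource = client.audio.translations
    else:
        resource = client.audio.transcriptions
    with open(audio_path, "rb") as audio_file:
        return resource.create(file=audio_file, **kwargs)
```

For example, `generate_captions("audio.mp3", api_key, language="en", caption_format="srt", prompt="A walkthrough of Kubernetes and Helm charts.")` would request SRT captions, with the prompt nudging Whisper toward the right spelling of those terms.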
To try Phonix out you will need to get an OpenAI API key and install the various dependencies. I’ve tried it on both Windows and Ubuntu.
Check out a demo: