─── 🖊⋅🖊⋅🖊 ───

Happy new year! I’m trying to proof-of-concept something: let’s say there’s a series of technical talks, quality content, and they’re all recorded and posted online. But they get low watch time. What are ways in which we can reformat the content to make it easier to consume in a fast-moving world?

Idea 1: use AI to quickly generate multiple shorter clips

The goal is that a viewer can watch some clips casually, and decide to watch more if they have time and interest. Ideally, it’d be a video that is edited across multiple moments in a talk, but I’ll just start with clipping out continuous 30s or 1m of content.

After some initial brainstorming with Perplexity, the target workflow looks like this:

  1. download video
  2. extract audio with ffmpeg
  3. put the audio through whisperx to extract timestamps in a .srt file (turns out this is a standard subtitle format!)
  4. Put the .srt into AI to identify insightful portions to clip out
  5. Use ffmpeg to clip out those portions based on the timestamp
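Steps 3 to 5 hinge on turning subtitle timestamps into clip boundaries. Here is a minimal sketch of that glue, assuming the subtitles are plain SubRip (.srt) text. The names `Cue`, `parse_srt` and `candidate_windows` are made up for illustration, and the greedy grouping is just a stand-in for whatever the AI step ends up picking.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the start of the talk
    end: float    # seconds from the start of the talk
    text: str


# An SRT timing line looks like: 00:44:25,791 --> 00:45:42,936
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                g = match.groups()
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def candidate_windows(cues: List[Cue], max_len: float = 60.0) -> List[Cue]:
    """Greedily merge consecutive cues into windows of about max_len seconds.

    A single cue longer than max_len is kept as its own window.
    """
    windows, current = [], []
    for cue in cues:
        if current and cue.end - current[0].start > max_len:
            windows.append(Cue(current[0].start, current[-1].end,
                               " ".join(c.text for c in current)))
            current = []
        current.append(cue)
    if current:
        windows.append(Cue(current[0].start, current[-1].end,
                           " ".join(c.text for c in current)))
    return windows
```

Each window's `start` and `end` can then be handed to ffmpeg as the clip boundaries, and its `text` is what you would show to the AI to judge whether the window is worth clipping.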

Licensing

Some important questions to ask:

FFmpeg

What a useful tool! I used it to extract the audio into an mp3 file, and used a simple command to clip out various timestamps. I even added a nifty little fade-out effect over the last 2 seconds, in both video and audio. See below.

ffmpeg -ss 00:44:25.791 -to 00:45:42.936 -i bio-lecture.mp4   -vf "fade=t=out:st=75:d=2"   -af "afade=t=out:st=75:d=2"   sugars.mp4
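The `st=75` in that command is not arbitrary: the clip is about 77 seconds long, and because `-ss` and `-to` come before `-i`, the output timestamps start near zero, so the fade has to begin 2 seconds before the end of the clip rather than at the original lecture timestamp. Here is a small sketch that does the arithmetic for any clip. The helper names `to_seconds` and `clip_command` are my own, and it only builds the argument list without running anything.

```python
from typing import List


def to_seconds(timestamp: str) -> float:
    """Convert an HH:MM:SS.mmm timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_command(src: str, start: str, end: str, dst: str,
                 fade: float = 2.0) -> List[str]:
    """Build an ffmpeg command that cuts [start, end] and fades out at the end."""
    duration = to_seconds(end) - to_seconds(start)
    # The fade start is relative to the clip, not to the full lecture.
    fade_start = max(duration - fade, 0.0)
    return [
        "ffmpeg", "-ss", start, "-to", end, "-i", src,
        "-vf", f"fade=t=out:st={fade_start:.3f}:d={fade:g}",
        "-af", f"afade=t=out:st={fade_start:.3f}:d={fade:g}",
        dst,
    ]
```

To actually run it, something like `subprocess.run(clip_command("bio-lecture.mp4", "00:44:25.791", "00:45:42.936", "sugars.mp4"), check=True)` should do, assuming ffmpeg is on your PATH.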

whisperx

I’m on a GTX 1060 6GB, which is old and weak. I spent a lot of time fiddling with versions and such. To get whisperx to work (WhisperX: Automatic Speech Recognition with Word-level Timestamps), I had to install these things:

  • CUDA
  • cuDNN
  • whisperx
  • ctranslate2
  • torch, torchaudio, torchvision
  • libcudnn

I learned that when in doubt, just delete your virtual environment and do a clean install. Also, WSL is so annoying.

There are some interesting options you can put into whisperX. I ran it like this:

whisperx audio.wav   --model large-v2   --batch_size 1   --compute_type int8 --language en --vad_method silero --device cpu
  • gave up eventually on GPU so I swapped to CPU compute
  • large model seems to be ok
  • keep a low batch_size b/c weak GPU
  • compute_type could only be int8 on my setup. The next step up would be float16.
  • silero vad is a voice activity detector. cool! basically, it detects where speech starts and stops, so the silences and breaks between sentences.
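If you switch between CPU and GPU a lot, it may help to keep those options in one place. This is only a sketch: `whisperx_command` is a made-up helper, it reuses exactly the flags from the command above, and the rule "float16 on cuda, int8 otherwise" is an assumption that only holds for a GPU that actually supports float16 (mine did not cooperate).

```python
from typing import List


def whisperx_command(audio: str, device: str = "cpu") -> List[str]:
    """Build the whisperx CLI call, choosing compute_type from the device."""
    # Assumption: float16 only makes sense on a capable GPU; fall back to int8.
    compute_type = "float16" if device == "cuda" else "int8"
    return [
        "whisperx", audio,
        "--model", "large-v2",
        "--batch_size", "1",          # kept low for weak hardware
        "--compute_type", compute_type,
        "--language", "en",
        "--vad_method", "silero",
        "--device", device,
    ]
```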

While whisperX was running, the terminal printed the transcription in 30-second chunks (btw, runtime wasn’t too bad: I got a chunk every 5s or so), so I was worried the timestamps wouldn’t be fine-grained enough. Thankfully, after transcribing, it spends some time on an alignment pass, so the final result is subtitles timestamped to roughly a sentence or two each, which is much more natural. In the future, I’ll try burning the subtitles back onto the video, so I can have a closed-captions version.
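One wrinkle with captions on a clip: the subtitle timestamps are relative to the full lecture, while the clip starts at zero. Here is a sketch of how the cues could be shifted and re-serialized as SRT for a single clip. The `Segment` tuple layout and the function names are assumptions of mine, not whisperX's output format.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def clip_srt(segments: List[Segment], clip_start: float, clip_end: float) -> str:
    """Keep cues overlapping the clip and shift them so the clip starts at 0."""
    blocks = []
    for start, end, text in segments:
        if end <= clip_start or start >= clip_end:
            continue  # cue lies entirely outside the clip
        new_start = max(start, clip_start) - clip_start
        new_end = min(end, clip_end) - clip_start
        blocks.append(
            f"{len(blocks) + 1}\n{srt_time(new_start)} --> {srt_time(new_end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

The resulting file could then be burned in with ffmpeg's `subtitles` video filter, e.g. `-vf subtitles=sugars.srt`, provided your ffmpeg build includes libass.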

Idea 2: clip out the audio and post it as a podcast

Less work to do here, but the idea is to use ffmpeg to get the audio again, and then post it on a podcast-type of website. Then people can listen to it more passively, rather than watching the whole video. Hmm… who listens to podcasts these days? I do.
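For a whole back catalogue of talks, the audio extraction is easy to batch. A rough sketch, assuming mp3 output via the libmp3lame encoder and a VBR quality of 2 (both my choices, not something I benchmarked); `podcast_commands` is a made-up helper that only builds the commands.

```python
from pathlib import Path
from typing import Iterable, List


def podcast_commands(videos: Iterable[str]) -> List[List[str]]:
    """Build one ffmpeg audio-extraction command per video file."""
    commands = []
    for video in videos:
        audio = str(Path(video).with_suffix(".mp3"))
        commands.append([
            "ffmpeg", "-i", video,
            "-vn",                   # drop the video stream
            "-c:a", "libmp3lame",    # assumption: encode as mp3
            "-q:a", "2",             # assumption: high-quality VBR
            audio,
        ])
    return commands
```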

This one is pretty good: Castopod (ad-aures/castopod on GitHub), an open-source hosting platform made for podcasters who want to engage and interact with their audience. The GitHub repo is a synchronized read-only mirror of https://code.castopod.org/adaures/castopod

Misc other ideas

  • better website with a link to directly sign up for the mailing list, sort past content by tags, and make it searchable. I mean, if you’re going to keep doing this for years, you’ve got to have a robust way to store everything!
  • networking corner? e.g. if the speaker is a startup-y guy, have a startup networking corner. If it’s more like bio, have a bio networking corner.

One last thing you can do with ffmpeg

Make a slideshow/gif!

ffmpeg -f concat -safe 0 -i list.txt -r 30 -c:v libx264 -pix_fmt yuv420p output.mp4

You can define a file list.txt that basically is like this:

file 'photo1.jpg'
duration 0.3
file 'photo2.jpg'
duration 0.3
file 'photo3.jpg'
duration 0.3

And it’ll make your video for you.
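Writing list.txt by hand gets old fast with more than a few photos, so here is a sketch that generates it. `concat_list` is a made-up name, and it assumes file names without single quotes in them. It also repeats the last image once without a duration, which is the commonly recommended workaround for the concat demuxer otherwise cutting the final image short.

```python
from typing import Iterable


def concat_list(images: Iterable[str], seconds_per_image: float = 0.3) -> str:
    """Build the contents of list.txt for ffmpeg's concat demuxer."""
    images = list(images)
    lines = []
    for image in images:
        lines.append(f"file '{image}'")
        lines.append(f"duration {seconds_per_image:g}")
    if images:
        # Repeat the last image without a duration so it is shown fully.
        lines.append(f"file '{images[-1]}'")
    return "\n".join(lines) + "\n"
```

Something like `Path("list.txt").write_text(concat_list(sorted(str(p) for p in Path(".").glob("*.jpg"))))` would then produce the file for every JPEG in the current folder.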

─── 🖊⋅🖊⋅🖊 ───