AI for Learning

Summarize YouTube Videos with AI

Using Whisper and GPT4 to transcribe and summarize videos

Benedict Neo
bitgrit Data Science Publication
6 min readApr 4, 2024

--

Photo by Javier Miranda on Unsplash

I love watching YouTube videos to learn just about anything.

For educational and informational videos, I like to extract key points and details I want to remember later.

Since this is a perfect use case for LLMs, I spent a few hours digging through the web to figure out a method and started coding it.

In this article, we’ll explore how to use Python to download the audio from YouTube videos, transcribe them with Whisper API, and then use GPT 4 to summarize them.

You can find the code in this Deepnote Notebook or GitHub repo.

Download YouTube

what is yt-dlp?

yt-dlp is a command-line program to download videos from YouTube. It’s a fork of youtube-dlc, which is a fork of youtube-dl, enriched with additional features and fixes.

downloading audio

Here, we download the worst-quality audio from YouTube to minimize bandwidth and storage.

We fetch the audio in MP3 format.

We set the preferred quality to "96", representing a specific bitrate quality in kilobits per second (kbps) we target for the extracted audio file.

In audio files, bitrate directly influences the audio quality and file size; a higher bitrate generally means better audio quality but results in a larger file.

So here we’re for a moderate quality that balances sound quality and file size.

We also include a progress_hook to return the filename after the download is done.

Converting to mono audio

To further reduce the file size, we convert it to mono — combining multiple audio channels into one, simplifying the audio data without significantly compromising the speech in the audio tracks.

The “FFmpeg bitrate” refers to the target bitrate for audio compression when converting the audio to mono using FFmpeg commands.

The “32k” setting specifies that the audio should be compressed to a bitrate of 32 kilobits per second. This is a relatively low bitrate, reflecting a significant compression that reduces file size at the expense of some loss in audio quality.

Whisper Transcription

OpenAI’s Whisper model offers robust capabilities for audio transcription, transforming speech into text with remarkable accuracy.

We use OpenAI’s client to submit an audio file to the Whisper model.

We read in the mono file that is created in our convert_audio_to_mono function

Summarize

Now for the star of the show, the prompts.

System prompt

Your task is to deliver an in-depth analysis of video transcripts, offering nuanced insights that are easily digestible. Focus on a detailed exploration of the content, with particular emphasis on explaining terminology and providing a thorough section-by-section summary.

User Prompt

I like reading Paul Graham's essays, so I made GPT-4 write in his style. I also prefer to have terminologies and practical takeaways, so I included all these elements in the essay. I also added an ELI5 section that is useful when learning anything new.

Your task is to provide an in-depth analysis of a provided video transcript, structured to both inform and engage readers. Your narrative should unfold with clarity and insight, reflecting the style of a Paul Graham essay. Follow these major headings for organization:

# Intro

Begin with a narrative introduction that captivates the reader, setting the stage for an engaging exploration of bilingualism. Start with an anecdote or a surprising fact to draw in the reader, then succinctly summarize the main themes and objectives of the video.

# ELI5

Immediately follow with an ELI5 (Explain Like I’m 5) section. Use simple language and analogies to make complex ideas accessible and engaging, ensuring clarity and simplicity.

# Terminologies

- List and define key terminologies mentioned in the video in bullet points. Provide comprehensive yet understandable definitions for someone not familiar with the subject matter. Ensure this section naturally transitions from the ELI5, enriching the reader’s understanding without overwhelming them.

# Summary

Your summary should unfold as a detailed and engaging narrative essay, deeply exploring the content of the video. This section is the core of your analysis and should be both informative and thought-provoking. When crafting your summary, delve deeply into the video’s main themes. Provide a comprehensive analysis of each theme, backed by examples from the video and relevant research in the field. This section should read as a compelling essay, rich in detail and analysis, that not only informs the reader but also stimulates a deeper consideration of the topic’s nuances and complexities. Strive for a narrative that is as enriching and engaging as it is enlightening. Please include headings and subheadings to organize your analysis effectively if needed. It should be as detailed and comprehensive as possible.

# Takeaways

- End with actionable takeaways in bullet points, offering practical advice or steps based on the video content. These should relate directly to the insights discussed in your essay and highlight their real-world relevance and impact.

\n\n\nText: {}:

OpenAI call

With these prompts, we write a simple function to call the chat completion API.

Now, with all these elements, we can write our main script.

Main script

Here, the user input is processed, and the orchestration of functions unfolds.

The argparse module lets us work with user input from the CLI.

We save both the transcript and summary files for reference

And clean up the audio files after it’s all done.

Example

Here’s a run-through of this exciting video on Bilingualism.

We run using the command python main.py 'URL'

Here, we see it downloading the video and extracting the audio.

Then, Whisper streams that audio and transcribe it.

Here’s the transcript in all its glory.

Let’s take a look at the summary.

That’s it for this article!

A significant limitation of this code is that the maximum file size it can work with is 25 MB. Based on a calculation by ChatGPT it says the max is a video that’s 1 hour 46 minutes long.

I haven’t tested whether this is true, but a 2-hour video broke the Whisper transcription.

Let me know if you can figure out a workaround for longer videos or if you have ideas on better prompts or free, open-source alternatives to summarizeing YouTube videos in the comments below!

Thanks for reading!

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube

--

--