
YouTube automatic captions to .srt subtitle format


If you know how to download a video from YouTube, you may also want to download its automatic captions (in English) as subtitles. Automatic captions are not the same as “closed captions”. Closed captions can be downloaded with a userscript such as Download YouTube Captions, which saves them in the .srt subtitle format.

Automatic captions, however, are different. They are created by YouTube using speech recognition, so they are not very accurate. Still, I personally find them somewhat useful, so I have done some scripting to solve the problem semi-manually. It is semi-manual because the subtitle text has to be prepared by hand. I did not spend the time to write a userscript for that part.

To convert the automatic captions into .srt format:

  1. Go to the YouTube page of the video you are interested in.
  2. Click the “Transcript” icon, which is beside “About”, “Share”, and “Add to”. This shows a frame labelled English (automatic captions). We are going to copy all the text in this frame.
  3. Open the web browser’s Inspector by right-clicking inside the frame. Select the parent HTML element of the captions, right-click it, choose “Copy Inner HTML”, and paste the result into a plain text file saved as an HTML file.
  4. Open that HTML file in the web browser. Each caption now takes two lines: the time, then the subtitle text. Copy this text into another plain text file.
  5. Finally, use the Perl script below to convert the plain text file.
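#!/usr/bin/perl
# Convert automatic captions (English) copied from YouTube into the .srt format.
# Input: a plain text file in which time lines (m:ss) alternate with caption lines.
use strict;
use warnings;

my $file = $ARGV[0] or die "Usage: $0 captions.txt\n";

my (@time, @subtitles);

open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (my $line = <$fh>) {
    $line =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
    if ($line =~ /^(\d+:\d+)/) {    # Updated (thanks to Daniel)
        push @time, $1;
    }
    elsif (length $line) {
        push @subtitles, $line;
    }
}
close($fh);

die "No captions found in $file\n" unless @subtitles;

# Every caption except the last ends when the next one starts.
for (my $i = 0; $i < @subtitles - 1; $i++) {
    printf("%d\n", $i + 1);         # .srt cue number
    print "00:$time[$i],000 --> 00:$time[$i+1],000\n";
    print $subtitles[$i], "\n\n";
}

# The last caption has no successor, so show it for 5 seconds,
# carrying any overflow past 59 seconds into the minutes.
my $last = $#subtitles;
my $next = $time[$last];
if ($next =~ /^(\d+):(\d+)/) {
    my $total = $1 * 60 + $2 + 5;
    $next = sprintf("%d:%02d", int($total / 60), $total % 60);
}

printf("%d\n", $last + 1);
print "00:$time[$last],000 --> 00:$next,000\n";
print $subtitles[$last], "\n";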
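
The script prefixes every time with 00:, which assumes the video is shorter than an hour and leaves single-digit minutes unpadded (for example 00:1:05,000). Some players may be strict about this. The following is only a sketch of how the times could be normalised. The helper name srt_time is hypothetical and not part of the original script, and it assumes the transcript times look like m:ss or h:mm:ss.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a transcript time such as "1:05" or "1:02:03"
# into an .srt timestamp, with an optional offset in seconds.
sub srt_time {
    my ($stamp, $offset) = @_;
    $offset //= 0;
    my @parts = split /:/, $stamp;
    unshift @parts, 0 while @parts < 3;    # pad to hours, minutes, seconds
    my $total = $parts[0] * 3600 + $parts[1] * 60 + $parts[2] + $offset;
    return sprintf("%02d:%02d:%02d,000",
        int($total / 3600), int($total / 60) % 60, $total % 60);
}

print srt_time("1:05"), "\n";       # 00:01:05,000
print srt_time("9:58", 5), "\n";    # 00:10:03,000
print srt_time("1:02:03"), "\n";    # 01:02:03,000
```

Working in total seconds also means the 5 seconds added to the last caption carry over into the minutes correctly.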

Steps 1-4 are done manually. It would be possible to automate them with JavaScript (a userscript), but that would be too time-consuming for me.
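
Short of a userscript, step 4 at least could probably be done in Perl as well. The sketch below is an assumption-laden illustration rather than a tested tool: html_to_lines is a hypothetical helper, it knows nothing about YouTube's actual markup, and it simply turns every tag into a line break and keeps the non-empty text that remains. A caption containing inline tags would be split across several lines, so the result should still be checked by eye.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce copied inner HTML to its visible text,
# one text run per line. It makes no assumption about YouTube's markup.
sub html_to_lines {
    my ($html) = @_;
    $html =~ s/<[^>]*>/\n/g;    # every tag becomes a line break
    # Decode a handful of common entities in a single pass.
    my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', '#39' => "'");
    $html =~ s/&(amp|lt|gt|quot|#39);/$entity{$1}/g;
    my @lines;
    for my $line (split /\n/, $html) {
        $line =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
        push @lines, $line if length $line;    # drop empty lines
    }
    return @lines;
}

# Made-up markup, for illustration only.
my $sample = '<div><div class="t">0:01</div><div class="c">hello &amp; welcome</div></div>';
print "$_\n" for html_to_lines($sample);
```

The printed lines alternate between a time and a caption, which is the layout the conversion script expects as input.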

About Allen Choong

A cognitive science student, a programmer, a philosopher, a Catholic.

3 responses »

  1. Thanks a lot man! I used your script in my linux distribution to obtain a srt file. I changed your code a little bit, like this:
    …………………………….
    close(FILE);
    open(FILE,">>subtitle.srt");
    for(my $i=0;$i<@subtitles-1;$i++) {
    print FILE "00:$time[$i],000 --> 00:$time[$i+1],000\n";
    print FILE $subtitles[$i],"\n\n";
    }

    my $next = $time[@subtitles-1];
    if($next =~ /((\d+):(\d+))/) {
    my $temp = $3+5;
    $next = "$2:$temp";
    }

    print FILE "00:$time[@subtitles-1],000 --> 00:$next,000\n";
    print FILE $subtitles[@subtitles-1],"\n";
    close(FILE);

    Once again, thanks!

    • Thanks for sharing too. Feel free to modify the script.

      • There is one more thing that can be done to make your script more reliable.
        I think this line should be changed.

        Instead of: if($line =~ /(\d+:\d+)/) we should have:
        if($line =~ /^(\d+:\d+)/)
        Because otherwise the variable will capture every timing, even one that appears inside a subtitle, which can happen🙂. This way we make sure it takes only the timing from the beginning of the line, which also makes timing problems caused by a subtitle much easier to debug.
        Personally I am a Java programmer, but a Perl script is easier than Java, so…
        Thanks again for your code!
