VobSubs are the subtitles found on DVDs. They are stored as pictures that are overlaid on the video image, not as text.

Extract the VobSubs using mencoder. This creates dvd.idx and dvd.sub:

mencoder dvd://1 -ovc copy -oac copy -vobsubout dvd -vobsuboutindex 0 -sid 0 -o /dev/null
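A DVD often carries more than one subtitle track. As a rough sketch, the function below repeats the command above once per subtitle ID, giving each track its own index; mencoder should append each pass to the same .idx/.sub pair, so remove stale output files first. The names `run` and `extract_vobsubs` and the `DRY_RUN` variable are made up for this example, and by default the commands are only printed so they can be checked before anything runs.

```shell
# Print the command when DRY_RUN is 1 (the default here); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: one mencoder pass per subtitle ID.
# Usage: extract_vobsubs TITLE OUTPUT_BASENAME SID...
extract_vobsubs() {
    title=$1
    base=$2
    shift 2
    index=0
    for sid in "$@"; do
        # Subtitle stream $sid is stored as index $index in $base.idx/$base.sub.
        run mencoder "dvd://$title" -ovc copy -oac copy \
            -vobsubout "$base" -vobsuboutindex "$index" -sid "$sid" \
            -o /dev/null
        index=$((index + 1))
    done
}
```

For example, `extract_vobsubs 1 dvd 0 1` prints two mencoder commands; run it with `DRY_RUN=0` to execute them.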

Merge them all into a Matroska file:

mkvmerge -o dvd.mkv dvd.mp4 dvd.idx dvd.sub
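The merge depends on both files from the extraction step, so it can be worth confirming they exist and are non-empty first. A minimal sketch; the function name `have_vobsub_pair` is an assumption made for this example.

```shell
# Hypothetical check: succeed only if BASE.idx and BASE.sub both exist
# and are non-empty.
have_vobsub_pair() {
    [ -s "$1.idx" ] && [ -s "$1.sub" ]
}

# Example use (commented out so the snippet has no side effects):
# have_vobsub_pair dvd && mkvmerge -o dvd.mkv dvd.mp4 dvd.idx dvd.sub
```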

Detecting VobSubs

Note that ffprobe and avprobe from ffmpeg and libav 0.8.* report both the VobSubs (Stream #0.0) and the closed captioning (carried inside the MPEG-2 video in Stream #0.1), but later versions of avprobe (libav) do not.

$ ffprobe dvd_track_02.vob
ffprobe version 3.3.3 Copyright (c) 2007-2017 the FFmpeg developers
  built with gcc 4.9.4 (Gentoo 4.9.4 p1.0, pie-0.6.4)
  configuration: --prefix=/usr/local/ffmpeg
  libavutil      55. 58.100 / 55. 58.100
  libavcodec     57. 89.100 / 57. 89.100
  libavformat    57. 71.100 / 57. 71.100
  libavdevice    57.  6.100 / 57.  6.100
  libavfilter     6. 82.100 /  6. 82.100
  libswscale      4.  6.100 /  4.  6.100
  libswresample   2.  7.100 /  2.  7.100
Input #0, mpeg, from 'dvd_track_02.vob':
  Duration: 00:06:29.73, start: 441.272633, bitrate: 5630 kb/s
    Stream #0:0[0x1bf]: Data: dvd_nav_packet
    Stream #0:1[0x1e0]: Video: mpeg2video (Main), yuv420p(tv, smpte170m, bottom first), 720x480 [SAR 8:9 DAR 4:3], Closed Captions, 29.97 fps, 59.94 tbr, 90k tbn, 59.94 tbc
    Stream #0:2[0x80]: Audio: ac3, 48000 Hz, mono, fltp, 192 kb/s
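
To check a probe report without reading it by eye, it can be filtered for the two things of interest. This is a sketch under assumptions: it expects VobSub streams to be labelled `Subtitle: dvd_subtitle` (or `dvdsub` in older builds) and closed captions to appear as the `Closed Captions` flag on the video stream, as in the output above. The function name `classify_probe` is hypothetical.

```shell
# Hypothetical filter: read ffprobe/avprobe text output on stdin and
# print which kinds of subtitles it mentions. Fails if it finds neither.
classify_probe() {
    report=$(cat)
    found=none
    # Assumed VobSub label: "Subtitle: dvd_subtitle" or "Subtitle: dvdsub".
    if printf '%s\n' "$report" | grep -Eq 'Subtitle: dvd_?sub'; then
        echo vobsub
        found=some
    fi
    if printf '%s\n' "$report" | grep -q 'Closed Captions'; then
        echo closed-captions
        found=some
    fi
    [ "$found" = some ]
}

# ffprobe writes its report to stderr, hence the 2>&1:
# ffprobe dvd_track_02.vob 2>&1 | classify_probe
```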

archives: VobSub notes

Converting VobSubs is really hard.

First, extract them using transcode and subtitle2pgm (subtitleripper package):

tcextract -x ps1 -t vob -a 0x20 -i ../DC_Reader.vob | subtitle2pgm -o english -c 

The -c option sets the grey levels used in the output images. Experiment with its values until the text stands out cleanly from the background, which makes OCR easier. See http://www.bunkus.org/dvdripping4linux/en/separate/subtitles.html#subtitles for more details.
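
One way to experiment is to generate one extraction pipeline per candidate setting, each writing to its own output prefix, and then compare the resulting images. The sketch below only prints the pipelines. It assumes that -c takes four comma-separated grey levels; the function name `color_trials`, the `trialN` prefixes, and the example values are all made up for illustration.

```shell
# Hypothetical helper: print one extraction pipeline per candidate
# grey-level setting so the resulting images can be compared.
# Usage: color_trials VOB_FILE LEVELS...
color_trials() {
    vob=$1
    shift
    n=1
    for levels in "$@"; do
        # Each trial writes to its own prefix (trial1, trial2, ...).
        echo "tcextract -x ps1 -t vob -a 0x20 -i $vob | subtitle2pgm -o trial$n -c $levels"
        n=$((n + 1))
    done
}

# Example values only; the right levels depend on the disc:
# color_trials ../DC_Reader.vob 255,255,0,255 255,0,255,255
```

Pipe the output to `sh` to run the trials once the printed commands look right.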

I *vaguely* recall having issues with newer (>0.45) versions of gocr, but it may simply be that they didn't do any better.

Run pgm2txt to OCR the image files:

pgm2txt english

If you got the color conversion right, it should recognize most of the characters on its own.
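
After the OCR pass, it can help to see which images produced no text at all. A small sketch, assuming pgm2txt writes its result next to each image as `<image>.txt` (for example english0001.pgm.txt); that naming and the function name `missing_ocr` are assumptions, so check them against your own output.

```shell
# Hypothetical check: list PGM images under PREFIX that have no
# non-empty "<image>.txt" file next to them (assumed pgm2txt naming).
missing_ocr() {
    for img in "$1"*.pgm; do
        [ -e "$img" ] || continue      # glob matched nothing
        [ -s "$img.txt" ] || echo "$img"
    done
}

# missing_ocr english
```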