The History and Cause of the DVD Audio Delay Myth

There is no Audio Delay!

Unless the DVD has an authoring problem, the video and audio will be perfectly in sync. So why do so many programs report a delay? Simply put, they do not demultiplex the streams properly. Proper demultiplexing should yield the same elementary streams that were used to author the DVD, with no delay, provided the demultiplexing is done on the original asset boundaries. There are two culprits.

DVD2AVI, MPEG2DEC3, and Closed GOPs

GOP stands for "Group Of Pictures", a basic encoding unit. GOPs come in two types, open and closed, but more about that later. The pictures contained within a GOP come in three types. The I picture is self-contained and needs no other information to be decoded. This is why I pictures are also known as key frames, although that is an AVI term, and they really are pictures, not frames. Another type is the P picture, which uses the previously decoded reference picture as a starting point and modifies it. Both I and P pictures become reference pictures; that is, each P picture modifies the picture produced by the last I or P picture data. Then there is the B picture, which references the two previous reference pictures. Note that by "previous" I mean previous in the encoded stream. The pictures will be displayed in a different order, and the two reference pictures used by a B picture will be temporally before and after the B picture when played back.
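The difference between encoded order and display order can be sketched in a few lines of Python. This is only an illustration, not code from any demultiplexer: the function name is made up, and the picture labels carry their display position so the result is easy to check. The rule it assumes is that a B picture is shown as soon as it is decoded, while a reference picture (I or P) is held back until the next reference picture arrives.

```python
def display_order(coded):
    """Reorder pictures from encoded (stream) order to display order.

    Each picture is a label such as "I0", "P3" or "B1"; the first
    character is the picture type. Hypothetical helper for illustration.
    """
    out = []
    held = None  # the most recent reference picture, not yet displayed
    for pic in coded:
        if pic.startswith("B"):
            # B pictures are displayed immediately after decoding
            out.append(pic)
        else:
            # a new reference picture releases the one being held
            if held is not None:
                out.append(held)
            held = pic
    if held is not None:
        out.append(held)
    return out


# Encoded order I P B B P B B becomes display order I B B P B B P
print(display_order(["I0", "P3", "B1", "B2", "P6", "B4", "B5"]))
```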

Now back to GOP types. GOPs always begin with an I picture, but it is not always the first picture displayed. In an open GOP the first B pictures can use the last reference picture of the previous GOP. They don't have to, but if they do, making an edit at that point will break the decoding of those B pictures. These GOPs have a picture order of IBBPBB... Open GOPs can also be self-contained, with a picture order of IPBBPBB. Here the I and first P are the reference pictures for the B pictures.

A closed GOP has the picture order of IBBPBB, just like the first type of open GOP, except that the encoder has not used the previous GOP's last reference picture. The GOP can be decoded all by itself, and editing will not break the decoding.
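To make the distinction concrete, here is a small Python sketch of which pictures in a single GOP can be decoded. It is a simplification under stated assumptions: the function and parameter names are hypothetical, pictures are given in encoded order, and an open GOP's leading B pictures are treated as always depending on the previous GOP (as noted above, they don't have to).

```python
def decodable_pictures(coded, closed_gop, have_previous_reference):
    """Return the pictures of one GOP (encoded order) that can be decoded.

    coded: list of type letters, e.g. ["I", "B", "B", "P", "B", "B"]
    closed_gop: True if the encoder did not use the previous GOP
    have_previous_reference: True if the previous GOP's last reference
        picture is available (False at the start of a stream or an edit)
    Hypothetical helper for illustration only.
    """
    result = []
    refs_seen = 0  # reference pictures decoded so far in this GOP
    for pic in coded:
        if pic == "B":
            # B pictures after two references of this GOP are always fine;
            # leading B pictures need a closed GOP or the previous GOP
            if refs_seen >= 2 or closed_gop or have_previous_reference:
                result.append(pic)
        else:
            refs_seen += 1
            result.append(pic)
    return result


gop = ["I", "B", "B", "P", "B", "B"]
# Closed GOP at the start of a stream: everything decodes
print(decodable_pictures(gop, closed_gop=True, have_previous_reference=False))
# Open GOP at the start of a stream: the two leading B pictures are lost
print(decodable_pictures(gop, closed_gop=False, have_previous_reference=False))
```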

Enter DVD2AVI and MPEG2DEC3. Not fully understanding closed GOPs, the author thought that a picture order of IBBPBB required a previous GOP to decode properly. So if the first GOP was a closed GOP, MPEG2DEC3 would discard the B pictures, acting as if they could not be decoded properly. This led to a negative delay (audio starts before video), usually of 67ms for NTSC or 80ms for PAL (2 frames).
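The 67ms and 80ms figures follow directly from the frame durations. A minimal Python check, assuming the 90 kHz timestamp clock, 3003 ticks per NTSC frame and 3600 ticks per PAL frame (90000/25); the function name is made up for illustration:

```python
TICKS_PER_MS = 90  # 90 kHz clock: 90 ticks per millisecond

# Frame duration in 90 kHz clock ticks (assumed values, see lead-in)
FRAME_TICKS = {"NTSC": 3003, "PAL": 3600}


def dropped_picture_delay_ms(dropped, system):
    """Reported delay when `dropped` leading pictures are discarded.

    The result is negative because the audio now starts before the video.
    """
    return -dropped * FRAME_TICKS[system] / TICKS_PER_MS


print(round(dropped_picture_delay_ms(2, "NTSC"), 1))  # about -66.7
print(round(dropped_picture_delay_ms(2, "PAL"), 1))   # -80.0
```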

Donald Graft has corrected this problem in his own versions of DVD2AVI and MPEG2DEC3, called DGIndex and DGDecode.

Demultiplexing Seamless Joints

This problem always shows up at vobu boundaries and cell boundaries, as they are always seamless. The problem can also appear at vob boundaries if they are seamlessly multiplexed. The problem here is that audio and video have different buffer requirements in the players, with video being allowed a greater delay before presentation than audio. I know, there's that word "delay". In any streaming digital media there is a delay for decoding information before it is presented (shown or heard). Each stream can have a different delay tailored to the needs of the encoding method. In the end the streams get synchronized again by a timestamp called "PTS" (Presentation Time Stamp). The time window for delivering data in each vobu is determined by the video data; audio and subpicture data are then added based on the requirements of each. Since audio has a shorter delay, some of the audio for each vobu ends up in the next vobu.

An example would probably help a lot here. Let's say we are dealing with an NTSC DVD, the authoring program has chosen the typical video delay value of 25257 clock ticks, and the audio is AC3 at 1536 bytes per frame. The clock we are referring to is the 90 kHz clock used by all timestamps in MPEG. The first vobu contains 12 frames, each having a duration of 3003 clock ticks, for a total of 36036 clock ticks. Since the delivery of data begins at time 0, the video will begin at time 25257 and end at 25257+36036 = 61293. But because of buffer constraints, the audio multiplexed into this vobu, which is gold in the graphic above, will end at time 45417. The rest of the audio, colored yellow in the graphic, must be delivered later, so it is in the next vobu, which begins delivery at time 36036. If we demultiplex without considering this factor, we end up missing 61293-45417 = 15876 clock ticks of audio (176.4 milliseconds). That is not such a big problem, but if we start demultiplexing at the second vobu without considering this factor, we have an extra 176ms of audio at the beginning that belonged to the previous vobu, hence the program will report a delay of -176ms.
Of course the values vary greatly, and can be especially high at seamless vob joints.
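For anyone who wants to try other values, the arithmetic of the example above can be reproduced in Python. The numbers are the ones from the example; the variable names are mine.

```python
CLOCK_HZ = 90_000        # 90 kHz clock used by the timestamps
VIDEO_DELAY = 25257      # video delay chosen by the authoring program
FRAME_TICKS = 3003       # duration of one NTSC frame in clock ticks
FRAMES_IN_VOBU = 12      # frames in the first vobu
AUDIO_END = 45417        # end time of the audio multiplexed into the vobu

vobu_duration = FRAMES_IN_VOBU * FRAME_TICKS   # 36036 ticks
video_end = VIDEO_DELAY + vobu_duration        # 61293 ticks
missing_ticks = video_end - AUDIO_END          # 15876 ticks
missing_ms = missing_ticks * 1000 / CLOCK_HZ   # about 176.4 ms

print(vobu_duration, video_end, missing_ticks, round(missing_ms, 1))
```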

Proper demultiplexing, using the PTS values and not vobu boundaries, reduces the delay to no more than the duration of one audio frame, which for AC3 is 32ms. Should this delay be fed back into an authoring program? That depends on what is being done with it. First of all, the only time a delay can be applied during authoring is at the start of a non-seamless vob. Use the delay if the audio and video are being used as a clip, otherwise ignore the delay.
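As a rough sketch of what "using the PTS values" means, the Python below drops every audio frame that finishes before the first video picture is presented and reports what is left over. It is an assumption-laden illustration, not the method of any particular demultiplexer: the function name is hypothetical, the audio frames are assumed to be contiguous, and an AC3 frame is taken as 32ms, which is 2880 ticks of the 90 kHz clock.

```python
AUDIO_FRAME_TICKS = 2880  # one 32 ms AC3 frame in 90 kHz clock ticks


def trim_audio_to_video(audio_pts, video_start_pts):
    """Keep only audio frames still playing at or after the video start.

    audio_pts: PTS of each audio frame, in stream order (90 kHz ticks)
    video_start_pts: PTS of the first video picture
    Returns (kept_pts, residual_delay_ms). Hypothetical helper.
    """
    kept = [pts for pts in audio_pts
            if pts + AUDIO_FRAME_TICKS > video_start_pts]
    if not kept:
        raise ValueError("no audio overlaps the video")
    # what remains is always shorter than one audio frame
    residual_ms = (kept[0] - video_start_pts) / 90
    return kept, residual_ms


# Audio frames every 2880 ticks starting at 20000, video starting at 25257
pts_list = [20000 + n * AUDIO_FRAME_TICKS for n in range(5)]
kept, delay = trim_audio_to_video(pts_list, 25257)
print(kept[0], round(delay, 1))
```

With contiguous audio the leftover delay falls between -32ms and 0, which matches the "no more than one audio frame" limit described above.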
Copyright © 2006 - 2024 MPUCoder, all rights reserved.