
Basop Encoder: Stereo DTX Front VAD Flag Mismatch

Basic info

I do not have the specific SHA for the initial analysis; the commits for both builds were from April 3rd. However, I have tested some of the items with these versions:

  • Float reference:
  • Fixed point:
    • Encoder (fixed): 1ccc3c58
    • Decoder (fixed):

Bug description

The front VAD flag decisions of the floating-point and fixed-point encoders were compared for low-bitrate (13.2 kbps) Stereo DTX using 71 test items. Most of the items contain different types of background noise (such as train station or office). Some test items contain both speech and background noise.

The percentage of frames with different front VAD flags (float vs fixed) for each test item is shown below:

DifferentFramesPercentage71.svg
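
The per-item percentage in the plot above can be computed along the following lines. This is only a minimal sketch: it assumes the front VAD flags of both encoders have already been dumped as one value per frame, and the function name and the example flags are hypothetical, not taken from the codec or from the test items.

```python
def mismatch_percentage(ref_flags, dut_flags):
    """Return the percentage of frames whose front VAD flags differ."""
    if len(ref_flags) != len(dut_flags):
        raise ValueError("flag sequences must cover the same number of frames")
    if not ref_flags:
        return 0.0
    differing = sum(1 for r, d in zip(ref_flags, dut_flags) if r != d)
    return 100.0 * differing / len(ref_flags)


# Hypothetical per-frame flags (1 = active, 0 = inactive) for one item.
ref = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]  # float reference
dut = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]  # fixed point
print(mismatch_percentage(ref, dut))  # one of ten frames differs
```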

The highest difference is 6.5%, for a car sound item without speech. I have plotted the difference between the left channels of the two outputs along with the front VAD difference:

4VWGolf90spclab
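
A left-channel difference signal like the one plotted above could be obtained as sketched here. The sketch assumes both decoder outputs are 16-bit stereo PCM WAV files; the helper names are hypothetical, and the demo uses tiny synthetic files rather than the actual test items.

```python
import array
import os
import tempfile
import wave


def read_left_channel(path):
    """Read the left channel of a 16-bit stereo PCM WAV file as a list of ints."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    return list(samples[0::2])  # samples are interleaved L, R, L, R, ...


def left_channel_difference(ref_path, dut_path):
    """Sample-wise difference (ref - dut) over the common length."""
    ref = read_left_channel(ref_path)
    dut = read_left_channel(dut_path)
    n = min(len(ref), len(dut))
    return [r - d for r, d in zip(ref[:n], dut[:n])]


def write_stereo(path, left, right, rate=32000):
    """Write a tiny 16-bit stereo WAV file (used only for the demo below)."""
    interleaved = array.array("h")
    for l, r in zip(left, right):
        interleaved.extend((l, r))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(interleaved.tobytes())


# Demo on synthetic data; in practice the inputs would be the two decoded files.
with tempfile.TemporaryDirectory() as tmp:
    ref_path = os.path.join(tmp, "ref.wav")
    dut_path = os.path.join(tmp, "dut.wav")
    write_stereo(ref_path, [100, 200, 300], [0, 0, 0])
    write_stereo(dut_path, [100, 150, 310], [5, 5, 5])
    diff = left_channel_difference(ref_path, dut_path)
print(diff)
```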

The differing front VAD decisions affect the whole signal, and the outputs sound very different. There is already an open issue about this specific item: #1410 (comment 69661)

For a speech+background item with 2.4% difference, the plots look like this:

SpeechTrain2spclab

The difference after 10 seconds causes the fixed-point version to suppress the train sound in the background.

Floating point:

RefSpTr2

Fixed:

DutSpTr2

For the same item, there is an energy increase in the higher frequencies around 17 seconds, where the front VAD decisions differ.
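
To relate mismatching frames to positions in time, such as the regions after 10 seconds and around 17 seconds mentioned above, consecutive differing frames can be grouped into time spans. This is a sketch under stated assumptions: flags are available as one value per frame, the frame length is 20 ms, and the function name and example flags are hypothetical.

```python
FRAME_MS = 20  # assumed frame length in milliseconds


def mismatch_segments(ref_flags, dut_flags, frame_ms=FRAME_MS):
    """Group consecutive differing frames into (start_ms, end_ms) spans."""
    segments = []
    start = None
    n = min(len(ref_flags), len(dut_flags))
    for i in range(n):
        differs = ref_flags[i] != dut_flags[i]
        if differs and start is None:
            start = i  # a new run of mismatching frames begins
        elif not differs and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:  # run extends to the end of the item
        segments.append((start * frame_ms, n * frame_ms))
    return segments


# Hypothetical flags: mismatches in frames 2-3 and 7-9.
ref = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
dut = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]
segments = mismatch_segments(ref, dut)
print(segments)
```

The resulting spans can then be overlaid on the waveform or spectrogram to check whether an audible difference coincides with a front VAD mismatch.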

Ways to reproduce

Box folder: ...\Box_EXTERNAL_IVAS_BASOP_VERIFICATION\issues\issue-1487

./IVAS_cod_ref.exe -dtx -stereo 13200 32 spTr2.wav bitref
./IVAS_dec_ref.exe stereo 32 bitref refSpTr2.wav

./IVAS_cod.exe -dtx -stereo 13200 32 spTr2.wav bit
./IVAS_dec_ref.exe stereo 32 bit dutSpTr2.wav

There are a few speech+background noise items where the front VAD difference results in distortion or spectral differences between speech segments.

The issue is labeled as medium because, for the test items mentioned here, the differences mainly occur in background sound segments rather than in speech segments.

Edited by Sumeyra Demir Kanik