Example of underperforming code from a complexity and precision point of view

In the encoder function Get_corr_n_fx in lib_enc/ivas_stereo_td_analysis.c, there is a loop that accumulates several correlations and energies. The current code is written like this:

    IF
    {
    ...
    }
    ELSE
    {
        guard_bits = find_guarded_bits_fx( len );
        FOR( i = 0; i < len; i++ )
        {
            mono_i = add( shr( L[i], Q1 ), shr( R[i], Q1 ) );                               // q_in
            corrL = L_add( corrL, L_shr( L_mult0( L[i], mono_i ), guard_bits ) );           // (q_in + q_in - guard_bits)
            corrR = L_add( corrR, L_shr( L_mult0( R[i], mono_i ), guard_bits ) );           // (q_in + q_in - guard_bits)
            ener = L_add( ener, L_shr( L_mult0( mono_i, mono_i ), guard_bits ) );           // (q_in + q_in - guard_bits)
            side_i = sub( shr( L[i], Q1 ), shr( R[i], Q1 ) );                               // q_in
            ener_side = L_add( ener_side, L_shr( L_mult0( side_i, side_i ), guard_bits ) ); // (q_in + q_in - guard_bits)
        } //->> 18 operations per iteration, with many truncations
    }

    mono_i = add( shr( L[i], Q1 ), shr( R[i], Q1 ) ); // q_in

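This line loses precision unnecessarily: each operand is right-shifted (and therefore truncated) before the addition. The accumulations on the following lines likewise shift every product right by guard_bits before adding it. They could instead use 64-bit accumulators, so that the right shift is applied only once, after the loop.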
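
To make the precision loss on the mono sample concrete, here is a minimal sketch in plain C. The two helper names are hypothetical and only imitate the basic operators with ordinary integer arithmetic. They are not the codec's implementation, and they assume that `>>` on a negative value is an arithmetic shift, as it is on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for add( shr( l, 1 ), shr( r, 1 ) ):
   each operand is truncated before the sum is formed. */
static int16_t mono_truncating( int16_t l, int16_t r )
{
    return (int16_t) ( ( l >> 1 ) + ( r >> 1 ) );
}

/* Stand-in for round_fx( L_mac( L_mult( l, 16384 ), r, 16384 ) ):
   the sum is formed at full precision and rounded once. */
static int16_t mono_rounded( int16_t l, int16_t r )
{
    int32_t acc = ( (int32_t) l + (int32_t) r ) * 32768; /* l * 2^15 + r * 2^15 */
    assert( acc <= INT32_MAX - 0x8000 );                 /* the rounding offset cannot overflow */
    return (int16_t) ( ( acc + 0x8000 ) >> 16 );
}
```

For L[i] = 1 and R[i] = 1, mono_truncating returns 0 while mono_rounded returns 1. For L[i] = 3 and R[i] = 5, the results are 3 and 4 respectively.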

Here is an example of how it could be done to reduce complexity and increase precision:

    IF
    {
    ...
    }
    ELSE
    {
        guard_bits = sub( 32, find_guarded_bits_fx( len ) ); // left shift that, combined with W_extract_h(), gives a net right shift by the guard bits
        FOR( i = 0; i < len; i++ )
        {
            mono_i = round_fx( L_mac( L_mult( L[i], 16384 ), R[i], 16384 ) ); // q_in
            WcorrL = W_mac0_16_16( WcorrL, L[i], mono_i );                    // (q_in + q_in)
            WcorrR = W_mac0_16_16( WcorrR, R[i], mono_i );                    // (q_in + q_in)
            Wener = W_mac0_16_16( Wener, mono_i, mono_i );                    // (q_in + q_in)
            side_i = round_fx( L_msu( L_mult( L[i], 16384 ), R[i], 16384 ) ); // q_in
            Wener_side = W_mac0_16_16( Wener_side, side_i, side_i );          // (q_in + q_in)
        } //->> 10 operations
        /* Scale back to the Q format of the original code: (q_in + q_in - find_guarded_bits_fx( len )) */
        corrL = W_extract_h( W_shl( WcorrL, guard_bits ) );
        corrR = W_extract_h( W_shl( WcorrR, guard_bits ) );
        ener = W_extract_h( W_shl( Wener, guard_bits ) );
        ener_side = W_extract_h( W_shl( Wener_side, guard_bits ) );
    }
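
The effect of moving the shift out of the loop can be illustrated with a small plain-C sketch. The function names are hypothetical, the basic operators are imitated with ordinary integer arithmetic, and the sketch assumes the scaled result fits in 32 bits so that no saturation is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Original scheme: every product is shifted right by guard_bits before it
   is added, so each term is truncated separately. */
static int32_t energy_shift_per_sample( const int16_t *x, int len, int guard_bits )
{
    int32_t ener = 0;
    assert( guard_bits >= 0 && guard_bits < 31 );
    for ( int i = 0; i < len; i++ )
    {
        ener += ( (int32_t) x[i] * x[i] ) >> guard_bits;
    }
    return ener;
}

/* Proposed scheme: accumulate exact products in 64 bits, then scale once.
   Shifting left by (32 - guard_bits) and keeping the upper 32 bits imitates
   W_extract_h( W_shl( acc, 32 - guard_bits ) ), a net right shift by guard_bits. */
static int32_t energy_shift_after_loop( const int16_t *x, int len, int guard_bits )
{
    int64_t acc = 0;
    assert( guard_bits >= 0 && guard_bits < 31 );
    for ( int i = 0; i < len; i++ )
    {
        acc += (int64_t) x[i] * x[i];
    }
    return (int32_t) ( ( acc << ( 32 - guard_bits ) ) >> 32 );
}
```

With four samples equal to 3 and guard_bits = 2, the per-sample version returns 8 (four times 9 >> 2), whereas the version that shifts after the loop returns 9, which is the exact value 36 / 4.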

Total complexity would be (10 * len + 8) operations instead of (18 * len). Given that len can be 960, that is 9608 ops instead of 17280 ops.
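
As a quick check of these figures, the two counts can be written as functions of len. This is only a sketch with hypothetical helper names; it counts one operation per basic operator call, as in the loop comments above.

```c
#include <assert.h>

/* Original loop: 18 basic operators per iteration. */
static long ops_original( long len )
{
    assert( len >= 0 );
    return 18 * len;
}

/* Proposed loop: 10 basic operators per iteration, plus 8 after the loop
   (one W_shl and one W_extract_h for each of the four accumulators). */
static long ops_proposed( long len )
{
    assert( len >= 0 );
    return 10 * len + 8;
}
```

For len = 960 this gives 17280 and 9608 operations, a saving of 7672 operations, or roughly 44%.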

Edited Jul 23, 2024 by vaillancour