Example of underperforming code from a complexity and precision point of view
In the encoder function Get_corr_n_fx in lib_enc/ivas_stereo_td_analysis.c, there is a loop that computes several correlations and energies. The current code is written like this:
IF
{
...
}
ELSE
{
guard_bits = find_guarded_bits_fx( len );
FOR( i = 0; i < len; i++ )
{
mono_i = add( shr( L[i], Q1 ), shr( R[i], Q1 ) ); // q_in
corrL = L_add( corrL, L_shr( L_mult0( L[i], mono_i ), guard_bits ) ); // (q_in + q_in - guard_bits)
corrR = L_add( corrR, L_shr( L_mult0( R[i], mono_i ), guard_bits ) ); // (q_in + q_in - guard_bits)
ener = L_add( ener, L_shr( L_mult0( mono_i, mono_i ), guard_bits ) ); // (q_in + q_in - guard_bits)
side_i = sub( shr( L[i], Q1 ), shr( R[i], Q1 ) ); // q_in
ener_side = L_add( ener_side, L_shr( L_mult0( side_i, side_i ), guard_bits ) ); // (q_in + q_in - guard_bits)
} // ->> 18 operations, with many truncations
}
mono_i = add( shr( L[i], Q1 ), shr( R[i], Q1 ) ); // q_in
This line loses precision unnecessarily by right-shifting each operand before the addition. The lines that follow it could use 64-bit accumulators, so that the right shift is done only once, after the loop.
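To illustrate the first point, here is a minimal plain-C sketch of the two ways of forming the half-sum. The function names are hypothetical and the basic operators are only modelled with ordinary integer arithmetic (no saturation, arithmetic right shift assumed for negative values), so this is an approximation of the behaviour rather than the codec's actual code.

```c
#include <stdint.h>

/* Hypothetical model of add( shr( l, 1 ), shr( r, 1 ) ):
   each operand drops its least significant bit before the sum. */
static int16_t half_sum_shift_first( int16_t l, int16_t r )
{
    return (int16_t) ( ( l >> 1 ) + ( r >> 1 ) );
}

/* Hypothetical model of round_fx( L_mac( L_mult( l, 16384 ), r, 16384 ) ):
   the full sum is formed in 32 bits and rounded once when halved. */
static int16_t half_sum_round_last( int16_t l, int16_t r )
{
    int32_t sum = (int32_t) l + (int32_t) r;
    return (int16_t) ( ( sum + 1 ) >> 1 );
}
```

For l = 3 and r = 5 the exact half-sum is 4; the first variant returns 3 because both operands are truncated, while the second returns 4.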
Here is an example of how it could be done to save complexity and increase precision:
IF
{
...
}
ELSE
{
guard_bits = sub( 32, find_guarded_bits_fx( len ) );
FOR( i = 0; i < len; i++ )
{
mono_i = round_fx(L_mac(L_mult(L[i], 16384), R[i], 16384) ); // q_in
WcorrL = W_mac0_16_16( WcorrL , L[i], mono_i ); // (q_in + q_in )
WcorrR = W_mac0_16_16( WcorrR , R[i], mono_i ); // (q_in + q_in )
Wener = W_mac0_16_16( Wener , mono_i, mono_i ); // (q_in + q_in )
side_i = round_fx(L_msu(L_mult( L[i], 16384 ), R[i], 16384 ) );
Wener_side = W_mac0_16_16( Wener_side , side_i, side_i); // (q_in + q_in )
} //->> 10 operations
/* Scale back to the Q format used by the original code */
corrL = W_extract_h(W_shl( WcorrL , guard_bits));
corrR = W_extract_h(W_shl( WcorrR , guard_bits));
ener = W_extract_h(W_shl( Wener , guard_bits));
ener_side = W_extract_h(W_shl( Wener_side , guard_bits));
}
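The precision gain from the wide accumulators can be sketched in plain C as follows. This is only a model under stated assumptions: the function names are hypothetical, int64_t stands in for the 64-bit accumulator, and only the energy term is modelled (its products are never negative, so the right shift is well defined).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the original loop: every product is shifted
   right by guard_bits before it is added to a 32-bit accumulator. */
static int32_t energy_shift_per_sample( const int16_t *x, int len, int guard_bits )
{
    int32_t acc = 0;
    assert( guard_bits >= 0 && guard_bits < 32 );
    for ( int i = 0; i < len; i++ )
    {
        int32_t prod = (int32_t) x[i] * x[i];
        acc += prod >> guard_bits; /* truncation on every sample */
    }
    return acc;
}

/* Hypothetical model of the proposal: products are accumulated at full
   precision in 64 bits and shifted only once, after the loop. */
static int32_t energy_shift_once( const int16_t *x, int len, int guard_bits )
{
    int64_t acc = 0;
    assert( guard_bits >= 0 && guard_bits < 32 );
    for ( int i = 0; i < len; i++ )
    {
        acc += (int64_t) x[i] * x[i];
    }
    return (int32_t) ( acc >> guard_bits ); /* single truncation */
}
```

With len = 960, guard_bits = 10 and every sample equal to 100, the per-sample version returns 8640 while the single-shift version returns 9375, which is the exact value of 9600000 / 1024. With every sample equal to 3, the per-sample version returns 0 because each product is truncated away before it is accumulated.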
Total complexity would be (10 * len + 8) instead of (18 * len). Given that len can be 960, this means 9608 ops instead of 17280 ops.
Edited by vaillancour