[non-BE][allow regression] Optimize ivas_dirac_dec_binaural_formulate_input_covariance_matrices
Closes #2157
Summary
The ivas_dirac_dec_binaural_formulate_input_covariance_matrices_fx function makes extensive use of pseudo-float operations (e.g. BASOP_Util_Add_Mant32Exp), which are computationally expensive and not strictly necessary here. These operations can be replaced with cheaper 64-bit low-level operations.
Complexity analysis
A good improvement: the maximum total drops from 138.376 WMOPS to 135.285 WMOPS (average from 128.456 to 126.295 WMOPS).
Before:
--- Complexity analysis [WMOPS] ---
|------ SELF ------| |--- CUMULATIVE ---|
routine calls min max avg min max avg
--------------- ------ ------ ------ ------ ------ ------ ------
ivas_jbm_dec_tc 1.00 1.891 1.915 1.915 26.345 38.997 29.507
ivas_spar_decode 1.00 1.010 1.041 1.019 2.544 2.684 2.628
ivas_spar_dec_MD 1.00 1.526 1.665 1.609 1.526 1.665 1.609
ivas_sce_dec 1.00 0.246 0.246 0.246 21.794 34.495 24.964
ivas_core_dec 1.00 3.152 11.390 7.909 21.548 34.249 24.718
acelp_core_dec 0.61 12.830 19.132 14.543 12.830 19.132 14.543
ivas_dec_prepare_renderer 1.00 7.068 8.724 7.173 7.068 8.724 7.173
ivas_dec_render 1.00 77.336 88.004 87.518 81.487 92.262 91.776
ivas_sba_prototype_renderer 4.00 3.922 4.258 4.258 3.922 4.258 4.258
stereo_tcx_core_dec 0.39 17.651 31.015 20.324 17.651 31.015 20.324
--------------- ------ ------ ------ ------
total 1000.00 119.955 138.376 128.456
After:
--- Complexity analysis [WMOPS] ---
|------ SELF ------| |--- CUMULATIVE ---|
routine calls min max avg min max avg
--------------- ------ ------ ------ ------ ------ ------ ------
ivas_jbm_dec_tc 1.00 1.891 1.915 1.915 26.345 38.997 29.507
ivas_spar_decode 1.00 1.010 1.041 1.019 2.544 2.684 2.628
ivas_spar_dec_MD 1.00 1.526 1.665 1.609 1.526 1.665 1.609
ivas_sce_dec 1.00 0.246 0.246 0.246 21.794 34.495 24.964
ivas_core_dec 1.00 3.152 11.390 7.909 21.548 34.249 24.718
acelp_core_dec 0.61 12.830 19.132 14.543 12.830 19.132 14.543
ivas_dec_prepare_renderer 1.00 7.068 8.724 7.173 7.068 8.724 7.173
ivas_dec_render 1.00 75.226 86.621 85.357 79.377 90.879 89.615
ivas_sba_prototype_renderer 4.00 3.922 4.258 4.258 3.922 4.258 4.258
stereo_tcx_core_dec 0.39 17.651 31.015 20.324 17.651 31.015 20.324
--------------- ------ ------ ------ ------
total 1000.00 117.845 135.285 126.295
Accuracy analysis
The optimised implementation does not normalise or truncate 64-bit integers and tries to be as precise as possible when performing summations and multiplications on them. However, unlike the current implementation, it does not operate on normalised values, which makes it slightly less accurate when processing "tiny" input values.
30/10/2024
Implemented a "unit test" which compares the computations of the current fixed-point implementation, the optimised fixed-point implementation, a single-precision floating-point implementation, and a double-precision floating-point implementation.
I have identified several reasons why the results of the optimised fixed-point implementation diverge slightly from those of the current fixed-point implementation. One important cause is that the current implementation operates on normalised values (because of the pseudo-float operations) and is consequently more accurate when processing tiny quantities. The optimised implementation does not normalise any value and performs all operations on 64-bit integers, making it more accurate when processing large quantities. In the specific test I am running, the inputs to the function do not use the full 32-bit range of values, which makes the current implementation slightly more accurate than the optimised one when computing IIReneLimiter. The optimised implementation is always more accurate than the current one when computing SubFrameTotalEne.
There is a solution to make the optimised implementation more accurate for small values, but it would increase computation somewhat (hard to quantify by how much) and memory usage. The idea would be to implement a small int96_t/int128_t library. With it, ivas_dirac_dec_binaural_formulate_input_covariance_matrices_fx would be as accurate as the double-precision floating-point implementation.
Another solution would be to keep the optimised implementation as it is and ensure the inputs to the function use the full range of values (i.e. use a better Q format). If the inputs do not occupy the full range of the fixed-point format, the unused or "wasted" leading bits reduce the effective precision of the representation and lead to poor results, as is happening in this case.