[Complexity] Optimize v_add_inc_fx()

Basic info

This is a sub-task of issue #1009 (closed).

Bug description

The function v_add_inc_fx() is essentially only used for interleaved input (x_inc == 2, x2_inc == 2, y_inc == 1, &x1[1] == &x2[0]). In this case, the array indices do not need to be computed with BASOP operators: they are simple arithmetic that needs no instrumentation. The proposal is to treat this as a special case inside the function in order to save complexity.

void v_add_inc_fx(
    const Word32 x1[],   /* i  : Input vector 1                       Qx*/
    const Word16 x_inc,  /* i  : Increment for input vector 1         Q0*/
    const Word32 x2[],   /* i  : Input vector 2                       Qx*/
    const Word16 x2_inc, /* i  : Increment for input vector 2         Q0*/
    Word32 y[],          /* o  : Output vector that contains vector 1 + vector 2  Qx*/
    const Word16 y_inc,  /* i  : increment for vector y[]              Q0*/
    const Word16 N       /* i  : Vector length                         Q0*/
)
{
#ifndef PATCH
    Word16 i;
    Word16 ix1 = 0;
    Word16 ix2 = 0;
    Word16 iy = 0;
#else
    Word16 i, ix1, ix2, iy;

    /* This function is currently only called with the interleaved input format, */
    /* so the following conditions are always true and the check is redundant.   */
    test();
    test();
    test();
    test();
    IF( ( sub( x_inc, 2 ) == 0 ) && ( sub( x2_inc, 2 ) == 0 ) && ( sub( y_inc, 1 ) == 0 ) && ( &x1[1] == &x2[0] ) )
    {
        /* Interleaved input case, linear output */
        FOR( i = 0; i < N; i++ )
        {
            y[i] = L_add( x1[2 * i + 0], x1[2 * i + 1] ); /*Qx*/
            move32();
        }
        return;
    }

    ix1 = 0;
    ix2 = 0;
    iy = 0;
#endif
    move16();
    move16();
    move16();
    FOR( i = 0; i < N; i++ )
    {
        y[iy] = L_add( x1[ix1], x2[ix2] ); /*Qx*/
        move32();
        ix1 = add( ix1, x_inc );  /*Q0*/
        ix2 = add( ix2, x2_inc ); /*Q0*/
        iy = add( iy, y_inc );    /*Q0*/
    }
    return;
}

Saves 3 cycles per FOR-loop iteration in the regular (interleaved) case, because the three add() index updates are no longer executed.
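As a sanity check of the proposal, the following minimal sketch compares the generic strided loop with the interleaved special case on a small vector. It is only an illustration under stated assumptions: it uses plain C types and a saturating helper (sat_add32) as a stand-in for L_add() instead of the real BASOP operators and counters, and the names add_inc_generic, add_inc_interleaved and paths_agree are hypothetical and do not exist in the codebase.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain-C stand-in for BASOP L_add(): 32-bit add with saturation. */
static int32_t sat_add32( int32_t a, int32_t b )
{
    int64_t s = (int64_t) a + (int64_t) b;
    if ( s > INT32_MAX )
    {
        return INT32_MAX;
    }
    if ( s < INT32_MIN )
    {
        return INT32_MIN;
    }
    return (int32_t) s;
}

/* Generic path: three running indices are updated on every iteration. */
static void add_inc_generic( const int32_t *x1, int x_inc, const int32_t *x2, int x2_inc,
                             int32_t *y, int y_inc, int n )
{
    int i, ix1 = 0, ix2 = 0, iy = 0;
    for ( i = 0; i < n; i++ )
    {
        y[iy] = sat_add32( x1[ix1], x2[ix2] );
        ix1 += x_inc;
        ix2 += x2_inc;
        iy += y_inc;
    }
}

/* Special case: interleaved input (x2 == x1 + 1, both strides 2), linear output. */
static void add_inc_interleaved( const int32_t *x, int32_t *y, int n )
{
    int i;
    for ( i = 0; i < n; i++ )
    {
        y[i] = sat_add32( x[2 * i + 0], x[2 * i + 1] );
    }
}

/* Returns 1 if both paths produce identical output for an interleaved test vector. */
static int paths_agree( void )
{
    const int32_t x[8] = { 1, 2, 30, 40, -5, 5, INT32_MAX, 1 };
    int32_t y_generic[4], y_fast[4];
    int i;

    add_inc_generic( x, 2, x + 1, 2, y_generic, 1, 4 );
    add_inc_interleaved( x, y_fast, 4 );

    for ( i = 0; i < 4; i++ )
    {
        if ( y_generic[i] != y_fast[i] )
        {
            return 0;
        }
    }
    return 1;
}
```

Calling paths_agree() from a test harness should return 1. The last input pair is chosen so that the saturating addition is exercised on both paths as well.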

Ways to reproduce


Edited Dec 13, 2024 by multrus