Merge branch '2181-optimize-matrixtransp1mul_fx' into 2182-move-scaling-operations-outside-matrix-mul-operations