iPhone VFP and memory performance

I have an array of floats as the output buffer and an array of shorts as the input. I need to add the values from the input buffer to the values in the output buffer. Using the VFP unit, the code looks as follows:

int temp[8];
while (numVectors--)
{   
  temp[0] = bin[0];
  temp[1] = bin[1];
  temp[2] = bin[2];
  temp[3] = bin[3];
  temp[4] = bin[4];
  temp[5] = bin[5];
  temp[6] = bin[6];
  temp[7] = bin[7];
  bin+=8;
                    
  ASM ("fldmias  %0, {s8-s15} \n\t"
       "fldmias  %2, {s16-s23} \n\t"
       "fsitos s16,s16 \n\t"
       "fsitos s17,s17 \n\t"
       "fsitos s18,s18 \n\t"
       "fsitos s19,s19 \n\t"
       "fsitos s20,s20 \n\t"
       "fsitos s21,s21 \n\t"
       "fsitos s22,s22 \n\t"
       "fsitos s23,s23 \n\t"
       "fadds s8, s8, s16 \n\t"
       "fstmias  %0!, {s8-s15} \n\t" 
       : "=r" (bout)
       : "0" (bout), "r" (temp)
       : (long reg list was here);
}

So the shorts are first converted to ints (a pair of ldrsh/str operations each), then loaded into a VFP vector, converted to floats, and added to the existing values in the output buffer, eight at a time. This works without problems and is fast.
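For comparison, here is a minimal plain-C sketch of what the loop above computes. The function name add_shorts_to_floats is hypothetical, and the code leaves instruction selection to the compiler rather than using VFP vector mode. It assumes both buffers hold numVectors * 8 elements. The assembly presumably relies on VFP short-vector mode (vector length 8) being enabled elsewhere, since a single fadds handles all eight values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference version: adds 16-bit input samples to a float
   output buffer, eight at a time, mirroring the loop structure above. */
void add_shorts_to_floats(float *bout, const short *bin, size_t numVectors)
{
    assert(bout != NULL && bin != NULL);
    while (numVectors--) {
        for (int i = 0; i < 8; ++i) {
            /* short -> float conversion; the compiler picks the instructions */
            bout[i] += (float)bin[i];
        }
        bout += 8;
        bin += 8;
    }
}
```

A version like this is handy as a correctness check for the hand-written assembly: run both on the same data and compare the output buffers.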

Then I tried preconverting the shorts to floats and using an array of floats as input, to get rid of the extra short->int->float conversion:

while (numVectors--)
{
  ASM ("fldmias  %0, {s8-s15} \n\t"
       "fldmias  %1!, {s16-s23} \n\t"
       "fadds s8, s8, s16 \n\t"
       "fstmias  %0!, {s8-s15} \n\t" 
       : "=r" (bout), "=r" (fbin)
       : "0" (bout), "1" (fbin)
       : (long reg list was here);
}

Imagine how surprised I was when measurements showed this code is actually much slower than the previous version. I'm not quite sure why, but I think it is because an array of floats takes twice as much memory as an array of shorts of the same length. It seems that, because of caching, reading a larger region of memory costs more than the extra copy operations and VFP conversions combined when they work over a smaller memory range.
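To put rough numbers on the memory argument, the sketch below counts the bytes streamed through the large buffers per sample for each variant. The function names are hypothetical. It assumes a 4-byte float and that the small temp array stays in cache, so temp is not counted.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes moved through the large buffers when the input is 16-bit:
   read int16 input + read float output + write float output. */
size_t traffic_short_input(size_t samples)
{
    return samples * (sizeof(int16_t) + 2 * sizeof(float));
}

/* Bytes moved through the large buffers when the input is float:
   read float input + read float output + write float output. */
size_t traffic_float_input(size_t samples)
{
    return samples * (sizeof(float) + 2 * sizeof(float));
}
```

Per vector of eight samples that is 80 bytes against 96 bytes: the input stream doubles and the total traffic grows by about 20%. Whether this alone explains the measured slowdown is not established here. It only shows the direction of the effect.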

You should never "optimize" anything without double-checking that the result is actually faster, even if it looks obvious.