In the case of DSP vs. ASM in your example, there's no difference because they are running exactly the same code - the DSP code is translated to ASM internally, and in this case it gives exactly the same instructions.
The primitive version is lighter because something similar happens with stream primitives: both the primitives and the connections between them are converted to ASM internally, and this sometimes lets FS apply optimisations which aren't possible inside ASM/DSP blocks.
The general principle is this. Instructions which use only the "xmm" registers are very fast, but anything which uses float variables, streamins, or streamouts has to go through memory, which is much slower. The DSP->ASM translator doesn't always optimise very well. It sometimes (but not in this case) reads and writes memory where it doesn't really need to, and we can often do better by hand, by keeping values inside "xmm" registers instead.
For example, when you look at the ASM output of a DSP code block, you sometimes see something like this...
Code:
movaps xmm0, variable
// Process xmm0
movaps variable, xmm0
movaps xmm0, variable
// Process xmm0
movaps variable, xmm0
In this case, the middle two "movaps" lines are not necessary. They store the value only to load it straight back into the same register, and the final "movaps" at the end is all we need to make sure that the result gets stored.
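As a rough sketch, the hand-optimised version could look something like the following. It uses the same shorthand as the example above, so "variable" and the "Process" comments are only placeholders rather than real code, and the "first stage"/"second stage" labels are just mine to tell the two steps apart.

```
movaps xmm0, variable    // load from memory once
// Process xmm0 (first stage)
// Process xmm0 (second stage) - the value never left xmm0
movaps variable, xmm0    // store the final result once
```

This is only safe if nothing else needs to read "variable" between the two processing stages. If some other part of the code relies on the intermediate value being in memory, the middle store has to stay.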