I have some examples showing how to implement Intel AVX and ARM NEON code, either in assembly language or with C/C++ intrinsics. Intrinsics have the advantage that the compiler can inline them at the call site, whereas a routine written in a separate assembly file has to be reached through a function call.
https://github.com/atribelli/matrix3d
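
To give a rough idea of what the intrinsics approach looks like, here is a minimal sketch. It is not taken from the repository above: the function name `add8` and the choice of an 8-float add are assumptions made purely for illustration. It picks AVX, NEON, or a plain scalar loop at compile time, depending on what the compiler targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>   // Intel AVX intrinsics
#elif defined(__ARM_NEON)
#include <arm_neon.h>    // ARM NEON intrinsics
#endif

// Hypothetical helper (not from the linked repo): out[i] = a[i] + b[i]
// for exactly 8 floats. Being static inline, the compiler is free to
// fold the vector instructions straight into the caller.
static inline void add8(const float *a, const float *b, float *out) {
    assert(a && b && out);
#if defined(__AVX__)
    // One 256-bit register holds all 8 floats; unaligned loads/stores
    // are used so the caller does not need 32-byte aligned arrays.
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
#elif defined(__ARM_NEON)
    // NEON registers are 128 bits wide, so 8 floats take two adds.
    vst1q_f32(out,     vaddq_f32(vld1q_f32(a),     vld1q_f32(b)));
    vst1q_f32(out + 4, vaddq_f32(vld1q_f32(a + 4), vld1q_f32(b + 4)));
#else
    // Scalar fallback when neither instruction set is enabled.
    for (size_t i = 0; i < 8; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

With GCC or Clang on x86 the AVX path is only compiled when AVX is enabled (for example with `-mavx`); on 64-bit ARM, NEON is normally available by default. Without either, the scalar loop is used, so the same source builds everywhere.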