I guess it depends on what kind of precision you need. Just a few Taylor iterations should give you a decent cos/sin approximation; I don't know how big a LUT that would correspond to, but "more than a few cache lines". A friend of mine says that a 4096-entry LUT was too small for his 3D engine: the motion was too jerky (and when dealing with 3D operations, the sin/cos calls aren't too costly compared to the matrix operations anyway).
A 4k-entry LUT of single-precision floats = 4096 x 4 bytes = 16 KB, or 256 x 64-byte cache lines (and still not good enough accuracy). Cache misses are pretty expensive. And especially if you're dealing with "hundreds of vectors simultaneously", my guess is that SSE code with a few Taylor iterations would be faster than using LUTs.
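For what the SSE variant could look like, here is a sketch that computes four sines at once with a four-term Taylor polynomial (x - x^3/3! + x^5/5! - x^7/7!). It assumes an x86 target with SSE2; the function name `taylor_sin4` is hypothetical, and I haven't benchmarked it against a LUT, so treat it as an illustration of the idea rather than proof of the speed claim.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, x86 only */

/* Hypothetical helper: out[i] ~= sin(in[i]) for i = 0..3. */
void taylor_sin4(const float *in, float *out)
{
    const __m128 inv_pi = _mm_set1_ps(0.318309886f);
    const __m128 pi     = _mm_set1_ps(3.14159265f);

    __m128 x = _mm_loadu_ps(in);

    /* k = round(x / pi); the conversion rounds to nearest under the
       default MXCSR rounding mode. */
    __m128i ki = _mm_cvtps_epi32(_mm_mul_ps(x, inv_pi));
    __m128  k  = _mm_cvtepi32_ps(ki);

    /* r = x - k*pi lies in roughly [-pi/2, pi/2]. */
    __m128 r  = _mm_sub_ps(x, _mm_mul_ps(k, pi));
    __m128 r2 = _mm_mul_ps(r, r);

    /* Horner evaluation of 1 - r^2/3! + r^4/5! - r^6/7!, then times r. */
    __m128 p = _mm_set1_ps(-1.0f / 5040.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(-1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f));
    p = _mm_mul_ps(p, r);

    /* sin(x) = (-1)^k * sin(r): shift the low bit of k into the float
       sign-bit position and XOR it in to negate lanes where k is odd. */
    __m128 sign = _mm_castsi128_ps(_mm_slli_epi32(ki, 31));
    _mm_storeu_ps(out, _mm_xor_ps(p, sign));
}
```

There are no memory lookups beyond loading the inputs, so nothing here can miss the cache the way a 16 KB table can.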
As for GPU offloading, you obviously design the code for the GPU generation you're targeting: texture lookups for older generations, calculations for newer. But I have no experience with GPU programming, so the above is basically all I know about it.