I guess it depends on what kind of precision you need. Just a few Taylor series terms should give you a decent cos/sin approximation; I don't know how big a LUT that would correspond to, but "more than a few cache lines". A friend of mine says that a 4096-entry LUT was too small for his 3D engine, the motion was too jerky (and when dealing with 3D operations, the sin/cos calls aren't too costly compared to the matrix operations anyway).
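To make "a few Taylor terms" concrete, here is a rough scalar sketch in C. The names taylor_sin and taylor_cos are made up for illustration, and the choice of five terms plus folding the angle into [-pi/2, pi/2] is my assumption, not something measured against a real engine. By my arithmetic the worst-case error is roughly 4e-6 at the ends of the reduced range, before float rounding.

```c
/* Hypothetical helper, not from any library: sine from the first five
   Taylor terms, after folding x into roughly [-pi/2, pi/2]. */
float taylor_sin(float x)
{
    const float pi = 3.14159265f;

    /* Range reduction: k is x/pi rounded to nearest, so r = x - k*pi
       is small enough for the series to converge quickly.
       Assumes |x| is moderate (a few turns), so k fits and r stays accurate. */
    long k = (long)(x / pi + (x >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)k * pi;
    float r2 = r * r;

    /* r - r^3/3! + r^5/5! - r^7/7! + r^9/9!, evaluated in Horner form. */
    float p = r * (1.0f + r2 * (-1.0f / 6.0f
                 + r2 * (1.0f / 120.0f
                 + r2 * (-1.0f / 5040.0f
                 + r2 * (1.0f / 362880.0f)))));

    /* sin(r + k*pi) = (-1)^k * sin(r) */
    return (k % 2 != 0) ? -p : p;
}

/* cos(x) = sin(x + pi/2), so one polynomial covers both. */
float taylor_cos(float x)
{
    return taylor_sin(x + 1.57079633f);
}
```

The polynomial part is only multiplies and adds, so it should map onto packed SSE operations (four angles per register) fairly directly; the range reduction and sign flip would need a bit more care there.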
A 4096-entry LUT of single-precision floats = 16 KB, or 256 x 64-byte cache lines (and still not good enough accuracy). Cache misses are pretty expensive. And especially if you're dealing with "hundreds of vectors simultaneously", my guess is that SSE code with a few Taylor series terms would be faster than using LUTs.
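The LUT numbers above are easy to check in code. This is just a sizing sketch with made-up helper names (lut_bytes, lut_cache_lines, lut_step_degrees); it assumes 4-byte floats, a 64-byte cache line, and a table covering one full turn with no interpolation between entries.

```c
#include <stddef.h>

/* Illustrative constants; the 64-byte cache line is an assumption. */
enum { LUT_ENTRIES = 4096, CACHE_LINE_BYTES = 64 };

/* Total table size in bytes for single-precision entries
   (assumes sizeof(float) == 4, as on common platforms). */
size_t lut_bytes(size_t entries)
{
    return entries * sizeof(float);
}

/* Number of cache lines the table spans, rounded up. */
size_t lut_cache_lines(size_t entries, size_t line_bytes)
{
    return (lut_bytes(entries) + line_bytes - 1) / line_bytes;
}

/* Angular spacing between neighbouring entries, in degrees,
   for a table that covers one full turn. */
double lut_step_degrees(size_t entries)
{
    return 360.0 / (double)entries;
}
```

For 4096 entries this gives 16384 bytes, 256 cache lines, and a step of about 0.088 degrees. Without interpolation every angle snaps to the nearest step, so the lookup error can be up to about one step in radians (2*pi/4096, roughly 0.0015), which may be the kind of thing that shows up as jerkiness in slow rotations.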
As for GPU offloading, you obviously design the code for the GPU generation you're targeting: texture lookups for older generations, calculations for newer. But I have no experience with GPU programming, so the above is basically all I know about it.