"On the limits of GPU Acceleration" (short summary from Richard Vuduc): http://v...

"On the limits of GPU Acceleration" (short summary from Richard Vuduc): http://vuduc.org/pubs/vuduc2010-hotpar-cpu-v-gpu.pdf

"Understanding the design trade-offs among current multicore systems for numerical computations" (somewhat more technical): http://dx.doi.org/10.1109/IPDPS.2009.5161055

"Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU" (from Intel, but using the best published GPU implementations): http://doi.acm.org/10.1145/1815961.1816021

BTW, you may as well cite CUSP (http://code.google.com/p/cusp-library/) for the sparse implementation. It's not part of CUDA, even though it was developed by Nathan Bell and Michael Garland (NVIDIA employees).