Web Resources for CS U610:
GP GPU Programming (aka Senior Honors Seminar)
Instructors:
Gene Cooperman
and
David Kaeli
Spring, 2008
Contents
- Room and Syllabus
- Running on new NVIDIA hardware
- GPU Projects
- Literature on Algorithms Converting
Random Access to Streaming Access
- Bulletin: How to use global atomic add and friends;
and other news
- Downloading the CUDA emulator
- SSH to course CUDA emulator
- Programming support
- Course Internal Links
(including NVIDIA G8
and CUDA Overview)
- Course External Links
Room: 166 West Village H
Syllabus:
syllabus.pdf
or syllabus.dvi .
Syllabus (part 2: rough schedule over semester):
syllabus-2.pdf
or syllabus-2.dvi .
Please begin looking at the
GPU Projects web page. You'll
find there a list of possible GPU projects that will continue to grow
over the Spring break. After Spring break, I will continue to answer
questions on the tasks required for individual projects, and I will
then ask you to choose a project.
Unfortunately, right now, it's set up as a protected web page.
It requires a CCIS account for access. (There is no N.U. access option.)
If that's a problem, I'll either mail out a copy of the web
page from time to time, or try to get a course Wiki set up that allows
wider access.
This is an e-mail from Dana Schaa (one of the TAs). In general,
if you have problems running, please write to the course TAs mailing
list. If you find interesting gotchas or other insights, please share
with the whole class using the course mailing list. Thanks. - Prof. Cooperman
Hey everyone,
Craig brought something to my attention. You can't just copy your SDK
over from the previous machine, because the libraries it contains are
compiled for a 32-bit system (the new one is 64-bit). Instead, I
placed the correct version of the SDK in the home directory when you
log in. You should copy (NOT MOVE) the SDK into the directory that you
create (e.g., cp -r NVIDIA_CUDA_SDK mydir/), and then copy your
project folder over from the previous machine.
Please let me know if anyone has trouble getting their code to run.
- Dana Schaa
Recall that for dealing with large data not fitting in shared cache, we
wish to emphasize streaming access over random access. This is because
a single access to global memory may require 400 GPU cycles. These
papers provide some algorithms for doing so. As you find other papers,
please add them to this list.
- "Overcoming the Memory Wall in
Symbolic Algebra: A Faster Permutation Algorithm" (Formally
Reviewed Communication),
G. Cooperman and X. Ma,
Communications in Computer Algebra
(CCA -- SIGSAM Bulletin) 36(4), pp. 1--4, 2002.
(SIGSAM Bulletin is now: Communications in Computer Algebra (CCA))
(.dvi.gz,
.ps.gz,
.pdf)
(Given the following code fragment for object rearrangement or permutation
multiplication:
object Z[], Y[]; int X[];
for (int i = 0; i < N; i++) Z[i] = Y[X[i]];
How can you do it faster?)
- "A Comparative Analysis of Parallel Disk-Based Methods
for Enumerating Implicit Graphs",
Eric Robinson, Daniel Kunkle and Gene Cooperman,
Proc. of 2007 International Workshop on
Parallel Symbolic and Algebraic
Computation (PASCO '07; a workshop of ISSAC-07),
ACM Press, 2007, pp. 78--87
(.pdf)
The central issue in using disk-based methods is converting low-level
algorithms and data structures from a form using random access to
a form using streaming access. This paper discusses several such
methods in the context of search.
-
In order to use global atomic add and friends in CUDA, you must
place the following line in the CUDA Makefile that NVIDIA provides.
(The makefile is $CUDA/projects/simpleAtomics/Makefile.)
SMVERSIONFLAGS= -arch sm_11
The global atomic functions were added to CUDA only as of version 1.1.
This line says to use the NVIDIA architecture for version 1.1.
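As a sketch of what the flag enables (the kernel below is illustrative, not taken from the SDK): atomicAdd() lets many threads safely increment the same location in global memory, for example to build a histogram.

```cuda
// Hypothetical histogram kernel using the global atomicAdd() that
// became available in CUDA 1.1 on sm_11 hardware.
__global__ void histogram(const int *data, int n, int *bins, int nbins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % nbins], 1);  // safe concurrent increment
}
```

Without -arch sm_11, nvcc targets the default sm_10 architecture, which lacks the atomic instructions, so the call will not compile.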
-
On a separate note, if you are looking for highly polished realistic
models of CUDA programming, please look at the examples in the SDK.
There is also a version of that available from the web, which this
web page links to, below.
-
Finally, by the end of Tuesday, Feb. 26, please mail your quicksort
solutions to Ayse Yilmazer at the address given in class.
(There were three people who missed our class on Friday. We discussed
details of the CUDA language, and how some of the unusual things were
motivated by the programming model, which was motivated by the hardware.
Some examples were
- There are 32 threads in a warp, because there are 8 scalar cores per
multiprocessor. The first group of 8 threads requires 4 cycles, and
each successive group of 8 threads requires 1 cycle. A little math
shows that a warp of 4 or 8 groups is the sweet spot (the average
number of cycles per group is low, and the warp is not excessively long).
-
A warp executes a single instruction. So, one would think that the
warps in a thread block would also execute together. If they did,
one would not need __syncthreads(). But warps do not
execute at the same time. To understand why, suppose the current
instruction is to read from global memory. The first warp may succeed.
But due to congestion from other chips (other SMs), the second warp
may be very slow. In this case, the first warp can continue to execute
later instructions not requiring global memory while we wait on the
second warp. We don't want the first warp finishing phase 1 of a
computation and going on without seeing the results of the second
warp. So, we must insert a __syncthreads() at the end of Phase 1
to tell the first warp to stop and wait for the other warps.
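A sketch of the pattern (hypothetical kernel; the phase computations are placeholders, and it assumes a block of 256 threads):

```cuda
// Without the __syncthreads() barrier, a warp that finishes Phase 1
// early could start Phase 2 before other warps have written their
// Phase-1 results to shared memory.
__global__ void twoPhase(float *out)
{
    __shared__ float partial[256];
    int t = threadIdx.x;
    partial[t] = (float)t * 2.0f;            // Phase 1 (placeholder work)
    __syncthreads();                         // all warps see every partial[]
    out[t] = partial[(t + 1) % blockDim.x];  // Phase 2 reads another warp's result
}
```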
- The CUDA system calls use output parameters rather than returning
their answer in a function result. This may indicate the influence
of FORTRAN, which they also support.
- To copy from host memory to device memory (always global memory),
you must use a parameter to indicate whether you are copying toward
device memory or toward host memory. This is because the other parameters
are simple addresses, and there is nothing in the declarations to
tell whether some user pointer came from host or device memory.
A different, more highly typed language design might have eliminated
this requirement.
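A sketch of the resulting API style (the buffer name and size are illustrative):

```cuda
// Both pointers are plain addresses; only the last argument tells
// the runtime which way the copy goes.
float h_buf[256];                              // host memory
float *d_buf;                                  // device (global) memory
cudaMalloc((void**)&d_buf, sizeof(h_buf));     // answer via output parameter
cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);
cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);
cudaFree(d_buf);
```

Note that cudaMalloc itself shows the output-parameter style mentioned above: the device address comes back through the first argument, not as the return value.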
-
Constant cache is different in style from shared cache. Constant cache
is an automatically managed read-only cache, with its
backing data in global memory. Like all global memory, you initialize
it from the host, and it persists across multiple kernels. The on-chip
cache will then bring it automatically from global memory to cache,
as needed.
If there is too much constant data,
then only some of it will be in the constant cache (the most recent
accesses). If all threads of a warp access the same constant address,
it proceeds in one step. If they require 5 distinct constant
addresses (for example), it proceeds in five steps, taking five times
as long. In this situation, we don't want one thread modifying
the common address while the others read from it, since the lack
of ordering would lead to undefined behavior. So, NVIDIA made
this automatically managed cache read-only. Clearly, NVIDIA also decided
that a read-only cache is cheaper in hardware real estate than a
full read-write cache.
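A sketch of the usage pattern (the names are illustrative):

```cuda
// Constant memory is declared at file scope, initialized from the
// host, and only read (never written) by kernels.
__constant__ float coeffs[16];

// Host side (sketch):
//   float h_coeffs[16] = { /* ... */ };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[0];  // all threads read the same address: one step
}
```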
- To copy from global memory on the device to shared memory, a simple
assignment statement (using "=") suffices. This is because you have
declared shared memory with the "__shared__" keyword. So, the compiler
always knows in which direction you are going.
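A sketch (hypothetical kernel, assuming a block of 256 threads):

```cuda
// Because buf is declared __shared__, a plain assignment from a
// global-memory pointer is enough; the compiler knows the direction.
__global__ void stage(const float *g_in, float *g_out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = g_in[t];          // global -> shared, just "="
    __syncthreads();
    g_out[t] = buf[t] * 2.0f;  // shared -> global, also just "="
}
```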
-
Note the example for using cudaMallocPitch, taken from the manual.
Presumably, this is done to have simultaneous access to multiple
memory banks of the global RAM. This seems to malloc many
buffers: one for each row. The buffers are spaced out so that
successive accesses occur in separate memory banks. This optimizes
access to two dimensional data structures.
// Host code: allocate height rows of width floats; cudaMallocPitch
// pads each row to a hardware-friendly pitch (in bytes).
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);

// Device code: step between rows by pitch bytes, not by width elements.
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{ for (int r = 0; r < height; ++r) {
    float* row = (float*)((char*)devPtr + r * pitch);
    for (int c = 0; c < width; ++c) {
      float element = row[c];
} } }
- Please help me to add any useful insights as
they occur to us. You can tell them to me in class, or e-mail them to me,
and I will add them here.
We will provide a CCIS Linux computer available via ssh that has
the emulator. If you prefer, you may download the CUDA emulator
onto your own computer and use it there.
If you do have an NVIDIA G8 graphics card, you will need
to use the "CUDA SDK version 1.1 for Linux", "for Windows XP", etc.
(The key phrase is SDK: Software Development Kit.)
When you read it, you will see that it supports only Windows XP,
Red Hat Enterprise Linux, and SUSE Linux.
If you do not have an NVIDIA G8 graphics card, please use this
alternative installation method, which does not require
any graphics card or graphics driver. These instructions are for Linux.
If someone tries the analogous method for Windows, please report back
the details of what you had to do.
- Check that you have gcc, g++, gdb, freeglut3 (Debian/Ubuntu/other package),
freeglut3-dev (Debian/Ubuntu/other package).
- To make the CUDA emulator work, we need to install some, but not all of the additional libraries
found in the graphics driver rpm package.
For 32-bit architectures, download:
ftp://download.nvidia.com/opensuse/10.2/i586/x11-video-nvidiaG01-169.07-1.1.i586.rpm
For 64-bit architectures, download
ftp://download.nvidia.com/opensuse/10.2/x86_64/x11-video-nvidiaG01-169.07-1.1.x86_64.rpm
To extract the files from a .rpm file, do:
rpm2cpio MYFILE.rpm | cpio -idv
(Typically, all of the files will be extracted into
a subdirectory usr that mirrors /usr.)
- Extract the file "usr/lib/libcuda.so.169.07" and place it in /usr/lib.
Extract the folder "usr/include/nvidia" and place it in /usr/include.
In addition, if you have a 64-bit architecture, extract the file
"usr/lib64/libcuda.so.169.07" and place it in /usr/lib64.
- Create the following symlinks:
ln -s libcuda.so.169.07 /usr/lib/libcuda.so
ln -s libglut.so.3 /usr/lib/libglut.so
If you have a 64-bit architecture, also do:
ln -s libcuda.so.169.07 /usr/lib64/libcuda.so
- Download and install the correct CUDA Toolkit version 1.1
for your Linux distribution,
provided at
http://www.nvidia.com/object/cuda_get.html .
- Download and install (unprivileged, not as root) the CUDA SDK version 1.1 for Linux
provided at
http://www.nvidia.com/object/cuda_get.html .
(The SDK will install in your home directory.)
- Try some
CUDA code samples,
and compile them according to the instructions in
CUDA SDK Release Notes. (But skip the instructions on
installing the SDK using NVIDIA-Linux-*-pkg*.run,
since you've done that already, and you don't want to overwrite your
graphics drivers.)
Accounts for our CUDA emulator. (Initially, no password required. Please set a password
the first time that you login.)
- ssh USER@129.10.112.244 or ssh USER@129.10.112.245,
where USER is the username you gave me for the course mailing list, but not including the hostname.
(For gene@ccs.neu.edu, I do ssh gene@129.10.112.245.)
Please note that 129.10.112.244 and 129.10.112.245 are independent machines with
independent user accounts.
- Execute (the first time that you login, only):
/usr/local/src/NVIDIA_CUDA_SDK_1.1_Linux.run
- Try compiling an example, as described in the
CUDA SDK Release Notes .
Since the NVIDIA G8 architecture is very new, it will be useful to
keep a record of all questions and issues that are encountered.
For programming support, please write to the e-mail account
"csu610-tas" with the same hostname as for my own e-mail: gene@ccs.neu.edu .
(I write it this way to protect us from unwanted 'bots.)
The new alias should be ready by the end of Monday, Jan. 14.
The alias will go to:
- Gene Cooperman (gene@ccs.neu.edu)
- Kapil Arya
- Dana Schaa
- Ayse Yilmazer
We're happy to give you the full set of e-mail addresses in class, but
we don't wish to place them on the web.
Slides of a mini-course on general parallel programming:
slides