Web Resources for CS U610:
GP GPU Programming (aka Senior Honors Seminar)
Instructors:
Gene Cooperman
and
David Kaeli
Spring, 2008
Contents
- Room and Syllabus
- Running on new NVIDIA hardware
- GPU Projects
- Literature on Algorithms Converting
Random Access to Streaming Access
- Bulletin: How to use global atomic add and friends;
and other news
- Downloading the CUDA emulator
- SSH to course CUDA emulator
- Programming support
- Course Internal Links
(including NVIDIA G8
and CUDA Overview)
- Course External Links
Room: 166 West Village H
Syllabus:
syllabus.pdf
or syllabus.dvi .
Syllabus (part 2: rough schedule over semester):
syllabus-2.pdf
or syllabus-2.dvi .
Please begin looking at the
GPU Projects web page. You'll
find there a list of possible GPU projects that will continue to grow
over the Spring break. After Spring break, I will continue to answer
questions on the tasks required for individual projects, and I will
then ask you to choose a project.
Unfortunately, right now, it's set up as a protected web page.
It requires a CCIS account for access. (There is no N.U. access option.)
If that's a problem, I'll either mail out a copy of the web
page from time to time, or try to get a course Wiki set up that allows
wider access.
This is an e-mail from Dana Schaa (one of the TAs). In general,
if you have problems running, please write to the course TAs mailing
list. If you find interesting gotchas or other insights, please share
with the whole class using the course mailing list. Thanks. - Prof. Cooperman
Hey everyone,
Craig brought something to my attention. You can't just copy your SDK
over from the previous machine, because the libraries it contains are
compiled for a 32-bit system (the new one is 64-bit). Instead, I
placed the correct version of the SDK in the home directory when you
log in. You should copy (NOT MOVE) the SDK into the directory that you
create (e.g., cp -r NVIDIA_CUDA_SDK mydir/), and then copy your
project folder over from the previous machine.
Please let me know if anyone has trouble getting their code to run.
- Dana Schaa
Recall that for dealing with large data not fitting in shared cache, we
wish to emphasize streaming access over random access. This is because
a single access to global memory may require 400 GPU cycles. These
papers provide some algorithms for doing so. As you find other papers,
please add them to this list.
- "Overcoming the Memory Wall in
Symbolic Algebra: A Faster Permutation Algorithm" (Formally
Reviewed Communication),
G. Cooperman and X. Ma,
Communications in Computer Algebra
(CCA -- SIGSAM Bulletin) 36(4), pp. 1--4, 2002.
(SIGSAM Bulletin is now: Communications in Computer Algebra (CCA))
(.dvi.gz,
.ps.gz,
.pdf)
(Given the following code fragment for object rearrangement or permutation
multiplication:
object Z[], Y[]; int X[];
for (int i = 0; i < N; i++) Z[i] = Y[X[i]];
How can you do it faster?)
- "A Comparative Analysis of Parallel Disk-Based Methods
for Enumerating Implicit Graphs",
Eric Robinson, Daniel Kunkle and Gene Cooperman,
Proc. of 2007 International Workshop on
Parallel Symbolic and Algebraic
Computation (PASCO '07; a workshop of ISSAC-07),
ACM Press, 2007, pp. 78--87
(.pdf)
The central issue in using disk-based methods is converting low-level
algorithms and data structures from a form using random access to
a form using streaming access. This paper discusses several such
methods in the context of search.
-
In order to use global atomic add and friends in CUDA, you must
place the following line in the CUDA Makefile that NVIDIA provides.
(The makefile is $CUDA/projects/simpleAtomics/Makefile.)
SMVERSIONFLAGS= -arch sm_11
The global atomic functions were added to CUDA only as of version 1.1.
This line says to use the NVIDIA architecture for version 1.1.
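As a sketch of what the flag enables (the kernel below is illustrative, not taken from the SDK): atomicAdd() lets many threads safely increment the same location in global memory, for example to build a histogram.

```cuda
// Hypothetical histogram kernel using the global atomicAdd() that
// became available in CUDA 1.1 on sm_11 hardware.
__global__ void histogram(const int *data, int n, int *bins, int nbins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % nbins], 1);  // safe concurrent increment
}
```

Without -arch sm_11, nvcc targets the default sm_10 architecture, which lacks the atomic instructions, so the call will not compile.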
-
On a separate note, if you are looking for highly polished realistic
models of CUDA programming, please look at the examples in the SDK.
There is also a version of that available from the web, which this
web page links to, below.
-
Finally, by the end of Tuesday, Feb. 26, please mail your quicksort
solutions to Ayse Yilmazer at the address given in class.
(There were three people who missed our class on Friday. We discussed
details of the CUDA language, and how some of the unusual things were
motivated by the programming model, which was motivated by the hardware.
Some examples were
- There are 32 threads in a warp, because there are 8 scalar cores per
multiprocessor. The first group of 8 threads requires 4 cycles, and
each successive group of 8 threads requires 1 cycle. A little math
shows that a warp of 4 or 8 groups is the sweet spot (the average
number of cycles per group is low, and the warp is not excessively long).
-
A warp executes a single instruction. So, one would think that the
warps in a thread block would also execute together. If they did,
one would not need __syncthreads(). But warps do not
execute at the same time. To understand why, suppose the current
instruction is to read from global memory. The first warp may succeed.
But due to congestion from other chips (other SMs), the second warp
may be very slow. In this case, the first warp can continue to execute
later instructions not requiring global memory while we wait on the
second warp. We don't want the first warp finishing phase 1 of a
computation and going on without seeing the results of the second
warp. So, we must insert a __syncthreads() at the end of Phase 1
to tell the first warp to stop and wait for the other warps.
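A sketch of the pattern (hypothetical kernel; the phase computations are placeholders, and it assumes a block of 256 threads):

```cuda
// Without the __syncthreads() barrier, a warp that finishes Phase 1
// early could start Phase 2 before other warps have written their
// Phase-1 results to shared memory.
__global__ void twoPhase(float *out)
{
    __shared__ float partial[256];
    int t = threadIdx.x;
    partial[t] = (float)t * 2.0f;            // Phase 1 (placeholder work)
    __syncthreads();                         // all warps see every partial[]
    out[t] = partial[(t + 1) % blockDim.x];  // Phase 2 reads another warp's result
}
```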
- The CUDA system calls use output parameters rather than returning
their answer in a function result. This may indicate the influence
of FORTRAN, which they also support.
- To copy from host memory to device memory (always global memory),
you must use a parameter to indicate whether you are copying toward
device memory or toward host memory. This is because the other parameters
are simple addresses, and there is nothing in the declarations to
tell whether some user pointer came from host or device memory.
A different, more highly typed language design might have eliminated
this requirement.
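A sketch of the resulting API style (the buffer name and size are illustrative):

```cuda
// Both pointers are plain addresses; only the last argument tells
// the runtime which way the copy goes.
float h_buf[256];                              // host memory
float *d_buf;                                  // device (global) memory
cudaMalloc((void**)&d_buf, sizeof(h_buf));     // answer via output parameter
cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);
cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);
cudaFree(d_buf);
```

Note that cudaMalloc itself shows the output-parameter style mentioned above: the device address comes back through the first argument, not as the return value.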
-
Constant cache is different in style from shared cache. Constant cache
is an automatically managed read-only cache, with its
backing data in global memory. Like all global memory, you initialize
it from the host, and it persists across multiple kernels. The on-chip
cache will then bring it automatically from global memory to cache,
as needed.
If there is too much constant data,
then only some of it will be in the constant cache (the most recent
accesses). If all threads of a warp access the same constant address,
it proceeds in one step. If they require 5 distinct constant
addresses (for example), it proceeds in five steps, taking five times
as long. In this situation, we don't want one thread modifying
the common address while the others read from it, since the lack
of ordering would lead to undefined behavior. So, NVIDIA made
this automatically managed cache read-only. Clearly, NVIDIA also decided
that a read-only cache is cheaper in hardware real estate than a
full read-write cache.
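A sketch of the usage pattern (the names are illustrative):

```cuda
// Constant memory is declared at file scope, initialized from the
// host, and only read (never written) by kernels.
__constant__ float coeffs[16];

// Host side (sketch):
//   float h_coeffs[16] = { /* ... */ };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[0];  // all threads read the same address: one step
}
```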
- To copy from global memory on the device to shared memory, a simple
assignment statement (using "=") suffices. This is because you have
declared shared memory with the "__shared__" keyword. So, the compiler
always knows in which direction you are going.
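A sketch (hypothetical kernel, assuming a block of 256 threads):

```cuda
// Because buf is declared __shared__, a plain assignment from a
// global-memory pointer is enough; the compiler knows the direction.
__global__ void stage(const float *g_in, float *g_out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = g_in[t];          // global -> shared, just "="
    __syncthreads();
    g_out[t] = buf[t] * 2.0f;  // shared -> global, also just "="
}
```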
-
Note the example for using cudaMallocPitch, taken from the manual.
Presumably, this is done to have simultaneous access to multiple
memory banks of the global RAM. This seems to malloc many
buffers: one for each row. The buffers are spaced out so that
successive accesses occur in separate memory banks. This optimizes
access to two dimensional data structures.
// Host code: allocate height rows of width floats; cudaMallocPitch
// pads each row to a hardware-friendly pitch (in bytes).
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);

// Device code: step between rows by pitch bytes, not by width elements.
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{ for (int r = 0; r < height; ++r) {
    float* row = (float*)((char*)devPtr + r * pitch);
    for (int c = 0; c < width; ++c) {
      float element = row[c];
} } }
- Please help me to add any useful insights as
they occur to us. You can tell them to me in class, or e-mail them to me,
and I will add them here.
We will provide a CCIS Linux computer available via ssh that has
the emulator. If you prefer, you may download the CUDA emulator
onto your own computer and use it there.
If you do have an NVIDIA G8 graphics card, you will need
to use the "CUDA SDK version 1.1 for Linux", "for Windows XP", etc.
(The key phrase is SDK: Software Development Kit.)
When you read it, you will see that it supports only Windows XP,
Red Hat Enterprise Linux, and SUSE Linux.
If you do not have an NVIDIA G8 graphics card, please use this
alternative installation method, which does not require
any graphics card or graphics driver. These instructions are for Linux.
If someone tries the analogous method for Windows, please report back
the details of what you had to do.
- Check that you have gcc, g++, gdb, freeglut3 (Debian/Ubuntu/other package),
freeglut3-dev (Debian/Ubuntu/other package).
- To make the CUDA emulator work, we need to install some, but not all of the additional libraries
found in the graphics driver rpm package.
For 32-bit architectures, download:
ftp://download.nvidia.com/opensuse/10.2/i586/x11-video-nvidiaG01-169.07-1.1.i586.rpm
For 64-bit architectures, download
ftp://download.nvidia.com/opensuse/10.2/x86_64/x11-video-nvidiaG01-169.07-1.1.x86_64.rpm
To extract the files from a .rpm file, do:
rpm2cpio MYFILE.rpm | cpio -idv
(Typically, all of the files will be extracted into
a subdirectory usr that mirrors /usr.)
- Extract the file "usr/lib/libcuda.so.169.07" and place it in /usr/lib.
Extract the folder "usr/include/nvidia" and place it in /usr/include.
In addition, if you have a 64-bit architecture, extract the file
"usr/lib64/libcuda.so.169.07" and place it in /usr/lib64.
- Create the following symlinks:
ln -s libcuda.so.169.07 /usr/lib/libcuda.so
ln -s libglut.so.3 /usr/lib/libglut.so
If you have a 64-bit architecture, also do:
ln -s libcuda.so.169.07 /usr/lib64/libcuda.so
- Download and install the correct CUDA Toolkit version 1.1
for your Linux distribution,
provided at
http://www.nvidia.com/object/cuda_get.html .
- Download and install (unprivileged, not as root) the CUDA SDK version 1.1 for Linux
provided at
http://www.nvidia.com/object/cuda_get.html .
(The SDK will install in your home directory.)
- Try some
CUDA code samples,
and compile them according to the instructions in
CUDA SDK Release Notes. (But skip the instructions on
installing the SDK using NVIDIA-Linux-*-pkg*.run,
since you've done that already, and you don't want to overwrite your
graphics drivers.)
Accounts for our CUDA emulator. (Initially, no password required. Please set a password
the first time that you login.)
- ssh USER@129.10.112.244 or ssh USER@129.10.112.245,
where USER is the username you gave me for the course mailing list, but not including the hostname.
(For gene@ccs.neu.edu, I do ssh gene@129.10.112.245.)
Please note that 129.10.112.244 and 129.10.112.245 are independent machines with
independent user accounts.
- Execute (the first time that you login, only):
/usr/local/src/NVIDIA_CUDA_SDK_1.1_Linux.run
- Try compiling an example, as described in the
CUDA SDK Release Notes .
Since the NVIDIA G8 architecture is very new, it will be useful to
keep a record of all questions and issues that are encountered.
For programming support, please write to the e-mail account
"csu610-tas" with the same hostname as for my own e-mail: gene@ccs.neu.edu .
(I write it this way to protect us from unwanted 'bots.)
The new alias should be ready by the end of Monday, Jan. 14.
The alias will go to:
- Gene Cooperman (gene@ccs.neu.edu)
- Kapil Arya
- Dana Schaa
- Ayse Yilmazer
We're happy to give you the full set of e-mail addresses in class, but
we don't wish to place them on the web.
Slides of a mini-course on general parallel programming:
slides