EE 4702 Take-Home Pre-Final Questions: Solution
EE 4702
Take-Home Pre-Final Questions
Due: 9 December 2013
Please work on this exam alone. You may use references for OpenGL,
CUDA, and similar topics.
Good Luck!
Problem 1: [20 pts] The code below is the data_unpack routine from the Homework 3 solution. Recall
that the code in Homework 3 requires that the number of threads be no smaller than the number of
balls (the value of chain_length).
Re-write the routine so that it works correctly even if the number of balls is greater than the number of
threads.
// SOLUTION
__global__ void data_unpack_mb() {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
const int num_threads = blockDim.x * gridDim.x;
for ( int bi = tid; bi < dc.chain_length; bi += num_threads )
{
Ball* const ball = &dc.d_balls[bi];
ball->position = dc.d_pos[bi];
}
}
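The loop above is a grid-stride loop: each thread starts at its own thread id and advances by the total number of threads. The sketch below is a host-side C++ model of that indexing, not part of the original solution. The function names grid_stride_visits and covers_each_ball_once are hypothetical, and the model only counts visits rather than copying positions. It is meant to show that each ball index is handled exactly once whether there are more threads than balls or fewer.

```cpp
#include <cassert>
#include <vector>

// Host-side model (an illustration, not the course code) of the
// grid-stride loop used in data_unpack_mb. Returns how many times each
// ball index is visited.
std::vector<int> grid_stride_visits(int chain_length, int block_dim, int grid_dim)
{
  assert( block_dim > 0 && grid_dim > 0 );
  const int num_threads = block_dim * grid_dim;
  std::vector<int> visits(chain_length, 0);
  for ( int block = 0; block < grid_dim; block++ )
    for ( int thread = 0; thread < block_dim; thread++ )
      {
        // Same index arithmetic as in the kernel.
        const int tid = thread + block * block_dim;
        for ( int bi = tid; bi < chain_length; bi += num_threads )
          visits[bi]++;
      }
  return visits;
}

// True when every ball index is visited exactly once.
bool covers_each_ball_once(int chain_length, int block_dim, int grid_dim)
{
  for ( int v : grid_stride_visits(chain_length, block_dim, grid_dim) )
    if ( v != 1 ) return false;
  return true;
}
```

Calling covers_each_ball_once(1000, 32, 4) models 1000 balls handled by 128 threads, the case the problem asks about.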
Problem 2: [30 pts] Recall that the code for Homework 3 simulated a chain of balls (or string of beads). In
this problem we’ll extend that code by giving the balls electric charge, either positive or negative. The balls
are sealed in a special coating that retains the charge.
The CUDA kernel below applies the force due to these charges. Like the penetration routine it must consider
nearly all possible pairs. The code is written to optionally use shared memory when accessing the “b” object.
if ( a_idx != j )
{
float4 pos_b = use_cache ? c_pos[b_idx_start] : dc.d_pos[j];// Global Access C
pNorm a_to_b = mn(pos_a,pos_b);
const bool repel = ( a_idx & 1 ) == ( j & 1 );
force += ( repel ? -0.15f : 0.15f ) / a_to_b.mag_sq * a_to_b;
}
}
Problem 2, continued:
(a) In this part assume that use_cache is false, and so shared memory won’t be used. Compute the amount
of data requested due to accesses to d_pos and compute how much of that data is actually used. Do this for
the following configuration: chain_length=256, a block size of 1024 threads, and a grid size of 16 blocks.
The code runs on an NVIDIA Kepler GPU which has memory request sizes of 32, 64, and 128 bytes. For
your convenience, comments show where global memory can be accessed. (Array c_pos is not in global
memory, and in any case won’t be needed until the next subproblem.)
Combining the contributions from the two global accesses (ignoring Global Access B since use_cache is off): the total data requested
is $16n + c^2 = 2^{18} + 2^{16} = 262144 + 65536 = 327680$ bytes. The amount of data needed is
$16n + \frac{1}{2} c^2 = 2^{18} + 2^{15} = 262144 + 32768 = 294912$ bytes.
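The totals above can be checked with a few lines of arithmetic. The sketch below assumes that n is the total number of threads (1024 × 16 = 16384), that c is chain_length (256), and that 16 is the size of a float4 in bytes. These readings are assumptions; they are consistent with the stated powers of two but are not spelled out in the text. The names Traffic and d_pos_traffic are hypothetical.

```cpp
#include <cassert>

// Bytes requested from, and bytes actually needed from, d_pos.
struct Traffic { long requested; long needed; };

// Evaluates 16n + c^2 and 16n + c^2/2 under the ASSUMED meanings
// n = block_size * grid_size and c = chain_length.
Traffic d_pos_traffic(long block_size, long grid_size, long chain_length)
{
  assert( block_size > 0 && grid_size > 0 && chain_length > 0 );
  const long n = block_size * grid_size;   // Assumed: total threads.
  const long c = chain_length;             // Assumed: number of balls.
  const long bytes_per_float4 = 16;
  Traffic t;
  t.requested = bytes_per_float4 * n + c * c;
  t.needed    = bytes_per_float4 * n + c * c / 2;
  return t;
}
```

For the configuration in the problem, d_pos_traffic(1024, 16, 256) gives 327680 bytes requested and 294912 bytes needed, so under these assumptions 90% of the requested data is used.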
(b) Repeat the problem above for the case when use_cache is true. Remember that memory requests are
not generated when reading or writing c_pos itself.
(c) The d_pos accesses that are used to fill shared memory, c_pos[b_idx_start] = dc.d_pos[j];, are
wasteful (though the waste is small in proportion to the total amount of data accessed). Explain why the
accesses are wasteful and fix the problem.
Fix problem.
Why are these accesses wasteful?
The accesses are wasteful because at most one thread in a warp is active, and so half the request goes unused. (See the solution to
the previous problem.) The code below fixes the problem by having the first few threads load shared memory.
// SOLUTION
const int b_idx_start_thd_0 = blockIdx.x * blockDim.x / dc.chain_length;
const int threads_per_object_per_block = blockDim.x / dc.chain_length;
int b_to_shared_idx =
b_idx_start_thd_0 + threadIdx.x % threads_per_object_per_block;
(d) The amount of memory read when assigning pos_a can be reduced by using shared memory. Modify the
code to do so.
// SOLUTION
__shared__ float4 c_pos[1024];
if ( threadIdx.x < dc.chain_length ) c_pos[threadIdx.x] = dc.d_pos[a_idx];
__syncthreads();
float4 pos_a = c_pos[a_idx];
Problem 3: [15 pts] Consider the following excerpt from the Cone code from the Homework 2 solution.
(File hw2-sol.cc or visit http://www.ece.lsu.edu/koppel/gpup/gpup/2013/hw2-sol.cc.html).
glBindBuffer(GL_ARRAY_BUFFER, buffer_obj_coords);
glBufferData(GL_ARRAY_BUFFER, coords.occ() * sizeof(float),
coords.get_storage(), GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, buffer_obj_norms);
glBufferData(GL_ARRAY_BUFFER, norms.occ() * sizeof(float),
norms.get_storage(), GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, buffer_obj_norms);
glNormalPointer(GL_FLOAT,0,NULL);
glEnableClientState(GL_NORMAL_ARRAY);
// Draw the cones. (There will be minor flaws since 1 strip used.)
glDrawArrays(GL_QUAD_STRIP,0,num_coords);
Problem 3, continued:
(a) The code uses buffer objects for vertex coordinates and normals. Why doesn’t it also use a buffer object
for color?
(b) Modify the code so that it uses a buffer object for color, even if that’s not a good idea.
(c) Suppose that each time the routine above was called opt_lod had a different value. Also suppose that
if ( !buffer_obj_coords ) were somehow changed to if ( true ).
Problem 4: [20 pts] Consider another excerpt from the Cone code from the Homework 2 solution. (File
hw2-sol.cc starting at line 266, or visit
http://www.ece.lsu.edu/koppel/gpup/gpup/2013/hw2-sol.cc.html).
if ( !dont_set_color ) glColor3fv(color);
glBindBuffer(GL_ARRAY_BUFFER, buffer_obj_coords);
glVertexPointer( 3, GL_FLOAT, 0, NULL);
glEnableClientState(GL_VERTEX_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, buffer_obj_norms);
glNormalPointer(GL_FLOAT,0,NULL);
glEnableClientState(GL_NORMAL_ARRAY);
glDrawArrays(GL_QUAD_STRIP,0,num_coords);
The code at the top of the excerpt computes a transform matrix that will position and scale the cone to the
desired location. The transform is computed in terms of coordinate base, scalar radius, and vector to_apex.
The command glMultTransposeMatrixf takes the transform matrix we computed and multiplies it by the
existing modelview matrix. The resulting matrix is written as an OpenGL Shading Language uniform with
name gl_ModelViewMatrix.
In this problem we’ll consider computing the transform on the GPU instead. The GPU code will compute
the transform in terms of base, radius, and to_apex.
Problem 4, continued:
(a) Declare base, radius, and to_apex as uniforms as they would be in an OpenGL SL source file. Hint: Use
the demo-10 (Vertex and Geometry Shaders) code as an example, and look for wire_radius. The demo-10
code can be accessed from the repo or
http://www.ece.lsu.edu/koppel/gpup/code/gpup/demo-10-shader.cc.html (CPU code),
http://www.ece.lsu.edu/koppel/gpup/code/gpup/demo-10-shdr-simple.cc.html (simple shaders), and
http://www.ece.lsu.edu/koppel/gpup/code/gpup/demo-10-shdr-geo.cc.html (more elaborate shaders).
Solution appears below. Notice that the names of the vector data types in OpenGL shading language and CUDA differ, so base is
vec4, not float4.
// SOLUTION
layout ( location = 1 ) uniform vec4 base;
layout ( location = 2 ) uniform float radius;
layout ( location = 3 ) uniform vec3 to_apex;
(b) Write code to send the uniforms from the CPU to the GPU. Hint: See the previous hint.
The solution appears below. The command glUniformX writes a value into a uniform declared in an OpenGL shading language
program (as was done for the previous part). The first argument specifies the location, see above. The remaining arguments are the
values. Note that the digit near the end of the command specifies the number of arguments and that the last letter specifies the data
type, float in this case.
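A minimal sketch of what such calls might look like follows. It is not the original solution code. So that it compiles without an OpenGL context, it uses stand-in stubs in a hypothetical namespace gl_stub that take the same arguments as the real glUniform1f, glUniform3f, and glUniform4f and simply record them. The Vec3 and Vec4 structs, the function send_cone_uniforms, and the choice of the component-wise (non-v) forms are likewise assumptions. The locations 1, 2, and 3 match the layout qualifiers declared in part (a).

```cpp
#include <cassert>
#include <vector>

namespace gl_stub {
  // Stand-ins for the real OpenGL calls: same arguments, but they only
  // record what was written. Real code would call OpenGL directly.
  struct UniformWrite { int location; std::vector<float> values; };
  std::vector<UniformWrite> uniform_log;

  void glUniform1f(int location, float v0)
  { uniform_log.push_back({location, {v0}}); }
  void glUniform3f(int location, float v0, float v1, float v2)
  { uniform_log.push_back({location, {v0, v1, v2}}); }
  void glUniform4f(int location, float v0, float v1, float v2, float v3)
  { uniform_log.push_back({location, {v0, v1, v2, v3}}); }
}

// Hypothetical CPU-side vector types for this sketch.
struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };

// Send base, radius, and to_apex to uniform locations 1, 2, and 3.
void send_cone_uniforms(const Vec4& base, float radius, const Vec3& to_apex)
{
  using namespace gl_stub;
  assert( radius >= 0 );
  glUniform4f(1, base.x, base.y, base.z, base.w);     // vec4 base
  glUniform1f(2, radius);                             // float radius
  glUniform3f(3, to_apex.x, to_apex.y, to_apex.z);    // vec3 to_apex
}
```

In each call the digit (1, 3, or 4) matches the number of values passed after the location, and the trailing f indicates that the values are floats.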
(c) Suppose that xform is computed in a vertex shader making use of the enormous floating-point capability
of the mighty GPU. Do you expect execution to be faster or slower than using the CPU to compute xform?
Explain.
Slower than the CPU. An individual GPU thread is slower than a CPU thread. The CPU computes the transform once for use by
all vertices, so the GPU’s parallelism is no advantage. Each vertex shader would need to compute the transform and would take more
time to do so than the CPU.
(d) Suppose that xform is computed in a geometry shader. Will execution be faster or slower than using the
vertex shader when we are rendering quad strips? (That is, the systems we are comparing both use quad
strips, one uses the vertex shader to compute the xform, the other uses the geometry shader.) Explain.
Note: I should have asked about triangle strips.
Assuming that lighting and other calculations performed by the vertex shader do not take much time, the geometry shader will be
faster. With quad strips, most vertices are shared by two quads and so the number of vertices is twice the number of quads. That
means the transform would be computed twice as many times if the vertex shader were used. (With triangle strips the number of
vertices is about the same as the number of triangles, so the transform would be computed about the same number of times either way.)
(e) Suppose that xform is computed in a geometry shader. Will execution be faster or slower than using the
vertex shader when we are rendering individual quads? (That is, the systems we are comparing both use
individual quads, one uses the vertex shader to compute the xform, the other uses the geometry shader.)
Explain.
With individual quads the geometry shader has a big advantage, since the vertex shader is called four times for each quad and so the
transform would be computed four times as often if the vertex shader were used.
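The counts behind the answers to parts (d) and (e) can be written out as a small sketch. It follows the counting model used in the answers: the vertex shader runs once per vertex and the geometry shader runs once per quad. The function names are hypothetical, and the one-invocation-per-quad model is an assumption taken from the answers rather than a statement about how any particular driver handles quads.

```cpp
#include <cassert>

// Vertex shader invocations for a single quad strip: two vertices to
// start the strip plus two more for each quad.
long quad_strip_vertices(long num_quads)
{
  assert( num_quads >= 0 );
  return num_quads > 0 ? 2 * num_quads + 2 : 0;
}

// Vertex shader invocations for individual quads: no vertex is shared.
long individual_quad_vertices(long num_quads)
{
  assert( num_quads >= 0 );
  return 4 * num_quads;
}

// Geometry shader invocations, ASSUMING one invocation per quad.
long geometry_shader_invocations(long num_quads)
{
  assert( num_quads >= 0 );
  return num_quads;
}
```

For 100 quads, the transform would be computed 202 times in the vertex shader with a quad strip, 400 times in the vertex shader with individual quads, and 100 times in the geometry shader in either case.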
Problem 5: [15 pts] Answer each question below.
(a) Texture access is performed in the fragment shader. Suppose texture access was performed by the vertex
shader and the texel values were interpolated, the same way other attributes such as color were.
(b) Inputs to a fragment shader have interpolation qualifiers: flat, noperspective, and smooth.
Why are they not needed for the inputs to the geometry shader?
(c) The OpenGL command glMatrixMode(GL_MODELVIEW), used in setting the modelview matrix, is deprecated.
(Meaning that its use is discouraged and that it may be removed from the language.) Why was the
command once essential but now considered unnecessary?