Synchronization FunctionsΒΆ
Currently, numba-dppy
only supports some of the SYCL synchronization operations. For synchronization of all threads in the same thread block, numba-dppy
provides a helper function called numba_dppy.barrier()
. This function implements the same pattern as barriers in traditional
multi-threaded programming: invoking the function forces a thread to wait until
all threads in the block reach the barrier, at which point it returns control
to all its callers.
numba_dppy.barrier()
supports two memory fence options:
numba_dppy.CLK_GLOBAL_MEM_FENCE
: The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. Using the option can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data. Passing no arguments tonumba_dppy.barrier()
is equivalent to setting the global memory fence option. For example,def no_arg_barrier_support(): # @dppy.kernel("void(float32[::1])") @dppy.kernel def twice(A): i = dppy.get_global_id(0) d = A[i] # no argument defaults to global mem fence dppy.barrier() A[i] = d * 2 N = 10 arr = np.arange(N).astype(np.float32) print(arr) with dpctl.device_context("opencl:gpu") as gpu_queue: twice[N, dppy.DEFAULT_LOCAL_SIZE](arr) # there arr should be original arr * 2, i.e. [0, 2, 4, 6, ...] print(arr)
numba_dppy.CLK_LOCAL_MEM_FENCE
: The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory. For example,
def local_memory():
blocksize = 10
# @dppy.kernel("void(float32[::1])")
@dppy.kernel
def reverse_array(A):
lm = dppy.local.array(shape=10, dtype=float32)
i = dppy.get_global_id(0)
# preload
lm[i] = A[i]
# barrier local or global will both work as we only have one work group
dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE) # local mem fence
# write
A[i] += lm[blocksize - 1 - i]
arr = np.arange(blocksize).astype(np.float32)
print(arr)
with dpctl.device_context("opencl:gpu") as gpu_queue:
reverse_array[blocksize, dppy.DEFAULT_LOCAL_SIZE](arr)
# there arr should be orig[::-1] + orig, i.e. [9, 9, 9, ...]
print(arr)
Note
The numba_dppy.barrier()
function is semantically equivalent to
numba.cuda.syncthreads
.