Synchronization FunctionsΒΆ

Currently, numba-dppy only supports some of the SYCL synchronization operations. For synchronization of all threads in the same thread block, numba-dppy provides a helper function called numba_dppy.barrier(). This function implements the same pattern as barriers in traditional multi-threaded programming: invoking the function forces a thread to wait until all threads in the block reach the barrier, at which point it returns control to all its callers.

numba_dppy.barrier() supports two memory fence options:

  • numba_dppy.CLK_GLOBAL_MEM_FENCE: The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. Using the option can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data. Passing no arguments to numba_dppy.barrier() is equivalent to setting the global memory fence option. For example,

    def no_arg_barrier_support():
        # @dppy.kernel("void(float32[::1])")
        @dppy.kernel
        def twice(A):
            i = dppy.get_global_id(0)
            d = A[i]
            # no argument defaults to global mem fence
            dppy.barrier()
            A[i] = d * 2
    
        N = 10
        arr = np.arange(N).astype(np.float32)
        print(arr)
    
        with dpctl.device_context("opencl:gpu") as gpu_queue:
            twice[N, dppy.DEFAULT_LOCAL_SIZE](arr)
    
        # there arr should be original arr * 2, i.e. [0, 2, 4, 6, ...]
        print(arr)
    
  • numba_dppy.CLK_LOCAL_MEM_FENCE: The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory. For example,

def local_memory():
    blocksize = 10

    # @dppy.kernel("void(float32[::1])")
    @dppy.kernel
    def reverse_array(A):
        lm = dppy.local.array(shape=10, dtype=float32)
        i = dppy.get_global_id(0)

        # preload
        lm[i] = A[i]
        # barrier local or global will both work as we only have one work group
        dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE)  # local mem fence
        # write
        A[i] += lm[blocksize - 1 - i]

    arr = np.arange(blocksize).astype(np.float32)
    print(arr)

    with dpctl.device_context("opencl:gpu") as gpu_queue:
        reverse_array[blocksize, dppy.DEFAULT_LOCAL_SIZE](arr)

    # there arr should be orig[::-1] + orig, i.e. [9, 9, 9, ...]
    print(arr)

Note

The numba_dppy.barrier() function is semantically equivalent to numba.cuda.syncthreads.