Title: Machine Learning Performance Engineer
Location: San Ramon, CA (preferred), with 50% travel
Duration: 6+ month contract
Skills Required: ML, CUDA, Python
Description
Must be willing to travel to customer sites. CUDA installation, configuration, and tuning issues are slowing down adoption of the technology. Job responsibilities include diagnosing and fixing these issues.
Requirements:
- An understanding of modern ML techniques and toolsets
- The experience and systems knowledge required to debug a training run's performance end to end
- Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
- Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute
- Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
- Intuition about the latency and throughput characteristics of CUDA graph launch, Tensor Core arithmetic, warp-level synchronization, and asynchronous memory loads
- Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and in using these networking technologies to link up GPU clusters
- An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
- An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools