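Contents
I. Root cause of the error
II. Troubleshooting steps
1. Check the current driver version
2. Check the CUDA runtime version
3. Verify driver and CUDA compatibility
III. Solutions
1. Make sure the driver is loaded correctly
2. Reinstall a matching driver and CUDA version
3. Verify that the environment is correct
IV. Key notes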
Error log:
bash nccl.sh
------------5. Install nccl-test and run the test-------------
Cloning into 'nccl-tests'...
remote: Enumerating objects: 504, done.
remote: Counting objects: 100% (347/347), done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 504 (delta 302), reused 206 (delta 194), pack-reused 157 (from 2)
Receiving objects: 100% (504/504), 188.86 KiB | 1.20 MiB/s, done.
Resolving deltas: 100% (341/341), done.
make -C src build BUILDDIR=/home/test/nccl-tests/build
make[1]: Entering directory '/home/test/nccl-tests/src'
Compiling timer.cc > /home/test/nccl-tests/build/timer.o
Compiling /home/test/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu