CUDA 中Thrust exclusive_scan使用详解

1. 基本概念

Thrust 是 NVIDIA CUDA 提供的类似 C++ STL 的并行算法库。
Scan (前缀和)：给定数组 [a0, a1, a2, ...]，产生前缀和序列。
Exclusive Scan (排他前缀和)：
- 输出位置 i 存放的是输入数组中 0 到 i-1 的累积结果。
- 换句话说，结果向右错位一格，首位放初始值。

公式：

output[0] = init
output[i] = input[0] ⊕ input[1] ⊕ ... ⊕ input[i-1],  (i > 0)

其中 ⊕ 是二元操作符（默认是加法）。

和 inclusive_scan 的区别：

inclusive：包含当前位置 → output[i] = input[0] + ... + input[i]
exclusive：不包含当前位置 → output[i] = input[0] + ... + input[i-1]

2. 函数原型

// 版本 1：默认加法
template <typename InputIterator, typename OutputIterator>
OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result);// 版本 2：指定初始值
template <typename InputIterator, typename OutputIterator, typename T>
OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result, T init);// 版本 3：指定二元运算符
template <typename InputIterator, typename OutputIterator, typename T, typename BinaryFunction>
OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result, T init, BinaryFunction binary_op);

参数说明：

first, last：输入区间
result：输出区间
init：初始值（默认 0）
binary_op：二元操作（默认 thrust::plus<T>()）

返回值：输出迭代器（即 result + (last - first)）

3. 示例代码

🔹 默认加法 + 默认初始值

#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <iostream>int main() {thrust::device_vector<int> d_in{1, 2, 3, 4};thrust::device_vector<int> d_out(4);thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin());// 输出：0 1 3 6for (int x : d_out) std::cout << x << " ";
}

解释：

输入 [1, 2, 3, 4]
输出 [0, 1, 3, 6]

指定初始值

thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin(), 10);

输入 [1, 2, 3, 4]，结果 [10, 11, 13, 16]

自定义运算符（乘法）

struct multiply
{__host__ __device__ int operator()(const int &a, const int &b) const {return a * b;}
};thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin(), 1, multiply());

输入 [1, 2, 3, 4]，结果 [1, 1, 2, 6]

4. 内部实现原理

exclusive_scan 在 GPU 上通常使用 并行前缀和算法（Blelloch Scan）：
1. 上行阶段（reduce）：树形归约，计算部分和。
2. 下行阶段（down-sweep）：回传累积和，得到 exclusive 结果。
复杂度：O(n)
并行化效率：在 CUDA 中可利用 warp shuffle / shared memory 优化。

5. 常见应用

数组索引计算：比如稀疏矩阵非零元素位置。
并行 compact/filter：布尔标记数组 → 前缀和 → 计算输出位置。
动态内存分配：统计每个线程写入位置。
图算法：CSR 格式构建、邻接表索引。

6. 易错点

exclusive vs inclusive：结果差一个位置，常见 bug。
初始值：没设置时默认 0，很多人误以为会报错。
in-place 使用：输入和输出可以是同一个 vector（安全）。
自定义运算符：必须满足结合律（associative），否则结果不稳定。

7. 性能优化

exclusive_scan 已由 Thrust 内部优化，适合中大规模数组。
小规模数组时，CPU std::exclusive_scan 可能更快。
多次调用时建议 预分配 device_vector，避免反复分配内存。

8、 `exclusive_scan` vs `inclusive_scan` 对比表

特性	`exclusive_scan`	`inclusive_scan`
定义	输出位置 `i` 是输入 `[0..i-1]` 的累积和，当前位置不包含	输出位置 `i` 是输入 `[0..i]` 的累积和，当前位置包含
首元素	结果 `output[0] = init`（默认 0）	结果 `output[0] = input[0]`
数学表达式	`out[i] = init ⊕ in[0] ⊕ ... ⊕ in[i-1]`	`out[i] = in[0] ⊕ in[1] ⊕ ... ⊕ in[i]`
输入 `[1, 2, 3, 4]`，init=0	输出 `[0, 1, 3, 6]`	输出 `[1, 3, 6, 10]`
典型用途	计算索引偏移、数组 compaction、CSR 图结构	直接得到前缀累计量，例如直方图累计、累积分布函数 (CDF)

9、典型应用场景代码

1️、数组索引偏移（exclusive_scan）

常见于 稀疏矩阵 / 图的 CSR 格式构建

#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <iostream>int main() {thrust::device_vector<int> flags{1, 0, 1, 1, 0};thrust::device_vector<int> indices(flags.size());// exclusive_scan：计算每个位置在输出数组的起始索引thrust::exclusive_scan(flags.begin(), flags.end(), indices.begin());// 输出: 0 1 1 2 3for (int x : indices) std::cout << x << " ";
}

解释：

flags 表示某位置是否有效
exclusive_scan 得到每个有效元素在输出数组中的索引位置

2️、数组 compaction（exclusive_scan + scatter）

保留布尔条件为真的元素，常用于 过滤数据

#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <iostream>int main() {thrust::device_vector<int> data{3, 7, 0, 4, 9};thrust::device_vector<int> flags{1, 0, 1, 0, 1};thrust::device_vector<int> indices(flags.size());thrust::device_vector<int> result(3); // 预估有效数量thrust::exclusive_scan(flags.begin(), flags.end(), indices.begin());// 根据 flags 把有效元素写入结果数组for (int i = 0; i < flags.size(); i++) {if (flags[i]) result[indices[i]] = data[i];}// 输出: 3 0 9for (int x : result) std::cout << x << " ";
}

3️、累积分布函数 CDF（inclusive_scan）

常用于 直方图后处理

#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <iostream>int main() {thrust::device_vector<int> hist{2, 3, 5, 4};thrust::device_vector<int> cdf(hist.size());thrust::inclusive_scan(hist.begin(), hist.end(), cdf.begin());// 输出: 2 5 10 14for (int x : cdf) std::cout << x << " ";
}

解释：

输入直方图 [2,3,5,4]
inclusive_scan 得到累积分布 [2,5,10,14]

4️、并行前缀乘积（inclusive_scan with multiply）

用于 概率链式计算

struct multiply {__host__ __device__ float operator()(float a, float b) const {return a * b;}
};thrust::device_vector<float> probs{0.9, 0.8, 0.7, 0.6};
thrust::device_vector<float> cumprod(probs.size());thrust::inclusive_scan(probs.begin(), probs.end(), cumprod.begin(), multiply());// 输入: 0.9 0.8 0.7 0.6
// 输出: 0.9 0.72 0.504 0.3024

10、总结

exclusive_scan：计算“排他前缀和”，结果右移一位，首位为初始值。
三种重载：默认 / 指定 init / 指定 init + 运算符。
应用广泛：稀疏矩阵、图算法、并行 compaction。
注意事项：exclusive vs inclusive，init 值，自定义运算符必须结合律。

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。
如若转载，请注明出处：http://www.pswp.cn/diannao/99230.shtml
繁体地址，请注明出处：http://hk.pswp.cn/diannao/99230.shtml

如若内容造成侵权/违法违规/事实不符，请联系多彩编程网进行投诉反馈email:809451989@qq.com，一经查实，立即删除！