LLM Basics 1: How Language Models Process Text

Based on the GitHub project: https://github.com/datawhalechina/llms-from-scratch-cn

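Tools

tiktoken: a tokenizer library developed by OpenAI
torch: PyTorch, a tensor computation and deep learning library originally developed by Facebook (now Meta); in effect a very powerful calculator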

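Understanding word embeddings: drawing a "portrait" of each word

Traditional approach: give every word a number (like a student ID).
Word embedding: describe every word along many dimensions, the way a painting can be described by its color, shape, and texture. A computer cannot work with such verbal descriptions, so each word is represented as a multi-dimensional numeric vector instead.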

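For example:

"king" → [0.8, -0.2, 0.5]

"queen" → [0.7, -0.1, 0.6]

"apple" → [0.1, 0.9, -0.3]

By computing a vector similarity measure such as cosine similarity, the computer can tell that "king" is closely related to "queen" and has little to do with "apple".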
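To make the comparison concrete, here is a minimal sketch that computes cosine similarity for the three toy vectors above. It uses only the standard library; the function name `cosine_similarity` is chosen for illustration, and real embeddings have hundreds or thousands of dimensions rather than three.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings from the example above
king = [0.8, -0.2, 0.5]
queen = [0.7, -0.1, 0.6]
apple = [0.1, 0.9, -0.3]

king_queen = cosine_similarity(king, queen)  # close to 1: similar direction
king_apple = cosine_similarity(king, apple)  # below 0: dissimilar direction
print(round(king_queen, 2), round(king_apple, 2))
```

With these numbers the similarity is roughly 0.98 for king/queen and roughly -0.27 for king/apple.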

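Text tokenization: cutting text into "chunks"

(A model cannot work on a long raw string directly, so the text is first cut into smaller pieces.)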

Step 1: read the text

with open("sample.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Step 2: split the text with a tokenizer

Example: "Hello, world."

# Simple split: cut on whitespace (the whitespace itself is kept as an item)
["Hello,", " ", "world."]

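import re
# Better split: also handle punctuation
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.?_!"()\']|--|\s)', text)
# The regular expression r'([,.?_!"()\']|--|\s)' matches any one of:
#   a single punctuation character: , . ? _ ! " ( ) '
#   a double hyphen: --
#   a whitespace character: \s
# The outer parentheses make the whole pattern a capturing group, so re.split keeps the separators in the result.
# Empty strings '' appear where two separators are adjacent or where a separator ends the string.
# The raw result therefore still contains empty strings and whitespace items; once those are removed in step 3 it becomes:
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']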

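Step 3: clean up the split result

# Drop empty and whitespace-only items
result = [item.strip() for item in result if item.strip()]
# strip() removes leading and trailing whitespace from a single string; items that end up empty are filtered out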

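After this processing, the computer no longer sees one long run of characters but a list of separate tokens, which is the starting point for learning the relationships between words.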
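The split and cleanup steps can be run together as a short sketch. The helper name `simple_tokenize` is hypothetical; the regular expression is the one used in step 2.

```python
import re

SPLIT_PATTERN = r'([,.?_!"()\']|--|\s)'

def simple_tokenize(text):
    """Split on punctuation, double hyphens, and whitespace, then drop empty pieces."""
    pieces = re.split(SPLIT_PATTERN, text)
    return [piece.strip() for piece in pieces if piece.strip()]

text = "Hello, world. Is this-- a test?"

# Raw split: separators are kept, and empty strings show up between adjacent separators
raw_pieces = re.split(SPLIT_PATTERN, text)
print(raw_pieces[:4])   # ['Hello', ',', '', ' ']

# Cleaned tokens: whitespace-only and empty items removed
tokens = simple_tokenize(text)
print(tokens)
```

Note that dropping whitespace makes the token list shorter but also means the exact original spacing cannot be recovered later.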

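Converting tokens to token IDs

An analogy for why this is done:

A teacher cannot remember every student's name → call the roll by student ID

A computer cannot remember every word → process text as numeric IDs

Advantages:

Computers process numbers much faster than text
Numbers are better suited to mathematical operations (such as word embeddings)
A uniform format is easy to store and transmit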

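Step 1: collect all words (deduplicated)

all_words = sorted(set(preprocessed))  # preprocessed is the cleaned token list; set() removes duplicates, sorted() returns them as a sorted list
vocab_size = len(all_words)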

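Step 2: build the word ↔ ID mappings

# Like a word list: word → index
vocab = {token: idx for idx, token in enumerate(all_words)}  # e.g. {"!": 0, '"': 1, "'": 2, ... "He": 50}
# The reverse mapping, index → word, is needed as well
int_to_str = {idx: token for token, idx in vocab.items()}  # e.g. {0: "!", 1: '"', 2: "'", ... 50: "He"}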

Example: every word maps to its own unique ID, which makes lookup easy

Token	ID	Token	ID
!	0	HAD	46
"	1	Had	47
'	2	Hang	48
(	3	...	...
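The two steps can be tried on a tiny text. This is only a sketch: the sentence in `sample` is made up for illustration, so the IDs differ from the ones in the table, which come from a much larger text.

```python
import re

# Made-up sample text, tokenized with the same split-and-clean approach as before
sample = "the cat sat on the mat. the end."
tokens = [t.strip() for t in re.split(r'([,.?_!"()\']|--|\s)', sample) if t.strip()]

# Step 1: unique tokens in sorted order
all_words = sorted(set(tokens))

# Step 2: forward (token -> ID) and reverse (ID -> token) mappings
vocab = {token: idx for idx, token in enumerate(all_words)}
int_to_str = {idx: token for token, idx in vocab.items()}

ids = [vocab[t] for t in tokens]
print(vocab)  # {'.': 0, 'cat': 1, 'end': 2, 'mat': 3, 'on': 4, 'sat': 5, 'the': 6}
print(ids)    # [6, 1, 5, 4, 6, 3, 0, 6, 2, 0]
```

Sorting before numbering makes the mapping reproducible: the same text always yields the same IDs.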

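Building a tokenizer

# Basic tokenizer
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab  # word → ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID → word

    def encode(self, text):  # text → numbers
        # 1. Split the text into words
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        # 2. Remove whitespace and empty items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # 3. Look up each word's ID
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):  # numbers → text
        # 1. Convert IDs back to words and 2. join them into a sentence
        text = " ".join(self.int_to_str[i] for i in ids)
        # 3. Fix spacing: remove the space in front of punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text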

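Usage example:

tokenizer = SimpleTokenizerV1(vocab)  # encode and decode are called on an instance, not on the class
text = """"It's the last he painted, you know," Mrs. Gisburn said."""
ids = tokenizer.encode(text)
# [1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, ...]
decoded = tokenizer.decode(ids)
# Gives back the sentence; a decoder that only removes the space before punctuation can leave a space after an opening quote or an apostrophe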

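Handling special tokens

(What happens when a word is not in the vocabulary)

Example:

text = "Hello, do you like tea?"
tokenizer.encode(text)
# Raises KeyError: 'Hello' ("Hello" is not in the vocabulary)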

Solution: add special tokens

all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])  # then rebuild vocab from all_tokens

Special token	Purpose
<|unk|>	Unknown word
<|endoftext|>	End-of-text marker
<|bos|>	Beginning-of-text marker
<|pad|>	Padding, to align sequence lengths
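As an illustration of the padding role listed in the table, the sketch below pads a batch of ID sequences to the same length. Everything here is assumed for the example: the toy vocabulary, its ID values, and the helper name `pad_batch` are not taken from the project.

```python
# Hypothetical toy vocabulary that includes the special tokens from the table
vocab = {"<|pad|>": 0, "<|bos|>": 1, "<|endoftext|>": 2, "hello": 3, "world": 4, "tea": 5}

def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [
    [vocab["<|bos|>"], vocab["hello"], vocab["world"], vocab["<|endoftext|>"]],
    [vocab["<|bos|>"], vocab["tea"], vocab["<|endoftext|>"]],
]

padded = pad_batch(batch, vocab["<|pad|>"])
print(padded)  # [[1, 3, 4, 2], [1, 5, 2, 0]]
```

Equal-length rows are what allow a batch to be stored as a single rectangular tensor.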

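# Upgraded tokenizer
import re

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab  # word → ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID → word

    def encode(self, text):  # text → numbers
        # 1. Split the text into words and 2. remove whitespace and empty items
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace any word that is not in the vocabulary with <|unk|>
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        # 3. Look up each word's ID
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):  # numbers → text
        # 1. Convert IDs back to words and 2. join them into a sentence
        text = " ".join(self.int_to_str[i] for i in ids)
        # 3. Fix spacing: remove the space in front of punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text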

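# Using the special tokens
tokenizer = SimpleTokenizerV2(vocab)  # vocab must include <|endoftext|> and <|unk|>
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
# Join the two texts with <|endoftext|>
text = text1 + " <|endoftext|> " + text2
ids = tokenizer.encode(text)
# [1160, 5, 362, ... 1159, 57, 1013, ... 1160, 7]
tokenizer.decode(ids)
# "<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>."
# Note: "Hello" and "palace" are not in the vocabulary, so both come back as <|unk|>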
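The unknown-word fallback can be checked end to end on a small scale. The sketch below is a function-based variant rather than the class above, and the `corpus` sentence is invented so that "Hello" and "palace" are missing from the vocabulary; the IDs it produces are therefore unrelated to the ones shown in the comments above.

```python
import re

SPLIT_PATTERN = r'([,.?_!"()\']|--|\s)'

def tokenize(text):
    return [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]

# Invented training text: contains neither "Hello" nor "palace"
corpus = "Well, do you like tea? In the sunlit terraces of the garden."
all_tokens = sorted(set(tokenize(corpus)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

str_to_int = {token: idx for idx, token in enumerate(all_tokens)}
int_to_str = {idx: token for token, idx in str_to_int.items()}

def encode(text):
    # Words missing from the vocabulary fall back to the <|unk|> ID
    unk_id = str_to_int["<|unk|>"]
    return [str_to_int.get(tok, unk_id) for tok in tokenize(text)]

def decode(ids):
    text = " ".join(int_to_str[i] for i in ids)
    # Remove the space in front of punctuation
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

text = "Hello, do you like tea?" + " <|endoftext|> " + "In the sunlit terraces of the palace."
ids = encode(text)
decoded = decode(ids)
print(decoded)
```

The round trip is lossy by design: once a word has been mapped to `<|unk|>`, the original spelling cannot be recovered from the IDs.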

Process summary: