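Mastering Pandas Time Series Processing: From Fundamentals to Practice

Time series data is ubiquitous in financial analysis, the Internet of Things, business intelligence, and similar fields. As a core library for data analysis in Python, Pandas offers comprehensive time series functionality. This article walks through Pandas time series processing systematically, from basic concepts to advanced applications, to help you handle time series data efficiently in real work.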

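1. Time Series Fundamentals

1.1 What Is Time Series Data

Time series data is a sequence of observations ordered in time, with the following characteristics:

Each data point is associated with a specific timestamp
Data points are temporally dependent on one another
The data often exhibits trend, seasonality, and cyclical patterns

Typical application scenarios include:

Stock price analysis
Weather data recording
Website traffic monitoring
Industrial production metric tracking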

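1.2 The Pandas Time Type System

Pandas provides a complete set of time types for handling time-related data:

Timestamp: a specific point in time (e.g. "2023-01-01 12:00:00"); it is the Pandas counterpart of Python's datetime, which it subclasses, and arrays of timestamps are stored as 64-bit integers, so vectorized operations are fast
Period: a span of time (e.g. "January 2023")
Timedelta: a duration (e.g. "3 days 5 hours")
DatetimeIndex: an index made up of Timestamps, used for time series data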

import pandas as pd

# Create one example of each time object
timestamp = pd.Timestamp('2023-01-01 12:00:00')
period = pd.Period('2023-01', freq='M')
timedelta = pd.Timedelta(days=5, hours=3)
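To see how these types relate to one another, the following minimal sketch combines them with ordinary arithmetic. The variable names (`start`, `later`, `elapsed`, `month`) are illustrative only and do not come from the example above.

```python
import pandas as pd

start = pd.Timestamp("2023-01-01 12:00:00")
delta = pd.Timedelta(days=5, hours=3)

# Timestamp + Timedelta yields another Timestamp
later = start + delta

# Subtracting two Timestamps yields a Timedelta
elapsed = later - start

# A Timestamp can be mapped to the Period that contains it
month = start.to_period("M")
month_start = month.start_time
month_end = month.end_time

print(later)    # 2023-01-06 15:00:00
print(elapsed)  # 5 days 03:00:00
print(month)    # 2023-01
```

A Period knows its own boundaries, which is convenient when a value describes a whole month rather than an instant.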

2. Creating and Converting Time Series

2.1 Creating Time Series

Pandas provides several ways to create time series:

# Create from a list of strings
date_strings = ['2023-01-01', '2023-01-15', '2023-02-01']
dates = pd.to_datetime(date_strings)

# Generate regular time series
daily = pd.date_range(start='2023-01-01', periods=7, freq='D')
# Month-end frequency; pandas >= 2.2 deprecates 'M' here in favor of 'ME'
monthly = pd.date_range(start='2023-01-01', periods=12, freq='M')

# Create a DataFrame with a time index
df = pd.DataFrame(
    {'value': [10, 20, 30, 40, 50]},
    index=pd.date_range('20230101', periods=5, freq='D'),
)

2.2 Converting Time Data

Real-world data may store time in many formats, which need to be standardized:

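# Handle date strings in various formats
# Note: pandas >= 2.0 infers a single format from the first element, so
# mixed formats need format='mixed'; '01-03-2023' is read month-first (January 3)
mixed_dates = ['20230101', '2023/01/02', '01-03-2023']
uniform_dates = pd.to_datetime(mixed_dates, format='mixed')

# Handle Unix timestamps (seconds since 1970-01-01 UTC)
timestamps = [1672531200, 1672617600, 1672704000]
unix_dates = pd.to_datetime(timestamps, unit='s')

# Handle Excel serial date numbers
excel_dates = [44926, 44927, 44928]  # serial date numbers from Excel
excel_converted = pd.to_datetime(excel_dates, unit='D', origin='1899-12-30')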
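Standardizing real data usually also means deciding what to do with values that cannot be parsed, and being explicit about day/month order. The sketch below shows one way to do both, under the assumption that an explicit `format` is known; the sample strings are made up for illustration.

```python
import pandas as pd

raw = ["2023-01-01", "2023-02-30", "not a date"]

# errors="coerce" turns unparseable or impossible dates into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
invalid_mask = parsed.isna()

# The same string means different dates under different conventions
month_first = pd.to_datetime("01-03-2023", format="%m-%d-%Y")  # January 3
day_first = pd.to_datetime("01-03-2023", format="%d-%m-%Y")    # March 1
```

Passing an explicit format avoids silent misreads of ambiguous strings, and the NaT mask makes it easy to inspect the rows that failed.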

3. Working with Time Indexes

3.1 Indexing and Slicing

A time index makes data access intuitive and efficient:

# Create sample data
idx = pd.date_range('2023-01-01', periods=365)
ts = pd.Series(range(365), index=idx)

# Slice by year
ts['2023']

# Slice by year and month
ts['2023-03']

# Slice by date range (both endpoints are included)
ts['2023-03-15':'2023-04-15']

# Use partial date strings
ts.loc['March 2023']
ts.loc['15th March 2023':'1st April 2023']
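One detail worth checking for yourself is that label-based slices on a DatetimeIndex include both endpoints, unlike positional slices. The sketch below rebuilds the same 365-day series and counts the selected rows; the `weekends` mask is an extra illustration that is not part of the original example.

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=365, freq="D")
ts = pd.Series(range(365), index=idx)

# A partial string selects every row in that month
march = ts.loc["2023-03"]

# Label-based slices on a DatetimeIndex include both endpoints
window = ts.loc["2023-03-15":"2023-04-15"]

# Boolean masks built from index attributes select non-contiguous dates
weekends = ts[ts.index.dayofweek >= 5]

print(len(march))   # 31
print(len(window))  # 32
```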

3.2 Frequency Conversion and Resampling

Resampling is one of the core operations in time series analysis:

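# Downsampling example (daily data -> monthly data)
# pandas >= 2.2 deprecates 'M' in favor of 'ME' and 'H' in favor of 'h'
monthly = ts.resample('M').mean()  # monthly mean
monthly_sum = ts.resample('M').sum()  # monthly total

# Upsampling example (daily data -> hourly data)
hourly = ts.resample('H').ffill()  # forward fill
hourly_interp = ts.resample('H').interpolate()  # linear interpolation

# Custom resampling rule
def last_n_days_mean(x, n=3):
    # Mean of the last n observations in each bin
    return x.iloc[-n:].mean()

weekly_custom = ts.resample('W').apply(last_n_days_mean)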
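The difference between the upsampling strategies is easiest to see on a tiny series. The following sketch upsamples three daily values to 12-hour steps; it passes a `pd.Timedelta` as the rule, which is a choice made here to avoid version-specific string aliases, not something the text above requires.

```python
import pandas as pd

daily = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)

# A Timedelta is passed as the rule to avoid version-specific string aliases
half_day = pd.Timedelta(hours=12)

gaps = daily.resample(half_day).asfreq()         # new slots are NaN
filled = daily.resample(half_day).ffill()        # carry the last value forward
smooth = daily.resample(half_day).interpolate()  # linear between known points
```

Forward fill suits values that stay constant until the next observation (such as a price list), while interpolation suits quantities that change continuously.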

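4. Time Series Computation and Feature Engineering

4.1 Moving Window Calculations

Window calculations are a common technique in time series analysis: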

# Simple moving average
rolling_mean = ts.rolling(window=7).mean()  # 7-day moving average

# Expanding window statistics
expanding_max = ts.expanding().max()

# Exponentially weighted moving average
ewma = ts.ewm(span=30).mean()

# Combine them into a feature table
features = pd.DataFrame({
    'value': ts,
    '7d_avg': rolling_mean,
    '30d_ewma': ewma,
    'ratio': ts / rolling_mean,
})
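Two rolling options that often matter in practice are `min_periods`, which controls how many observations a window needs before it emits a value, and offset-based windows such as `"3D"`, which are defined by elapsed time rather than row count. A small sketch on made-up data:

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

# Fixed-size window: the first (window - 1) results are NaN
strict = s.rolling(window=3).mean()

# min_periods=1 emits a value as soon as one observation is available
lenient = s.rolling(window=3, min_periods=1).mean()

# Offset window: covers the last 3 calendar days, however many rows that is
by_time = s.rolling("3D").mean()
```

On regularly spaced daily data the offset window and the lenient fixed window agree; they differ once the index has gaps.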

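4.2 Time Differences and Lag Features

# Compute time differences between consecutive index entries
time_deltas = ts.index.to_series().diff()

# Create lag features (df is rebuilt from ts so that a 7-day lag has enough rows)
df = ts.to_frame(name='value')
df['prev_day'] = df['value'].shift(1)  # previous day's value
df['day_over_day'] = df['value'] / df['prev_day'] - 1  # day-over-day change

# Week-over-week features
df['last_week'] = df['value'].shift(7)
df['week_over_week'] = df['value'] / df['last_week'] - 1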
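The lag-and-divide pattern above has built-in shortcuts. The sketch below, on a small made-up price series, shows that `pct_change()` matches the manual calculation and that `shift` can move the index instead of the values when given a `freq`.

```python
import pandas as pd

prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range("2023-01-02", periods=4, freq="D"),
)

manual = prices / prices.shift(1) - 1  # explicit lag-and-divide
builtin = prices.pct_change()          # same calculation in one call
abs_change = prices.diff()             # absolute change from the previous row

# shift(freq=...) moves the index forward and keeps every value
moved = prices.shift(1, freq="D")
```

Shifting values introduces NaN at the start of the series, whereas shifting the index keeps all values and changes the timestamps they are attached to.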

5. Advanced Time Series Techniques

5.1 Time Zone Handling

Time zone handling is essential in applications that span multiple regions:

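# Localize to a time zone
ts = ts.tz_localize('UTC')

# Convert to another time zone
ts_eastern = ts.tz_convert('US/Eastern')

# Handle daylight saving time transitions
# (pandas >= 2.2 deprecates 'H' in favor of 'h')
london_ts = pd.date_range('2023-03-25', '2023-03-27', freq='H')
df = pd.DataFrame(index=london_ts)
# 01:00 on 2023-03-26 does not exist in London; valid options for
# nonexistent include 'shift_forward', 'shift_backward', 'NaT', and 'raise'
df = df.tz_localize('Europe/London', ambiguous='infer', nonexistent='shift_forward')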

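5.2 Holidays and Business Days

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Get holidays
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2023-01-01', end='2023-12-31')

# Define a custom business day that skips those holidays
us_bd = CustomBusinessDay(calendar=cal)
business_days = pd.date_range('2023-01-01', '2023-12-31', freq=us_bd)

# Compute the number of business days between two dates
def biz_days_between(start, end):
    # Counts business days in [start, end] and subtracts one,
    # so it assumes start is itself a business day
    return len(pd.date_range(start, end, freq=us_bd)) - 1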
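A quick way to confirm that the custom offset really skips holidays is to look at a week that contains one. The sketch below uses the week of Independence Day 2023 (Tuesday, July 4) as a hand-picked check.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# Monday 2023-07-03 through Friday 2023-07-07; Tuesday 2023-07-04 is a holiday
week = pd.date_range("2023-07-03", "2023-07-07", freq=us_bd)

# Adding the offset skips weekends and holidays
next_day = pd.Timestamp("2023-07-03") + us_bd

print(len(week))  # 4
print(next_day)   # 2023-07-05 00:00:00
```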

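5.3 Time Series Decomposition

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition
result_add = seasonal_decompose(ts, model='additive', period=30)

# Multiplicative decomposition (requires strictly positive values,
# so the sample series, which starts at 0, is shifted up by 1)
result_mul = seasonal_decompose(ts + 1, model='multiplicative', period=30)

# Visualize the decomposition
result_add.plot()
plt.show()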
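To make the additive model (observed = trend + seasonal + residual) more concrete, here is a rough Pandas-only sketch of the idea on synthetic data. It is a simplification written for illustration and is not the exact algorithm statsmodels uses; the series, the 7-day `pattern`, and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

period = 7
n = 70
idx = pd.date_range("2023-01-01", periods=n, freq="D")

# Synthetic series: linear trend plus a weekly pattern that sums to zero
pattern = np.array([3.0, 1.0, -2.0, 0.0, 2.0, -1.0, -3.0])
position = np.arange(n) % period
series = pd.Series(0.5 * np.arange(n) + pattern[position], index=idx)

# Trend: centered moving average over one full period
trend = series.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position in the cycle
detrended = series - trend
seasonal_by_pos = detrended.groupby(position).mean()
seasonal = pd.Series(seasonal_by_pos.to_numpy()[position], index=idx)

# Residual: whatever the trend and seasonal components do not explain
residual = series - trend - seasonal
```

Because the synthetic data contains no noise, the residual is zero wherever the trend is defined; on real data the residual is where anomalies show up.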

6. Case Study: Stock Data Analysis

Let's bring the pieces together with a complete stock data analysis example:

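import matplotlib.pyplot as plt
import pandas as pd
import yfinance as yf

# Load sample data
data = yf.download('AAPL', start='2020-01-01', end='2023-01-01')
# Some yfinance versions return MultiIndex columns even for one ticker;
# if so, keep only the first level (assumed to hold the price field names)
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)

# Compute technical indicators
data['20d_ma'] = data['Close'].rolling(20).mean()
data['50d_ma'] = data['Close'].rolling(50).mean()
data['200d_ma'] = data['Close'].rolling(200).mean()
data['daily_return'] = data['Close'].pct_change()

# Resample to monthly data (pandas >= 2.2 prefers 'ME' over 'M')
monthly = data.resample('M').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last',
    'Volume': 'sum',
})

# Compute monthly returns
monthly['monthly_return'] = monthly['Close'] / monthly['Close'].shift(1) - 1

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(data['Close'], label='Daily Close')
plt.plot(data['200d_ma'], label='200-day MA')
plt.title('Apple Stock Price with Moving Average')
plt.legend()
plt.show()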
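Because the example above depends on a network download, it can be handy to rehearse the same pipeline offline. The following sketch swaps in synthetic prices (a seeded random walk, not real market data) with the same column names, and passes an offset object to `resample` to sidestep alias differences between Pandas versions.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (a seeded random walk), not real market data
rng = np.random.default_rng(0)
days = pd.bdate_range("2022-01-03", "2022-12-30")
close = pd.Series(100 + rng.normal(0, 1, len(days)).cumsum(), index=days)

data = pd.DataFrame({
    "Open": close.shift(1).fillna(close.iloc[0]),
    "High": close + 1.0,
    "Low": close - 1.0,
    "Close": close,
    "Volume": rng.integers(1_000, 5_000, len(days)),
})
data["20d_ma"] = data["Close"].rolling(20).mean()

# Monthly OHLCV bars: first open, highest high, lowest low, last close
monthly = data.resample(pd.offsets.MonthEnd()).agg({
    "Open": "first", "High": "max", "Low": "min",
    "Close": "last", "Volume": "sum",
})
monthly["monthly_return"] = monthly["Close"].pct_change()
```

Each column needs its own aggregation when building monthly bars, which is why `agg` takes a dictionary rather than a single function.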

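7. Performance Optimization and Best Practices

Optimization tips for working with large-scale time series data:

Use the correct time type: make sure time columns are datetime64 rather than strings
Set a time index: use the time column as the index to speed up lookups
Choose the frequency sensibly: lower the frequency to save memory when high precision is not needed
Process in batches: avoid loops and use vectorized operations
Use categorical types: use the category dtype for repetitive time features (such as hour or day of week)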

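# Optimization example (assumes df has a 'date' column stored as strings)
df['date'] = pd.to_datetime(df['date'])  # convert to datetime
df = df.set_index('date').sort_index()  # set a sorted time index

# Convert time features to categorical variables
df['hour'] = df.index.hour.astype('category')
df['day_of_week'] = df.index.dayofweek.astype('category')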
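The memory effect of the first and last tips can be measured directly. The sketch below compares a column of timestamp strings with its datetime64 equivalent, and an integer hour column with its categorical version; the row count and variable names are arbitrary choices for illustration, and the exact byte counts will vary by Pandas version.

```python
import pandas as pd

n = 100_000
stamps = pd.date_range("2023-01-01", periods=n, freq="min")

# The same timestamps stored as text and as datetime64
as_text = pd.Series(stamps.strftime("%Y-%m-%d %H:%M:%S"))
as_datetime = pd.to_datetime(as_text, format="%Y-%m-%d %H:%M:%S")

text_bytes = as_text.memory_usage(deep=True)
datetime_bytes = as_datetime.memory_usage(deep=True)

# A repetitive feature stored as integers and as a category
hour_int = pd.Series(stamps.hour)
hour_cat = hour_int.astype("category")

print(f"text: {text_bytes:,} bytes, datetime64: {datetime_bytes:,} bytes")
print(f"int hour: {hour_int.memory_usage(deep=True):,} bytes, "
      f"category hour: {hour_cat.memory_usage(deep=True):,} bytes")
```

Measuring on a sample of your own data before and after conversion is a reasonable way to decide whether a dtype change is worth making.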