
News Sentiment and Machine Learning Investment Strategy
This paper uses machine learning to construct a quantitative stock investment strategy from large-scale media coverage. First, we use iterative estimation to improve SESTM (sentiment extraction via screening and topic modeling), and verify through Monte Carlo simulation that the improved method extracts sentiment more accurately. Second, we apply it to more than 1 million news articles on CSI 300 index constituents from 2013 to 2020 and construct a stock investment strategy. The results show that the annualized return of the news-sentiment strategy, net of transaction costs, far exceeds the return of the market index over the same period. Provided the training set is kept timely, iterative estimation improves strategy returns, and the improvement is more pronounced in periods of sharp market volatility. Even in the face of sudden events such as COVID-19, the strategy can still earn sizable returns by updating the model more frequently. Further analysis finds that the strategy premium arises from the ability of news sentiment to predict stock returns and from differences across assets in the speed of information absorption. The strategy performs better among stocks with small market capitalization, low turnover, and low beta, because these stocks absorb news more slowly, which leaves room for arbitrage by a machine learning trading strategy based on news sentiment.
Keywords: text analysis / machine learning / quantitative investment / news sentiment
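Table 1 Comparison of returns across strategies |
Strategy | Cumulative return | Average annualized return | Daily std. dev. | Annualized std. dev. | Sharpe ratio |
1-year training window | | | | | |
Baseline strategy | 3055.07% | 77.76% | 2.61% | 40.80% | 1.83 |
Iterative strategy | 3250.65% | 79.55% | 2.61% | 40.79% | 1.87 |
2-year training window | | | | | |
Baseline strategy | 1151.94% | 52.38% | 2.90% | 45.30% | 1.08 |
Iterative strategy | 1128.00% | 51.89% | 2.91% | 45.42% | 1.07 |
CSI 300 | 47.47% | 6.69% | 1.53% | 23.84% | 0.14 |
Note: The table compares four strategies, defined by training window and model estimation method, with the CSI 300 index on five measures: cumulative return, average annualized return, daily standard deviation, annualized standard deviation, and Sharpe ratio. The forecast period is 2015 to 2020 throughout. The risk-free rate used for the Sharpe ratio is the daily mean yield to maturity of the actively traded 10-year government bond over 2015 to 2020 (3.2592%). |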
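The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the average annualized return is the geometric annualization of the cumulative return over the six forecast years and that the Sharpe ratio is the annualized excess return divided by the annualized standard deviation. Neither formula is stated on this page, and the function names are illustrative.

```python
def annualized_return(cumulative_return: float, years: float) -> float:
    """Geometric annualization of a cumulative return (decimals, not percent)."""
    return (1.0 + cumulative_return) ** (1.0 / years) - 1.0


def sharpe_ratio(annual_return: float, annual_std: float, risk_free: float) -> float:
    """Annualized excess return per unit of annualized volatility."""
    return (annual_return - risk_free) / annual_std


# Inputs taken from Table 1 (baseline strategy, 1-year training window).
RISK_FREE = 0.032592          # 3.2592%, from the table note
baseline_annual = annualized_return(30.5507, years=6)   # 3055.07% over 2015-2020
baseline_sharpe = sharpe_ratio(0.7776, 0.4080, RISK_FREE)

print(f"annualized return: {baseline_annual:.2%}")
print(f"Sharpe ratio: {baseline_sharpe:.2f}")
```

Under these assumptions the results land close to the 77.76% and 1.83 reported for the baseline strategy, which suggests the table uses the same conventions.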
Table 2 Comparison of strategy returns by year |
Strategy | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
Baseline strategy | 232.63% | -1.30% | 87.76% | 122.80% | 107.17% | 10.88% |
Iterative strategy | 262.91% | -1.30% | 81.15% | 114.41% | 107.34% | 16.15% |
CSI 300 | 5.58% | -11.28% | 21.78% | -25.31% | 36.07% | 27.21% |
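Table 3 Returns of the monthly updated model (2020) |
Model | Cumulative return | Daily std. dev. | Annualized std. dev. | Sharpe ratio |
Monthly update, baseline | 34.47% | 2.56% | 39.98% | 0.79 |
Monthly update, iterative | 39.16% | 2.54% | 39.60% | 0.91 |
Annual update, iterative | 16.15% | 1.85% | 28.93% | 0.46 |
CSI 300 | 27.21% | 1.43% | 22.41% | 1.08 |
Note: The monthly updated model is trained on a rolling basis using news text from the preceding 12 months; the annually updated model is trained on news text from 2019. The risk-free rate used for the Sharpe ratio here is the daily mean yield to maturity of the actively traded 10-year government bond in 2020 (2.9381%). |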
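Table 4 Strategy returns before and after transaction costs |
 | Cumulative return | Average annualized return | Daily std. dev. | Annualized std. dev. | Sharpe ratio |
Before transaction costs | 3250.65% | 79.55% | 2.61% | 40.79% | 1.87 |
After transaction costs | 944.25% | 47.84% | 2.61% | 40.80% | 1.09 |
Note: The table reports the effect of transaction costs on returns for the iterative model with a 1-year training window. |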
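Table 5 Strategy turnover and net returns: sentiment score decay method |
η1 | Average daily turnover | Cumulative return | Annualized return | Cumulative return net of transaction costs | Annualized return net of transaction costs |
1 | 77.59% | 3250.65% | 79.55% | 979.40% | 48.66% |
0.9 | 67.97% | 699.00% | 41.39% | 195.94% | 19.82% |
0.8 | 67.26% | 818.76% | 44.72% | 243.89% | 22.86% |
0.7 | 66.22% | 925.30% | 47.39% | 287.81% | 25.34% |
0.6 | 65.49% | 992.56% | 48.96% | 319.75% | 27.01% |
0.5 | 64.33% | 1095.48% | 51.21% | 367.19% | 29.30% |
0.4 | 63.05% | 1039.57% | 50.01% | 353.69% | 28.66% |
0.3 | 61.88% | 1019.95% | 49.58% | 353.59% | 28.66% |
0.2 | 60.02% | 982.68% | 48.74% | 350.58% | 28.52% |
0.1 | 58.09% | 799.73% | 44.22% | 285.12% | 25.20% |
Note: The table compares strategy returns obtained with the sentiment score decay method under different parameter values, for the iterative model with a 1-year training window. η1 is the weight on the new sentiment score; a lower value means slower sentiment decay and lower strategy turnover. |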
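One way to read the sentiment score decay method is as exponential smoothing of each stock's score. The sketch below assumes the update rule `smoothed_t = η1 * new_t + (1 - η1) * smoothed_{t-1}`, seeded with the first observed score. The page only states that η1 is the weight on the new score, so the exact recursion, the seeding, and the name `decay_scores` are assumptions made for illustration.

```python
from typing import List


def decay_scores(new_scores: List[float], eta1: float) -> List[float]:
    """Blend each new sentiment score with the previous smoothed score.

    eta1 = 1 keeps only the newest score; smaller values let old
    sentiment fade more slowly. (Assumed form, not taken from the paper.)
    """
    if not 0.0 < eta1 <= 1.0:
        raise ValueError("eta1 must lie in (0, 1]")
    smoothed: List[float] = []
    for score in new_scores:
        if not smoothed:
            smoothed.append(score)  # seed with the first observation
        else:
            smoothed.append(eta1 * score + (1.0 - eta1) * smoothed[-1])
    return smoothed


# Hypothetical daily scores for one stock (not data from the paper).
daily = [1.0, 0.0, 0.0, 1.0]
print(decay_scores(daily, eta1=1.0))   # unchanged: only the newest score counts
print(decay_scores(daily, eta1=0.5))   # old sentiment lingers
```

With `eta1=1.0` the series is returned as is, which corresponds to the first row of Table 5.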
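Table 6 Strategy turnover and net returns: portfolio weight decay method |
η2 | Average daily turnover | Cumulative return | Annualized return | Cumulative return net of transaction costs | Annualized return net of transaction costs |
1 | 77.59% | 3250.65% | 79.55% | 979.40% | 48.66% |
0.9 | 69.67% | 3274.42% | 79.76% | 1120.39% | 51.73% |
0.8 | 61.83% | 3147.77% | 78.62% | 1217.13% | 53.68% |
0.7 | 54.08% | 2898.28% | 76.26% | 1261.61% | 54.53% |
0.6 | 46.39% | 2562.70% | 72.80% | 1252.85% | 54.36% |
0.5 | 38.73% | 2173.11% | 68.31% | 1191.42% | 53.17% |
0.4 | 31.11% | 1747.39% | 62.59% | 1073.12% | 50.74% |
0.3 | 23.47% | 1289.61% | 55.05% | 886.39% | 46.45% |
0.2 | 15.80% | 808.64% | 44.45% | 621.36% | 39.00% |
0.1 | 8.03% | 358.48% | 28.89% | 307.70% | 26.39% |
Note: The table compares strategy returns obtained with the portfolio weight decay method under different parameter values, for the iterative model with a 1-year training window. η2 is the weight coefficient on the new position; a lower value means smaller changes to the historical position and lower strategy turnover. |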
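The portfolio weight decay method can be sketched as a linear blend of yesterday's holdings and today's target portfolio. The rule `w_t = η2 * target_t + (1 - η2) * w_{t-1}` and the turnover definition (sum of absolute weight changes) are assumptions for illustration; the page only describes η2 as the weight coefficient on the new position.

```python
from typing import List


def blend_weights(previous: List[float], target: List[float], eta2: float) -> List[float]:
    """Move only a fraction eta2 of the way from current to target weights."""
    return [eta2 * t + (1.0 - eta2) * p for p, t in zip(previous, target)]


def turnover(previous: List[float], new: List[float]) -> float:
    """Sum of absolute weight changes (one common convention; assumed here)."""
    return sum(abs(n - p) for p, n in zip(previous, new))


# Hypothetical four-stock example: the target portfolio rotates completely.
previous = [0.5, 0.5, 0.0, 0.0]
target = [0.0, 0.0, 0.5, 0.5]

full = turnover(previous, blend_weights(previous, target, eta2=1.0))
partial = turnover(previous, blend_weights(previous, target, eta2=0.4))
print(full, partial)
```

Under this blend the trade is `η2 * (target - previous)`, so turnover scales linearly with η2. That is in line with the roughly proportional fall in average daily turnover in Table 6 (from 77.59% at η2 = 1 to 8.03% at η2 = 0.1), although the sketch ignores weight drift caused by price changes.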
Table 7 In-sample predictive ability test results |
α | Std. Error | β | Std. Error | R² |
-1.047 | 0.320 | 2.201 | 0.651 | 0.008 |
Table 8 Out-of-sample predictive ability test results |
Initial window | P=100 | P=200 | P=300 |
R²_OS | 15.81% | 24.34% | 12.74% |
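The out-of-sample statistic in Table 8 is not defined on this page. A common definition compares the model's squared forecast errors with those of an expanding historical-mean benchmark, `R²_OS = 1 - Σ(r_t - r̂_t)² / Σ(r_t - r̄_t)²`, where the first P observations form the initial window. The sketch below implements that definition as an assumption; the function name and the toy numbers are illustrative only.

```python
from typing import Sequence


def r2_out_of_sample(actual: Sequence[float],
                     forecast: Sequence[float],
                     initial_window: int) -> float:
    """Out-of-sample R^2 against an expanding historical-mean benchmark.

    forecast[t] is the model's prediction of actual[t]; entries before
    initial_window are ignored. Positive values mean the model beats
    the historical mean.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not 0 < initial_window < len(actual):
        raise ValueError("initial_window must leave at least one test point")
    model_sse = 0.0
    bench_sse = 0.0
    for t in range(initial_window, len(actual)):
        hist_mean = sum(actual[:t]) / t          # uses data up to t-1 only
        model_sse += (actual[t] - forecast[t]) ** 2
        bench_sse += (actual[t] - hist_mean) ** 2
    return 1.0 - model_sse / bench_sse


# Toy series (not data from the paper), initial window P = 2.
returns = [1.0, 2.0, 3.0, 4.0]
perfect = r2_out_of_sample(returns, [0.0, 0.0, 3.0, 4.0], initial_window=2)
same_as_mean = r2_out_of_sample(returns, [0.0, 0.0, 1.5, 2.0], initial_window=2)
print(perfect, same_as_mean)
```

A perfect forecast gives 1, and a forecast identical to the historical mean gives 0, so the positive values in Table 8 would indicate forecasts that beat the benchmark under this definition.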
Table 9 Strategy returns and factor models |
Variable | Baseline strategy | Iterative strategy | Iterative strategy net of transaction costs | Baseline strategy | Iterative strategy | Iterative strategy net of transaction costs |
α | 0.002*** | 0.002*** | 0.002*** | 0.003*** | 0.003*** | 0.002*** |
 | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) |
MKT | 0.710*** | 0.705*** | 0.706*** | 0.697*** | 0.694*** | 0.695*** |
 | (0.040) | (0.040) | (0.040) | (0.041) | (0.041) | (0.041) |
SMB | -0.329*** | -0.345*** | -0.343*** | -0.480*** | -0.463*** | -0.463*** |
 | (0.096) | (0.096) | (0.096) | (0.137) | (0.138) | (0.138) |
HML | -0.762*** | -0.767*** | -0.764*** | -0.669*** | -0.674*** | -0.671*** |
 | (0.121) | (0.122) | (0.122) | (0.134) | (0.135) | (0.135) |
RMW | | | | -0.488** | -0.421** | -0.423** |
 | | | | (0.197) | (0.197) | (0.198) |
CMA | | | | -0.373** | -0.358* | -0.358* |
 | | | | (0.188) | (0.189) | (0.189) |
R² | 0.250 | 0.246 | 0.246 | 0.253 | 0.249 | 0.249 |
N | 1462 | 1462 | 1462 | 1462 | 1462 | 1462 |
Note: *, **, and *** denote significance at the 10%, 5%, and 1% levels, respectively. Standard errors are in parentheses. |
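Table 10 Strategy returns using news from different time windows |
News window | Annualized return |
All news (9:00 on day T-1 to 9:00 on day T) | 79.55% |
Intraday news (9:00 on day T-1 to 15:00 on day T-1) | 24.76% |
Overnight news (15:00 on day T-1 to 9:00 on day T) | 49.89% |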
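The news windows in Table 10 can be expressed as a small timestamp classifier. The sketch below maps a news timestamp to the day T whose signal it feeds and to the intraday or overnight session. It assumes window start times are inclusive and treats every calendar day as a trading day; weekends and holidays are ignored, and the function name is hypothetical.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple

OPEN_CUTOFF = time(9, 0)     # 9:00, start and end of the full window
CLOSE_CUTOFF = time(15, 0)   # 15:00, boundary between intraday and overnight


def classify_news(published: datetime) -> Tuple[date, str]:
    """Return (signal day T, session) for a news timestamp.

    Simplification: calendar days stand in for trading days.
    """
    clock = published.time()
    if clock < OPEN_CUTOFF:
        # Early-morning news on day T belongs to the overnight window ending at 9:00.
        return published.date(), "overnight"
    next_day = published.date() + timedelta(days=1)
    if clock < CLOSE_CUTOFF:
        return next_day, "intraday"   # 9:00 to 15:00 on day T-1
    return next_day, "overnight"      # 15:00 on day T-1 onwards


print(classify_news(datetime(2020, 3, 2, 10, 30)))
print(classify_news(datetime(2020, 3, 2, 20, 0)))
print(classify_news(datetime(2020, 3, 3, 8, 59)))
```

All three example timestamps feed the signal for 3 March 2020: the first as intraday news, the other two as overnight news.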