Machine Learning for Algorithmic Trading
Second Edition

Predictive models to extract signals from market and alternative data for systematic trading strategies with Python

Stefan Jansen

BIRMINGHAM - MUMBAI
Table of Contents
Preface xiii
Chapter 1: Machine Learning for Trading – From Idea to Execution 1
The rise of ML in the investment industry 2
From electronic to high-frequency trading 3
Factor investing and smart beta funds 5
Algorithmic pioneers outperform humans 7
ML and alternative data 10
Crowdsourcing trading algorithms 11
Designing and executing an ML-driven strategy 12
Sourcing and managing data 13
From alpha factor research to portfolio management 13
Strategy backtesting 15
ML for trading – strategies and use cases 15
The evolution of algorithmic strategies 15
Use cases of ML for trading 16
Summary 19
Chapter 2: Market and Fundamental Data – Sources and Techniques 21
Market data reflects its environment 22
Market microstructure – the nuts and bolts 23
How to trade – different types of orders 23
Where to trade – from exchanges to dark pools 24
Working with high-frequency data 26
How to work with Nasdaq order book data 26
Communicating trades with the FIX protocol 27
The Nasdaq TotalView-ITCH data feed 27
From ticks to bars – how to regularize market data 35
AlgoSeek minute bars – equity quote and trade data 40
API access to market data 44
Remote data access using pandas 44
yfinance – scraping data from Yahoo! Finance 46
Quantopian 48
Zipline 48
Quandl 50
Other market data providers 50
How to work with fundamental data 51
Financial statement data 51
Other fundamental data sources 56
Efficient data storage with pandas 57
Summary 58
Chapter 3: Alternative Data for Finance – Categories and Use Cases 59
The alternative data revolution 60
Sources of alternative data 62
Individuals 62
Business processes 63
Sensors 63
Criteria for evaluating alternative data 65
Quality of the signal content 65
Quality of the data 67
Technical aspects 68
The market for alternative data 69
Data providers and use cases 70
Working with alternative data 72
Scraping OpenTable data 72
Scraping and parsing earnings call transcripts 77
Summary 80
Chapter 4: Financial Feature Engineering – How to Research Alpha Factors 81
Alpha factors in practice – from data to signals  82
Building on decades of factor research 84
Momentum and sentiment – the trend is your friend 84
Value factors – hunting fundamental bargains 88
Volatility and size anomalies 90
Quality factors for quantitative investing 92
Engineering alpha factors that predict returns 94
How to engineer factors using pandas and NumPy 94
How to use TA-Lib to create technical alpha factors 99
Denoising alpha factors with the Kalman filter 100
How to preprocess your noisy signals using wavelets 104
From signals to trades – Zipline for backtests  106
How to backtest a single-factor strategy 106
Combining factors from diverse data sources 109
Separating signal from noise with Alphalens 111
Creating forward returns and factor quantiles 112
Predictive performance by factor quantiles 113
The information coefficient 115
Factor turnover 117
Alpha factor resources 118
Alternative algorithmic trading libraries 118
Summary 119
Chapter 5: Portfolio Optimization and Performance Evaluation 121
How to measure portfolio performance 122
Capturing risk-return trade-offs in a single number 122
The fundamental law of active management 124
How to manage portfolio risk and return 125
The evolution of modern portfolio management 125
Mean-variance optimization 127
Alternatives to mean-variance optimization 131
Risk parity 134
Risk factor investment 135
Hierarchical risk parity 135
Trading and managing portfolios with Zipline 136
Scheduling signal generation and trade execution 137
Implementing mean-variance portfolio optimization 138
Measuring backtest performance with pyfolio 140
Creating the returns and benchmark inputs 141
Walk-forward testing – out-of-sample returns 142
Summary 146
Chapter 6: The Machine Learning Process 147
How machine learning from data works 148
The challenge – matching the algorithm to the task 149
Supervised learning – teaching by example 149
Unsupervised learning – uncovering useful patterns 150
Reinforcement learning – learning by trial and error 152
The machine learning workflow 153
Basic walkthrough – k-nearest neighbors 154
Framing the problem – from goals to metrics 154
Collecting and preparing the data 160
Exploring, extracting, and engineering features 160
Selecting an ML algorithm 162
Design and tune the model 162
How to select a model using cross-validation 165
How to implement cross-validation in Python 166
Challenges with cross-validation in finance 168
Parameter tuning with scikit-learn and Yellowbrick 170
Summary 172
Chapter 7: Linear Models – From Risk Factors to Return Forecasts 173
From inference to prediction 174
The baseline model – multiple linear regression 175
How to formulate the model 175
How to train the model 176
The Gauss–Markov theorem 179
How to conduct statistical inference 180
How to diagnose and remedy problems 181
How to run linear regression in practice 184
OLS with statsmodels 184
Stochastic gradient descent with sklearn 186
How to build a linear factor model 187
From the CAPM to the Fama–French factor models 188
Obtaining the risk factors 189
Fama–MacBeth regression 191
Regularizing linear regression using shrinkage 194
How to hedge against overfitting 194
How ridge regression works 195
How lasso regression works 196
How to predict returns with linear regression 197
Preparing model features and forward returns 197
Linear OLS regression using statsmodels 203
Linear regression using scikit-learn 205
Ridge regression using scikit-learn 208
Lasso regression using sklearn 210
Comparing the quality of the predictive signals 212
Linear classification 212
The logistic regression model 213
How to conduct inference with statsmodels 215
Predicting price movements with logistic regression  217
Summary 219
Chapter 8: The ML4T Workflow – From Model to Strategy Backtesting 221
How to backtest an ML-driven strategy 222
Backtesting pitfalls and how to avoid them 223
Getting the data right 224
Getting the simulation right 225
Getting the statistics right 226
How a backtesting engine works 227
Vectorized versus event-driven backtesting 228
Key implementation aspects 230
backtrader – a flexible tool for local backtests 232
Key concepts of backtrader's Cerebro architecture 232
How to use backtrader in practice 235
backtrader summary and next steps 239
Zipline – scalable backtesting by Quantopian 239
Calendars and the Pipeline for robust simulations 240
Ingesting your own bundles with minute data 242
The Pipeline API – backtesting an ML signal 245
How to train a model during the backtest 250
Instead of How to use 254
Summary 254
Chapter 9: Time-Series Models for Volatility Forecasts and Statistical Arbitrage 255
Tools for diagnostics and feature extraction 256
How to decompose time-series patterns 257
Rolling window statistics and moving averages 258
How to measure autocorrelation 259
How to diagnose and achieve stationarity 260
Transforming a time series to achieve stationarity 261
Handling instead of How to handle 261
Time-series transformations in practice 263
Univariate time-series models 265
How to build autoregressive models 266
How to build moving-average models 267
How to build ARIMA models and extensions 268
How to forecast macro fundamentals 270
How to use time-series models to forecast volatility 272
Multivariate time-series models 276
Systems of equations 277
The vector autoregressive (VAR) model 277
Using the VAR model for macro forecasts 278
Cointegration – time series with a shared trend 281
The Engle-Granger two-step method 282
The Johansen likelihood-ratio test 282
Statistical arbitrage with cointegration 283
How to select and trade comoving asset pairs 283
Pairs trading in practice 285
Preparing the strategy backtest 288
Backtesting the strategy using backtrader 292
Extensions – how to do better 294
Summary 294
Chapter 10: Bayesian ML – Dynamic Sharpe Ratios and Pairs Trading 295
How Bayesian machine learning works 296
How to update assumptions from empirical evidence 297
Exact inference – maximum a posteriori estimation 298
Deterministic and stochastic approximate inference 301
Probabilistic programming with PyMC3 305
Bayesian machine learning with Theano 305
The PyMC3 workflow: predicting a recession 305
Bayesian ML for trading 317
Bayesian Sharpe ratio for performance comparison 317
Bayesian rolling regression for pairs trading 320
Stochastic volatility models 323
Summary 326
Chapter 11: Random Forests – A Long-Short Strategy for Japanese Stocks 327
Decision trees – learning rules from data 328
How trees learn and apply decision rules 328
Decision trees in practice 330
Overfitting and regularization 336
Hyperparameter tuning 338
Random forests – making trees more reliable 345
Why ensemble models perform better 345
Bootstrap aggregation 346
How to build a random forest 349
How to train and tune a random forest 350
Feature importance for random forests 352
Out-of-bag testing 352
Pros and cons of random forests 353
Long-short signals for Japanese stocks 353
The data – Japanese equities 354
The ML4T workflow with LightGBM 355
The strategy – backtest with Zipline 362
Summary 364
Chapter 12: Boosting Your Trading Strategy 365
Getting started – adaptive boosting 366
The AdaBoost algorithm 367
Using AdaBoost to predict monthly price moves 368
Gradient boosting – ensembles for most tasks 370
How to train and tune GBM models 372
How to use gradient boosting with sklearn 374
Using XGBoost, LightGBM, and CatBoost 378
How algorithmic innovations boost performance 379
A long-short trading strategy with boosting 383
Generating signals with LightGBM and CatBoost 383
Inside the black box – interpreting GBM results 391
Backtesting a strategy based on a boosting ensemble 399
Lessons learned and next steps 401
Boosting for an intraday strategy 402
Engineering features for high-frequency data 402
Minute-frequency signals with LightGBM 404
Evaluating the trading signal quality 405
Summary 406
Chapter 13: Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning 407
Dimensionality reduction 408
The curse of dimensionality 409
Linear dimensionality reduction 411
Manifold learning – nonlinear dimensionality reduction 418
PCA for trading 421
Data-driven risk factors 421
Eigenportfolios 424
Clustering 426
k-means clustering 427
Hierarchical clustering 429
Density-based clustering 431
Gaussian mixture models 432
Hierarchical clustering for optimal portfolios 433
How hierarchical risk parity works 433
Backtesting HRP using an ML trading strategy 435
Summary 438
Chapter 14: Text Data for Trading – Sentiment Analysis 439
ML with text data – from language to features 440
Key challenges of working with text data 440
The NLP workflow 441
Applications 443
From text to tokens – the NLP pipeline 443
NLP pipeline with spaCy and textacy 444
NLP with TextBlob 448
Counting tokens – the document-term matrix 449
The bag-of-words model 450
Document-term matrix with scikit-learn 451
Key lessons 455
NLP for trading 455
The naive Bayes classifier 456
Classifying news articles 457
Sentiment analysis with Twitter and Yelp data 458
Summary 462
Chapter 15: Topic Modeling – Summarizing Financial News 463
Learning latent topics – Goals and approaches 464
Latent semantic indexing 465
How to implement LSI using sklearn 466
Strengths and limitations 468
Probabilistic latent semantic analysis 469
How to implement pLSA using sklearn 470
Strengths and limitations 471
Latent Dirichlet allocation 471
How LDA works 471
How to evaluate LDA topics 473
How to implement LDA using sklearn 475
How to visualize LDA results using pyLDAvis 475
How to implement LDA using Gensim 476
Modeling topics discussed in earnings calls 478
Data preprocessing 478
Model training and evaluation 479
Running experiments 480
Topic modeling for financial news 481
Summary 482
Chapter 16: Word Embeddings for Earnings Calls and SEC Filings 483
How word embeddings encode semantics 484
How neural language models learn usage in context 485
word2vec – scalable word and phrase embeddings 485
Evaluating embeddings using semantic arithmetic  487
How to use pretrained word vectors 489
GloVe – Global vectors for word representation 489
Custom embeddings for financial news 491
Preprocessing – sentence detection and n-grams 492
The skip-gram architecture in TensorFlow 2 493
Visualizing embeddings using TensorBoard 496
How to train embeddings faster with Gensim 497
word2vec for trading with SEC filings 499
Preprocessing – sentence detection and n-grams 500
Model training 501
Sentiment analysis using doc2vec embeddings 503
Creating doc2vec input from Yelp sentiment data 503
Training a doc2vec model 504
Training a classifier with document vectors 505
Lessons learned and next steps 507
New frontiers – pretrained transformer models 507
Attention is all you need  508
BERT – towards a more universal language model 509
Trading on text data – lessons learned and next steps 511
Summary 511
Chapter 17: Deep Learning for Trading 513
Deep learning – what's new and why it matters 514
Hierarchical features tame high-dimensional data 515
DL as representation learning 516
How DL relates to ML and AI 517
Designing an NN 518
A simple feedforward neural network architecture 519
Key design choices 520
How to regularize deep NNs 522
Training faster – optimizations for deep learning 523
Summary – how to tune key hyperparameters 525
A neural network from scratch in Python 526
The input layer 526
The hidden layer 527
The output layer 528
Forward propagation 529
The cross-entropy cost function 529
How to implement backprop using Python 529
Popular deep learning libraries 534
Leveraging GPU acceleration 534
How to use TensorFlow 2 535
How to use TensorBoard 537
How to use PyTorch 1.4 538
Alternative options 541
Optimizing an NN for a long-short strategy 542
Engineering features to predict daily stock returns 542
Defining an NN architecture framework 542
Cross-validating design options to tune the NN 543
Evaluating the predictive performance 545
Backtesting a strategy based on ensembled signals 547
How to further improve the results 549
Summary 549
Chapter 18: CNNs for Financial Time Series and Satellite Images 551
How CNNs learn to model grid-like data 552
From hand-coding to learning filters from data 553
How the elements of a convolutional layer operate 554
The evolution of CNN architectures: key innovations 558
CNNs for satellite images and object detection 559
LeNet5 – The first CNN with industrial applications 560
AlexNet – reigniting deep learning research 563
Transfer learning – faster training with less data 565
Object detection and segmentation 573
Object detection in practice 573
CNNs for time-series data – predicting returns 577
An autoregressive CNN with 1D convolutions 577
CNN-TA – clustering time series in 2D format 581
Summary 589
Chapter 19: RNNs for Multivariate Time Series and Sentiment Analysis 591
How recurrent neural nets work 592
Unfolding a computational graph with cycles 594
Backpropagation through time 594
Alternative RNN architectures 595
How to design deep RNNs 596
The challenge of learning long-range dependencies 597
Gated recurrent units 599
RNNs for time series with TensorFlow 2 599
Univariate regression – predicting the S&P 500 600
How to get time series data into shape for an RNN 600
Stacked LSTM – predicting price moves and returns 605
Multivariate time-series regression for macro data 611
RNNs for text data  614
LSTM with embeddings for sentiment classification 614
Sentiment analysis with pretrained word vectors 617
Predicting returns from SEC filing embeddings 619
Summary 624
Chapter 20: Autoencoders for Conditional Risk Factors and Asset Pricing 625
Autoencoders for nonlinear feature extraction 626
Generalizing linear dimensionality reduction 626
Convolutional autoencoders for image compression 627
Managing overfitting with regularized autoencoders 628
Fixing corrupted data with denoising autoencoders 628
Seq2seq autoencoders for time series features 629
Generative modeling with variational autoencoders  629
Implementing autoencoders with TensorFlow 2 630
How to prepare the data 630
One-layer feedforward autoencoder 631
Feedforward autoencoder with sparsity constraints 634
Deep feedforward autoencoder 634
Convolutional autoencoders 636
Denoising autoencoders 637
A conditional autoencoder for trading 638
Sourcing stock prices and metadata information 639
Computing predictive asset characteristics 641
Creating the conditional autoencoder architecture 643
Lessons learned and next steps 648
Summary 648
Chapter 21: Generative Adversarial Networks for Synthetic Time-Series Data 649
Creating synthetic data with GANs 650
Comparing generative and discriminative models 651
Adversarial training – a zero-sum game of trickery 651
The rapid evolution of the GAN architecture zoo 652
GAN applications to images and time-series data 653
How to build a GAN using TensorFlow 2 655
Building the generator network 655
Creating the discriminator network 656
Setting up the adversarial training process 657
Evaluating the results 660
TimeGAN for synthetic financial data 660
Learning to generate data across features and time 661
Implementing TimeGAN using TensorFlow 2 663
Evaluating the quality of synthetic time-series data 672
Lessons learned and next steps 678
Summary 678
Chapter 22: Deep Reinforcement Learning – Building a Trading Agent 679
Elements of a reinforcement learning system 680
The policy – translating states into actions 681
Rewards – learning from actions 681
The value function – optimal choice for the long run 682
With or without a model – look before you leap? 682
How to solve reinforcement learning problems 682
Key challenges in solving RL problems 683
Fundamental approaches to solving RL problems 683
Solving dynamic programming problems 684
Finite Markov decision problems 684
Policy iteration 687
Value iteration 688
Generalized policy iteration 688
Dynamic programming in Python 689
Q-learning – finding an optimal policy on the go 694
Exploration versus exploitation – ε-greedy policy 695
The Q-learning algorithm 695
How to train a Q-learning agent using Python 695
Deep RL for trading with the OpenAI Gym 696
Value function approximation with neural networks 697
The Deep Q-learning algorithm and extensions 697
Introducing the OpenAI Gym 699
How to implement DDQN using TensorFlow 2 700
Creating a simple trading agent 704
How to design a custom OpenAI trading environment 705
Deep Q-learning on the stock market 709
Lessons learned 711
Summary 711
Chapter 23: Conclusions and Next Steps 713
Key takeaways and lessons learned 714
Data is the single most important ingredient 715
Domain expertise – telling the signal from the noise 716
ML is a toolkit for solving problems with data 717
Beware of backtest overfitting 719
How to gain insights from black-box models 719
ML for trading in practice 720
Data management technologies 720
ML tools 722
Online trading platforms 722
Conclusion 723
Appendix: Alpha Factor Library 725
Common alpha factors implemented in TA-Lib 726
A key building block – moving averages 726
Overlap studies – price and volatility trends 729
Momentum indicators 733
Volume and liquidity indicators 741
Volatility indicators 743
Fundamental risk factors 744
WorldQuant's quest for formulaic alphas 745
Cross-sectional and time-series functions 745
Formulaic alpha expressions 747
Bivariate and multivariate factor evaluation 749
Information coefficient and mutual information 749
Feature importance and SHAP values 750
Comparison – the top 25 features for each metric 750
Financial performance – Alphalens 752
References 753
Index 769