Python for Finance, Part I: Yahoo Finance API, pandas, and matplotlib
Less than a decade ago, financial instruments called derivatives were at the height of popularity. Financial institutions around the world were trading billions of dollars of these instruments on a daily basis, and quantitative analysts were modeling them using stochastic calculus and the all mighty C++.
Fast forward nine years later and things have changed. The financial crisis has proven to be an as-to-yet derivatives-nemesis. Volumes have gone down and demand for C++ modeling has withered with them. But there is a new player in town… Python!
Python has been gaining significant traction in the financial industry over the last years and with good reason. In this series of tutorials we are going to see how one can leverage the powerful functionality provided by a number of Python packages to develop and backtest a quantitative trading strategy.
In detail, in the first of our tutorials, we are going to show how one can easily use Python to download financial data from free online databases, manipulate the downloaded data and then create some basic technical indicators which will then be used as the basis of our quantitative strategy. To accomplish that, we are going to use one of the most powerful and widely used Python packages for data manipulation, pandas.
Getting the Data
Pandas is included in the more popular distributions of Python for Windows, such as Anaconda. In case it's not included in your Python distribution, use pip:
pip install pandas. Once installed, to use pandas, all one needs to do is import it. We will also need the
pandas_datareader package (
pip install pandas-datareader), as well as
pip install matplotlib)for visualizing our results.
from pandas_datareader import data import matplotlib.pyplot as plt import pandas as pd
Having imported the appropriate tools, getting market data from a free online source, such as Yahoo Finance, is super easy. Since pandas has a simple remote data access for the Yahoo Finance API data, this is as simple as:
# Define the instruments to download. We would like to see Apple, Microsoft and the S&P500 index. tickers = ['AAPL', 'MSFT', '^GSPC'] # Define which online source one should use data_source = 'yahoo' # We would like all available data from 01/01/2000 until 12/31/2016. start_date = '2000-01-01' end_date = '2016-12-31' # User pandas_reader.data.DataReader to load the desired data. As simple as that. panel_data = data.DataReader(tickers, data_source, start_date, end_date)
panel_data look like?
data.DataReader returns a
Panel object, which can be thought of as a 3D matrix. The first dimension consists of the various fields Yahoo Finance returns for a given instrument, namely, the Open, High, Low, Close and Adj Close prices for each date. The second dimension contain the dates. The third one contains the instrument identifiers.
Let's see what
panel_data actually is:
>>> panel_data <class 'pandas.core.panel.Panel'> Dimensions: 6 (items) x 4277 (major_axis) x 3 (minor_axis) Items axis: Open to Adj Close Major_axis axis: 2000-01-03 00:00:00 to 2016-12-30 00:00:00 Minor_axis axis: AAPL to ^GSPC
Preparing the Data
Let us assume we are interested in working with the Adj Close prices which have been already been adjusted by Yahoo finance to account for corporate actions such as dividends and stock splits. We want to make sure that all weekdays are included in our dataset, which is very often desirable for quantitative trading strategies. Of course, some of the weekdays might be public holidays in which case no price will be available. For this reason, we will fill the missing prices with the latest available prices. All this is, again, all too easy with pandas:
# Getting just the adjusted closing prices. This will return a Pandas DataFrame # The index in this DataFrame is the major index of the panel_data. adj_close = panel_data.ix['Adj Close'] # Getting all weekdays between 01/01/2000 and 12/31/2016 all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B') # How do we align the existing prices in adj_close with our new set of dates? # All we need to do is reindex adj_close using all_weekdays as the new index adj_close = adj_close.reindex(all_weekdays) # Reindexing will insert missing values (NaN) for the dates that were not present # in the original set. To cope with this, we can fill the missing by replacing them # with the latest available price for each instrument. adj_close = adj_close.fillna(method='ffill')
adj_close contains all the adjusted closing prices for all instruments and all the dates that Yahoo returned. Some of the week days might be missing from the data Yahoo provides. For this reason we create a Series of all the weekdays between the first and last date of interest and store them in the all_weekdays variable. Getting all the weekdays is achieved by passing the
freq=’B’ named parameter to the
pd.date_range() function. This function return a
DatetimeIndex which is shown below:
>>> all_weekdays DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12', '2000-01-13', '2000-01-14', ... '2016-12-19', '2016-12-20', '2016-12-21', '2016-12-22', '2016-12-23', '2016-12-26', '2016-12-27', '2016-12-28', '2016-12-29', '2016-12-30'], dtype='datetime64[ns]', length=4435, freq='B')
Aligning the original DataFrame with the new DatetimeIndex is accomplished by substitution of the initial DatetimeIndex of the
adj_close DataFrame. If any of the new dates were not included in the original DatetimeIndex, the prices for that date will be filled with NaNs. For this reason, we will fill any resulting NaNs with the last available price. The final, clean DataFrame is shown below:
Looking at the Data
Our dataset is now complete and free of missing values. We can see a summary of the values in each of the instrument by calling the
describe() method of a Pandas DataFrame:
Suppose we would like to plot the MSFT time-series. We would also like to see how the stock behaves compared to a short and longer term moving average of its price. A simple moving average of the original time-series is calculated by taking for each date the average of the last W prices (including the price on the date of interest). pandas has
rolling(), a built in function for Series which returns a rolling object for a user-defined window, e.g. 20 days. Once a rolling object has been obtained, a number of functions can be applied on it, such as
std() (to calculate the standard deviation of the values in the window) or
mean(). See below:
# Get the MSFT time series. This now returns a Pandas Series object indexed by date. msft = adj_close.ix[:, 'MSFT'] # Calculate the 20 and 100 days moving averages of the closing prices short_rolling_msft = msft.rolling(window=20).mean() long_rolling_msft = msft.rolling(window=100).mean() # Plot everything by leveraging the very powerful matplotlib package fig = plt.figure() ax = fig.add_subplot(1,1,1) ax.plot(msft.index, msft, label='MSFT') ax.plot(short_rolling_msft.index, short_rolling_msft, label='20 days rolling') ax.plot(long_rolling_msft.index, long_rolling_msft, label='100 days rolling') ax.set_xlabel('Date') ax.set_ylabel('Adjusted closing price ($)') ax.legend()
Now, finally the stock price history together with the two moving averages plotted:
All of this has been but a small preview of the way a quantitative analyst can leverage the power of Python and pandas to analyze scores of financial data. In part 2 of this series on Python and financial quantitative analysis, we are going to show how to use the two technical indicators already created to create a simple yet realistic trading strategy.