Python for Finance, Part I: Yahoo Finance API, pandas, and matplotlib

Python for Finance, Part I: Yahoo Finance API, pandas, and matplotlib

Motivation

Less than a decade ago, financial instruments called derivatives were at the height of popularity. Financial institutions around the world were trading billions of dollars of these instruments on a daily basis, and quantitative analysts were modeling them using stochastic calculus and the all mighty C++.

Fast forward nine years later and things have changed. The financial crisis has proven to be an as-to-yet derivatives-nemesis. Volumes have gone down and demand for C++ modeling has withered with them. But there is a new player in town… Python!

Python has been gaining significant traction in the financial industry over the last years and with good reason. In this series of tutorials we are going to see how one can leverage the powerful functionality provided by a number of Python packages to develop and backtest a quantitative trading strategy.

In detail, in the first of our tutorials, we are going to show how one can easily use Python to download financial data from free online databases, manipulate the downloaded data and then create some basic technical indicators which will then be used as the basis of our quantitative strategy. To accomplish that, we are going to use one of the most powerful and widely used Python packages for data manipulation, pandas.

Getting the Data

Pandas is included in the more popular distributions of Python for Windows, such as Anaconda. In case it's not included in your Python distribution, use pip: pip install pandas. Once installed, to use pandas, all one needs to do is import it. We will also need the pandas_datareader package (pip install pandas-datareader), as well as matplotlib (pip install matplotlib)for visualizing our results.

from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd

Having imported the appropriate tools, getting market data from a free online source, such as Yahoo Finance, is super easy. Since pandas has a simple remote data access for the Yahoo Finance API data, this is as simple as:

# Define the instruments to download. We would like to see Apple, Microsoft and the S&P500 index.
tickers = ['AAPL', 'MSFT', '^GSPC']

# Define which online source one should use
data_source = 'yahoo'

# We would like all available data from 01/01/2000 until 12/31/2016.
start_date = '2000-01-01'
end_date = '2016-12-31'

# User pandas_reader.data.DataReader to load the desired data. As simple as that.
panel_data = data.DataReader(tickers, data_source, start_date, end_date)


What does panel_data look like? data.DataReader returns a Panel object, which can be thought of as a 3D matrix. The first dimension consists of the various fields Yahoo Finance returns for a given instrument, namely, the Open, High, Low, Close and Adj Close prices for each date. The second dimension contain the dates. The third one contains the instrument identifiers.

Let's see what panel_data actually is:

>>> panel_data
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 4277 (major_axis) x 3 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2000-01-03 00:00:00 to 2016-12-30 00:00:00
Minor_axis axis: AAPL to ^GSPC

Preparing the Data

Let us assume we are interested in working with the Adj Close prices which have been already been adjusted by Yahoo finance to account for corporate actions such as dividends and stock splits. We want to make sure that all weekdays are included in our dataset, which is very often desirable for quantitative trading strategies. Of course, some of the weekdays might be public holidays in which case no price will be available. For this reason, we will fill the missing prices with the latest available prices. All this is, again, all too easy with pandas:

# Getting just the adjusted closing prices. This will return a Pandas DataFrame
# The index in this DataFrame is the major index of the panel_data.
adj_close = panel_data.ix['Adj Close']

# Getting all weekdays between 01/01/2000 and 12/31/2016
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in adj_close with our new set of dates?
# All we need to do is reindex adj_close using all_weekdays as the new index
adj_close = adj_close.reindex(all_weekdays)

# Reindexing will insert missing values (NaN) for the dates that were not present
# in the original set. To cope with this, we can fill the missing by replacing them
# with the latest available price for each instrument.
adj_close = adj_close.fillna(method='ffill')

Initially, adj_close contains all the adjusted closing prices for all instruments and all the dates that Yahoo returned. Some of the week days might be missing from the data Yahoo provides. For this reason we create a Series of all the weekdays between the first and last date of interest and store them in the all_weekdays variable. Getting all the weekdays is achieved by passing the freq=’B’ named parameter to the pd.date_range() function. This function return a DatetimeIndex which is shown below:

>>> all_weekdays
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14',
               ...
               '2016-12-19', '2016-12-20', '2016-12-21', '2016-12-22',
               '2016-12-23', '2016-12-26', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', length=4435, freq='B')

Aligning the original DataFrame with the new DatetimeIndex is accomplished by substitution of the initial DatetimeIndex of the adj_close DataFrame. If any of the new dates were not included in the original DatetimeIndex, the prices for that date will be filled with NaNs. For this reason, we will fill any resulting NaNs with the last available price. The final, clean DataFrame is shown below:

adj_close.head(7)
Out:

AAPLMSFT^GSPC
2000-01-033.62564339.3346301455.219971
2000-01-043.31996438.0059001399.420044
2000-01-053.36854838.4066281402.109985
2000-01-063.07703937.1200801403.449951
2000-01-073.22279437.6051721441.469971
2000-01-103.16611237.8793541457.599976
2000-01-113.00416236.9091701438.560059

Looking at the Data

Our dataset is now complete and free of missing values. We can see a summary of the values in each of the instrument by calling the describe() method of a Pandas DataFrame:

>>> adj_close.describe()
Out:

AAPLMSFT^GSPC
count4435.0000004435.0000004435.000000
mean35.47530726.1509081375.881598
std38.23915410.452976361.308452
min0.84991112.306545676.530029
25%3.24101419.3845501126.334961
50%17.63176022.4943731294.869995
75%65.05444327.4093961507.724976
max127.96609163.2366272271.719971

Suppose we would like to plot the MSFT time-series. We would also like to see how the stock behaves compared to a short and longer term moving average of its price. A simple moving average of the original time-series is calculated by taking for each date the average of the last W prices (including the price on the date of interest). pandas has rolling(), a built in function for Series which returns a rolling object for a user-defined window, e.g. 20 days. Once a rolling object has been obtained, a number of functions can be applied on it, such as sum(), std() (to calculate the standard deviation of the values in the window) or mean(). See below:

# Get the MSFT time series. This now returns a Pandas Series object indexed by date.
msft = adj_close.ix[:, 'MSFT']
# Calculate the 20 and 100 days moving averages of the closing prices
short_rolling_msft = msft.rolling(window=20).mean()
long_rolling_msft = msft.rolling(window=100).mean()

# Plot everything by leveraging the very powerful matplotlib package
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(msft.index, msft, label='MSFT')
ax.plot(short_rolling_msft.index, short_rolling_msft, label='20 days rolling')
ax.plot(long_rolling_msft.index, long_rolling_msft, label='100 days rolling')
ax.set_xlabel('Date')
ax.set_ylabel('Adjusted closing price ($)')
ax.legend()

Now, finally the stock price history together with the two moving averages plotted:

>>> plt.show()

What's Next

All of this has been but a small preview of the way a quantitative analyst can leverage the power of Python and pandas to analyze scores of financial data. In part 2 of this series on Python and financial quantitative analysis, we are going to show how to use the two technical indicators already created to create a simple yet realistic trading strategy.

Georgios Efstathopoulos

Georgios Efstathopoulos

Georgios has 7+ years of experience as a quantitative analyst in the financial sector, and has worked extensively on statistical and machine learning models for quantitative trading, market and credit risk management and behavioural modelling. Georgios has PhD in Applied Mathematics and Statistics at Imperial College London, and is the founder and CEO of QuAnalytics Limited, a consultancy focusing on quantitative and data analytics solutions for individuals and organisation who wish to harvest the potential of their own data to grow their business.

Leave a Reply

Be the First to Comment!

Notify of
avatar
wpDiscuz

Send this to a friend