Project: Time Series - Forecasting Stock Prices
Marks: 30
Welcome to the project on Time Series. We will use the Amazon Stock Prices dataset for this project.
Stocks are one of the most popular financial instruments invented for building wealth and are the centerpiece of any investment portfolio. Recent advances in trading technology have opened up stock markets in such a way that nowadays, nearly anybody can own stock.
In the last few decades, there has been an explosive increase in the average person's interest in the stock market. This makes stock value prediction an interesting and popular problem to explore.
Amazon.com, Inc. engages in the retail sale of consumer products and subscriptions in North America as well as internationally. This dataset consists of monthly average stock closing prices of Amazon over a period of 12 years from 2006 to 2017. We have to build a time series model using the AR, MA, ARMA and ARIMA models in order to forecast the stock closing price of Amazon.
Please note that we are downgrading the statsmodels library to version 0.12.1, because the latest version of the library might not give us the desired results. Run the code below to downgrade the library and avoid any issues in the output. Once the code runs successfully, restart either the kernel or the Jupyter Notebook before importing the statsmodels library. It is enough to run the install statsmodels cell once. To be sure you are using the correct version of the library, run the code in the Version check cell.
!pip install statsmodels==0.12.1
# Version check
import statsmodels
statsmodels.__version__
# Importing libraries for data manipulation
import pandas as pd
import numpy as np
#Importing libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
#Importing library for date manipulation
from datetime import datetime
#To calculate the MSE or RMSE
from sklearn.metrics import mean_squared_error
#Importing acf and pacf functions
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
#Importing models from statsmodels library
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
#To ignore the warnings
import warnings
warnings.filterwarnings('ignore')
#If you are having an issue while loading the excel file in pandas, please run the below command in anaconda prompt, otherwise ignore.
#conda install -c anaconda xlrd
df = pd.read_excel('amazon_stocks_prices.xlsx')
df.head()
        date  close
0 2006-01-01  45.22
1 2006-02-01  38.82
2 2006-03-01  36.38
3 2006-04-01  36.32
4 2006-05-01  34.13
#Write your code here
Observations:_____________
# Setting date as the index
df = df.set_index(['date'])
df.head()
            close
date
2006-01-01  45.22
2006-02-01  38.82
2006-03-01  36.38
2006-04-01  36.32
2006-05-01  34.13
Now, let's visualize the time series to get an idea about the trend and/or seasonality within the data.
# Visualizing the time series
plt.figure(figsize=(16,8))
plt.xlabel("Month")
plt.ylabel("Closing Prices")
plt.title('Amazon Stock Prices')
plt.plot(df.index, df.close, color = 'c', marker='.')
plt.show()
Observations:
Let us first split the dataset into train and test data
# Splitting the data into train and test
df_train = df.loc['2006-01-01':'2015-12-01']
df_test = df.loc['2016-01-01' : '2017-12-01']
print(df_train)
print(df_test)
close
date
2006-01-01 45.22
2006-02-01 38.82
2006-03-01 36.38
2006-04-01 36.32
2006-05-01 34.13
... ...
2015-08-01 518.46
2015-09-01 520.96
2015-10-01 566.74
2015-11-01 657.70
2015-12-01 669.26
[120 rows x 1 columns]
close
date
2016-01-01 601.06
2016-02-01 530.62
2016-03-01 572.37
2016-04-01 613.59
2016-05-01 697.47
2016-06-01 716.39
2016-07-01 741.47
2016-08-01 764.84
2016-09-01 788.97
2016-10-01 824.44
2016-11-01 763.34
2016-12-01 763.33
2017-01-01 807.51
2017-02-01 835.75
2017-03-01 854.24
2017-04-01 903.39
2017-05-01 961.72
2017-06-01 990.44
2017-07-01 1008.48
2017-08-01 971.44
2017-09-01 968.99
2017-10-01 1000.72
2017-11-01 1139.81
2017-12-01 1168.84
Now let us check the rolling mean and standard deviation of the series to visualize if the series has any trend or seasonality.
# Calculating the rolling mean and standard deviation for a window of 12 observations
rolmean=df_train.rolling(window=12).mean()
rolstd=df_train.rolling(window=12).std()
#Visualizing the rolling mean and standard deviation
plt.figure(figsize=(16,8))
actual = plt.plot(df_train, color='c', label='Actual Series')
rollingmean = plt.plot(rolmean, color='red', label='Rolling Mean')
rollingstd = plt.plot(rolstd, color='green', label='Rolling Std. Dev.')
plt.title('Rolling Mean & Standard Deviation of the Series')
plt.legend()
plt.show()
Observations:
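We can also use the Augmented Dickey-Fuller (ADF) Test to verify if the series is stationary or not. The null and alternate hypotheses for the ADF Test are defined as:
Null hypothesis: The time series is non-stationary (it has a unit root).
Alternate hypothesis: The time series is stationary.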
#Define a function to use adfuller test
def adfuller(df_train):
    #Importing adfuller using statsmodels
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(df_train['close'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Observations'])
    for key,value in adftest[4].items():
        adfoutput['Critical Value (%s)'%key] = value
    print(adfoutput)
adfuller(df_train)
Dickey-Fuller Test:
Test Statistic 3.464016
p-value 1.000000
Lags Used 0.000000
No. of Observations 119.000000
Critical Value (1%) -3.486535
Critical Value (5%) -2.886151
Critical Value (10%) -2.579896
dtype: float64
Observations:
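As a rough guide to reading this output, the usual decision rule can be sketched in a few lines. The helper name `is_stationary` and the 0.05 significance level are assumptions made for illustration; the numbers are copied from the output printed above.

```python
# Values copied from the Dickey-Fuller output printed above for the training series
test_statistic = 3.464016
p_value = 1.000000
critical_values = {"1%": -3.486535, "5%": -2.886151, "10%": -2.579896}


def is_stationary(stat, p, crit, alpha=0.05):
    """Hypothetical helper: reject the unit-root null hypothesis only when the
    p-value is below alpha and the test statistic is more negative than the
    5% critical value."""
    return p < alpha and stat < crit["5%"]


# The training series fails both checks, so the null hypothesis is not rejected
print(is_stationary(test_statistic, p_value, critical_values))
```

Under this rule the raw training series is treated as non-stationary, which is why the transformations that follow are needed.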
We can use some of the following methods to convert a non-stationary series into a stationary one:
Log transformation
Differencing the series (using a lagged series)
Let's first use a log transformation over this series to remove exponential variance and check the stationarity of the series again.
# Visualize the rolling mean and standard deviation after using log transformation
plt.figure(figsize=(16,8))
df_log = np.log(df_train)
MAvg = df_log.rolling(window=12).mean()
MStd = df_log.rolling(window=12).std()
plt.plot(df_log)
plt.plot(MAvg, color='r', label = 'Moving Average')
plt.plot(MStd, color='g', label = 'Standard Deviation')
plt.legend()
plt.show()
Observations:
Let's shift the series by order 1 (or by 1 month) & apply differencing (using lagged series) and then check the rolling mean and standard deviation.
plt.figure(figsize=(16,8))
df_shift = df_log - df_log.shift(periods = 1)
MAvg_shift = ______________________________
MStd_shift = ______________________________
plt.plot(_________, color='c')
plt.plot(__________, color='red', label = 'Moving Average')
plt.plot(__________, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
#Dropping the null values that we get after applying differencing method
df_shift = df_shift.dropna()
Observations:___________________
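One possible way to complete the differencing cell is sketched below. It runs on a synthetic monthly price series (an assumption made so the example is self-contained, not the Amazon data) and reuses the variable names from the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train: 36 monthly prices with steady growth (not the Amazon data)
rng = np.random.default_rng(0)
dates = pd.date_range("2006-01-01", periods=36, freq="MS")
prices = 40 * np.exp(np.cumsum(rng.normal(0.02, 0.05, size=36)))
df_train = pd.DataFrame({"close": prices}, index=dates)

df_log = np.log(df_train)
# First difference of the log series: log(p_t) - log(p_{t-1})
df_shift = df_log - df_log.shift(periods=1)

# 12-month rolling mean and standard deviation of the differenced series
MAvg_shift = df_shift.rolling(window=12).mean()
MStd_shift = df_shift.rolling(window=12).std()

# The first row has no predecessor, so it is NaN and gets dropped
df_shift = df_shift.dropna()
print(df_shift.head())
```

In the notebook, `df_shift`, `MAvg_shift` and `MStd_shift` would then be passed to the three `plt.plot` calls in that order.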
Let us use the adfuller test to check the stationarity.
#____________________ # call the adfuller function for df_shift series
Observations:
Let's decompose the time series to check its different components.
#Importing the seasonal_decompose function to decompose the time series
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(df_train)
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
plt.figure(figsize=(15,10))
plt.subplot(411)
plt.plot(df_train, label='Actual', marker='.')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend', marker='.')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality', marker='.')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residuals', marker='.')
plt.legend(loc='upper left')
plt.tight_layout()
Observations:
Now let's move on to the model building section. First, we will plot the ACF and PACF plots to get the values of p and q i.e. order of AR and MA models to be used.
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
fig, axes = plt.subplots(2, 1, figsize = (16,8))
plot_acf(df_shift, lags = 12, ax = axes[0])
plot_pacf(df_shift, lags = 12, ax = axes[1])
plt.show()
Observations:
#Importing AutoReg function to apply AR model
from statsmodels.tsa.ar_model import AutoReg
plt.figure(figsize=(16,8))
model_AR = _______________ #Use number of lags as 1 and apply AutoReg function on df_shift series
results_AR = ________________ #fit the model
plt.plot(df_shift)
predict = ______________________________ #predict the series
predict = predict.fillna(0) #Converting NaN values to 0
plt.plot(___________, color='red')
plt.title('AR Model - RMSE: %.4f'% mean_squared_error(predict,df_shift['close'], squared=False)) #Calculating rmse
plt.show()
Observations:________________________
Let's check the AIC value of the model
results_AR.aic
-4.781419615400342
Now, let's build MA, ARMA, and ARIMA models as well, and see if we can get a better model
We will be using an ARIMA model with p=0 and d=0 so that it will work as an MA model
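#Legacy ARIMA class, still available in the pinned statsmodels 0.12.1; this import replaces the ARIMA imported earlier from statsmodels.tsa.arima.model
from statsmodels.tsa.arima_model import ARIMA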
plt.figure(figsize=(16,8))
model_MA = _________________ #Using p=0, d=0, q=1 and apply ARIMA function on df_shift series
results_MA = _____________ #fit the model
plt.plot(________)
plt.plot(_________________, color='red')
plt.title('MA Model - RMSE: %.4f'% mean_squared_error(results_MA.fittedvalues,df_shift['close'], squared=False))
plt.show()
Observations:________________________
Let's check the AIC value of the model
results_MA.aic
-229.09493835479742
We will be using an ARIMA model with p=1 and q=1 (as observed from the ACF and PACF plots) and d=0 so that it will work as an ARMA model.
plt.figure(figsize=(16,8))
model_ARMA = _______________ #Using p=1, d=0, q=1 and apply ARIMA function on df_shift series
results_ARMA = _______________ #fit the model
plt.plot(_______)
plt.plot(___________, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues,df_shift['close'], squared=False))
plt.show()
Observations:
Let's check the AIC value of the model
results_ARMA.aic
-227.11129132564088
Let us try using the ARIMA Model.
We will be using an ARIMA Model with p=1, d=1, & q=1.
from statsmodels.tsa.arima_model import ARIMA
plt.figure(figsize=(16,8))
model_ARIMA = ________________ #Using p=1, d=1, q=1 and apply ARIMA function on df_log series
results_ARIMA = ___________________ #fit the model
plt.plot(___________)
plt.plot(________________, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues,df_shift['close'], squared=False))
plt.show()
Observations:________
Let's check the AIC value of the model
results_ARIMA.aic
-227.11129236959937
We can see that all the models return almost the same RMSE. The AIC values are also close across the models, except for the AR model.
We will use the ARIMA model for prediction, as it combines autoregressive terms, moving-average terms, and differencing, which the AR, MA, and ARMA models cover only in part.
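To make the comparison concrete, the AIC values printed in the cells above can be ranked directly (lower is better). The dictionary below only restates those printed values. The AR figure comes from a different estimator class (`AutoReg`), so it may not be on the same scale as the other three.

```python
# AIC values as reported in the cell outputs above
aic_by_model = {
    "AR": -4.781419615400342,
    "MA": -229.09493835479742,
    "ARMA": -227.11129132564088,
    "ARIMA": -227.11129236959937,
}

# Sort from lowest (best) to highest AIC
ranked = sorted(aic_by_model.items(), key=lambda kv: kv[1])
for name, aic in ranked:
    print(f"{name:6s} AIC = {aic:9.2f}")

best_model = ranked[0][0]
```

The MA, ARMA and ARIMA values differ by only about 2 AIC points, so this ranking alone gives little reason to prefer one of them over the others.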
# Printing the fitted values
predictions=pd.Series(results_ARIMA.fittedvalues)
predictions
date
2006-02-01 0.022235
2006-03-01 -0.019667
2006-04-01 0.009184
2006-05-01 0.018985
2006-06-01 0.001615
...
2015-08-01 0.043234
2015-09-01 0.032286
2015-10-01 0.015696
2015-11-01 0.039276
2015-12-01 0.050567
Length: 119, dtype: float64
Now that we have the fitted values from the ARIMA model, we will use the inverse transformation to get back the original values.
#First step - doing cumulative sum
predictions_cumsum = _______________ # use the .cumsum() function on the predictions
predictions_cumsum
#Second step - Adding the first value of the log series to the cumulative sum values
predictions_log = pd.Series(df_log['close'].iloc[0], index=df_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)
predictions_log
#Third step - applying exponential transformation
predictions_ARIMA = _________________ #use exponential function
predictions_ARIMA
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(________, color = 'c', label = 'Original Series') #plot the original train series
plt.plot(_____________, color = 'r', label = 'Predicted Series') #plot the predictions_ARIMA
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
Observations:
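The three inverse-transformation steps can be checked on a tiny example. The sketch below uses the first five closing prices shown in `df.head()` and, as a stand-in for the fitted values (an assumption), the exact log differences, so the round trip should recover the original prices.

```python
import numpy as np
import pandas as pd

# First five closing prices shown in df.head() above
dates = pd.date_range("2006-01-01", periods=5, freq="MS")
df_train = pd.DataFrame({"close": [45.22, 38.82, 36.38, 36.32, 34.13]}, index=dates)
df_log = np.log(df_train)

# Stand-in for the fitted values: the exact differences of the log series
predictions = df_log["close"].diff().dropna()

# First step: cumulative sum of the differences
predictions_cumsum = predictions.cumsum()

# Second step: add the first value of the log series to the cumulative sums
predictions_log = pd.Series(df_log["close"].iloc[0], index=df_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)

# Third step: exponentiate to return to the original price scale
predictions_ARIMA = np.exp(predictions_log)
print(predictions_ARIMA)
```

With real fitted values the recovered series will not match the prices exactly; the gap between the two is what the "Actual vs Predicted" plot shows.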
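To forecast the values for the next 24 months using the ARIMA model, we need to follow the steps below: forecast the log-transformed values for the next 24 months, convert the forecasts into a series indexed by the corresponding months, and apply the exponential transformation to return to the original price scale.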
#Forecasting the values for next 24 months
forecasted_ARIMA = _____________________ #forecast using the results_ARIMA for next 24 months. Keep steps=24
forecasted_ARIMA[0]
# Creating a list containing all the forecasted values
list1 = forecasted_ARIMA[0].tolist()
series1 = pd.Series(list1)
series1
#Making a new dataframe to get the additional dates from 2016-2018
index = pd.date_range('2016-01-01', periods=24, freq='MS') #24 month-start dates from 2016-01-01 to 2017-12-01
df1 = pd.DataFrame()
df1['forecasted'] = series1
df1.index = index
df1
#Applying exponential transformation to the forecasted log values
forecasted_ARIMA = _________________ #use exponential function on forecasted data
forecasted_ARIMA
Now, let's try to visualize the original data with the predicted values on the training data and the forecasted values.
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(df, color = 'c', label = 'Original Series')
plt.plot(__________, color = 'r', label = 'Prediction on Train data') #plot the predictions_ARIMA series
plt.plot(__________, label = 'Prediction on Test data', color='b') #plot the forecasted_ARIMA series
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
Observations:
Let's test the RMSE of the transformed predictions and the original value on the training and testing data to check whether the model is giving a generalized performance or not.
from sklearn.metrics import mean_squared_error
error = ____________________________ #calculate RMSE using the predictions_ARIMA and df_train
error
from sklearn.metrics import mean_squared_error
error = _________________________ #calculate RMSE using the forecasted_ARIMA and df_test
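For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The helper below (`rmse` is a hypothetical name) computes it with NumPy only. The actual values are the first three test observations listed earlier, while the predicted values are made up for illustration.

```python
import numpy as np


def rmse(actual, predicted):
    """Hypothetical helper: root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# First three test values from df_test above, against made-up forecasts
actual = [601.06, 530.62, 572.37]
predicted = [600.0, 540.0, 570.0]
print(rmse(actual, predicted))
```

Comparing this value on the training data with the value on the test data indicates how well the model generalizes: a much larger test RMSE suggests the forecasts drift away from the actual prices.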
Write your conclusion here