Project: Time Series - Forecasting Stock Prices
Marks: 30
Welcome to the project on Time Series. We will use the Amazon Stock Prices dataset for this project.
## Context:
Stocks are one of the most popular financial instruments invented for building wealth and are the centerpiece of any investment portfolio. Recent advances in trading technology have opened up stock markets in such a way that nowadays, nearly anybody can own stock.
In the last few decades, there's been an explosive increase in the average person's interest in the stock market. This makes stock value prediction an interesting and popular problem to explore.
## Objective:
Amazon.com, Inc. engages in the retail sale of consumer products and subscriptions in North America as well as internationally. This dataset consists of monthly average closing stock prices of Amazon over a period of 12 years, from 2006 to 2017. We have to build time series models (AR, MA, ARMA, and ARIMA) in order to forecast the closing stock price of Amazon.
## Data Dictionary:
- date: Date when the price was collected
- close: Closing price of the stock
Importing libraries
Please note that we are downgrading the statsmodels library to version 0.12.1. Due to some variation, the latest version of the library might not give us the desired results. You can run the code below to downgrade the library and avoid any issues in the output. Once the code runs successfully, either restart the kernel or restart the Jupyter Notebook before importing the statsmodels library. It is enough to run the statsmodels install cell once. To be sure you are using the correct version of the library, you can use the code in the Version check cell below.
!pip install statsmodels==0.12.1
# Version check
import statsmodels
statsmodels.__version__
# Importing libraries for data manipulation
import pandas as pd
import numpy as np
#Importing libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
#Importing library for date manipulation
from datetime import datetime
#To calculate the MSE or RMSE
from sklearn.metrics import mean_squared_error
#Importing acf and pacf functions
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
#Importing models from statsmodels library
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
#To ignore the warnings
import warnings
warnings.filterwarnings('ignore')
Reading the dataset
#If you have an issue while loading the Excel file in pandas, run the below command in the Anaconda prompt; otherwise, ignore it.
#conda install -c anaconda xlrd
df = pd.read_excel('amazon_stocks_prices.xlsx')
df.head()
date close
0 2006-01-01 45.22
1 2006-02-01 38.82
2 2006-03-01 36.38
3 2006-04-01 36.32
4 2006-05-01 34.13
Checking info
Question 1: Check the info of the dataset and write your observations. (2 Marks)
#Write your code here
Observations:_____________
# Setting date as the index
df = df.set_index(['date'])
df.head()
close
date
2006-01-01 45.22
2006-02-01 38.82
2006-03-01 36.38
2006-04-01 36.32
2006-05-01 34.13
Now, let's visualize the time series to get an idea about the trend and/or seasonality within the data.
# Visualizing the time series
plt.figure(figsize=(16,8))
plt.xlabel("Month")
plt.ylabel("Closing Prices")
plt.title('Amazon Stock Prices')
plt.plot(df.index, df.close, color = 'c', marker='.')
Observations:
- We can see that the series has an upward trend with some seasonality. This implies that the average stock price of Amazon has been increasing almost every year.
- Before building different models, it is important to check whether the series is stationary or not.
Let us first split the dataset into train and test data
Splitting the dataset
# Splitting the data into train and test
df_train = df.loc['2006-01-01':'2015-12-01']
df_test = df.loc['2016-01-01' : '2017-12-01']
print(df_train)
print(df_test)
close
date
2006-01-01 45.22
2006-02-01 38.82
2006-03-01 36.38
2006-04-01 36.32
2006-05-01 34.13
... ...
2015-08-01 518.46
2015-09-01 520.96
2015-10-01 566.74
2015-11-01 657.70
2015-12-01 669.26
[120 rows x 1 columns]
close
date
2016-01-01 601.06
2016-02-01 530.62
2016-03-01 572.37
2016-04-01 613.59
2016-05-01 697.47
2016-06-01 716.39
2016-07-01 741.47
2016-08-01 764.84
2016-09-01 788.97
2016-10-01 824.44
2016-11-01 763.34
2016-12-01 763.33
2017-01-01 807.51
2017-02-01 835.75
2017-03-01 854.24
2017-04-01 903.39
2017-05-01 961.72
2017-06-01 990.44
2017-07-01 1008.48
2017-08-01 971.44
2017-09-01 968.99
2017-10-01 1000.72
2017-11-01 1139.81
2017-12-01 1168.84
Now let us check the rolling mean and standard deviation of the series to visualize if the series has any trend or seasonality.
Testing the stationarity of the series
# Calculating the rolling mean and standard deviation for a window of 12 observations
rolmean=df_train.rolling(window=12).mean()
rolstd=df_train.rolling(window=12).std()
#Visualizing the rolling mean and standard deviation
plt.figure(figsize=(16,8))
actual = plt.plot(df_train, color='c', label='Actual Series')
rollingmean = plt.plot(rolmean, color='red', label='Rolling Mean')
#rollingstd = plt.plot(rolstd, color='green', label='Rolling Std. Dev.')
plt.title('Rolling Mean & Standard Deviation of the Series')
plt.legend()
plt.show()
Observations:
- We can see there is an upward trend in the series.
- We can confirm that the series is not stationary.
We can also use the Augmented Dickey-Fuller (ADF) Test to verify if the series is stationary or not. The null and alternate hypotheses for the ADF Test are defined as:
- Null hypothesis: The Time Series is non-stationary
- Alternative hypothesis: The Time Series is stationary
#Define a function to use adfuller test
def adfuller(df_train):
    #Importing adfuller using statsmodels
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(df_train['close'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic', 'p-value', 'Lags Used', 'No. of Observations'])
    for key, value in adftest[4].items():
        adfoutput['Critical Value (%s)' % key] = value
    print(adfoutput)
adfuller(df_train)
Dickey-Fuller Test:
Test Statistic 3.464016
p-value 1.000000
Lags Used 0.000000
No. of Observations 119.000000
Critical Value (1%) -3.486535
Critical Value (5%) -2.886151
Critical Value (10%) -2.579896
dtype: float64
Observations:
- From the above test, we can see that the p-value = 1, which is greater than 0.05 (at the 95% confidence level); therefore, we fail to reject the null hypothesis.
- Hence, we can confirm that the series is non-stationary.
Making the series stationary
We can use some of the following methods to convert a non-stationary series into a stationary one:
- Log Transformation
- By differencing the series (lagged series)
Let's first use a log transformation over this series to remove exponential variance and check the stationarity of the series again.
# Visualize the rolling mean and standard deviation after using log transformation
plt.figure(figsize=(16,8))
df_log = np.log(df_train)
MAvg = df_log.rolling(window=12).mean()
MStd = df_log.rolling(window=12).std()
plt.plot(df_log)
plt.plot(MAvg, color='r', label = 'Moving Average')
plt.plot(MStd, color='g', label = 'Standard Deviation')
plt.legend()
plt.show()
Observations:
- Since we can still see the upward trend in the series, we can conclude that the series is still non-stationary.
- However, the standard deviation is almost constant which implies that now the series has constant variance.
Let's shift the series by order 1 (or by 1 month) & apply differencing (using lagged series) and then check the rolling mean and standard deviation.
Question 2: Visualize the rolling mean and rolling standard deviation of the shifted series (df_shift) and check the stationarity by calling the adfuller() function. Also, write your observations on the same. (3 Marks)
plt.figure(figsize=(16,8))
df_shift = df_log - df_log.shift(periods = 1)
MAvg_shift = ______________________________
MStd_shift = ______________________________
plt.plot(_________, color='c')
plt.plot(__________, color='red', label = 'Moving Average')
plt.plot(__________, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
#Dropping the null values that we get after applying the differencing method
df_shift = df_shift.dropna()
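If you are unsure how to fill in the blanks above, here is a minimal sketch of one possible completion (not the official solution); it drops the NaN created by differencing before computing the rolling statistics:
plt.figure(figsize=(16, 8))
df_shift = df_log - df_log.shift(periods=1)        # difference the log series by 1 month
df_shift = df_shift.dropna()                       # drop the NaN introduced by the shift
MAvg_shift = df_shift.rolling(window=12).mean()    # rolling mean of the differenced series
MStd_shift = df_shift.rolling(window=12).std()     # rolling standard deviation of the differenced series
plt.plot(df_shift, color='c')
plt.plot(MAvg_shift, color='red', label='Moving Average')
plt.plot(MStd_shift, color='green', label='Standard Deviation')
plt.legend()
plt.show()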
Observations:___________________
Let us use the adfuller test to check the stationarity.
#____________________ # call the adfuller function for df_shift series
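One way to complete the cell above is simply to reuse the adfuller() helper defined earlier on the differenced series (a sketch, assuming df_shift still holds a 'close' column):
adfuller(df_shift)   # ADF test on the differenced log series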
Observations:
- ________________________
Let's decompose the time series to check its different components.
Decomposing the time series components into Trend, Seasonality and Residual
#Importing the seasonal_decompose function to decompose the time series
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(df_train)
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
plt.figure(figsize=(15,10))
plt.subplot(411)
plt.plot(df_train, label='Actual', marker='.')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend', marker='.')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality', marker='.')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residuals', marker='.')
plt.legend(loc='upper left')
plt.tight_layout()
Observations:
- We can see that there are significant trend, seasonality and residuals components in the series
- The plot for seasonality shows that Amazon's stock prices spike in July, September, and December.
Now let's move on to the model building section. First, we will plot the ACF and PACF plots to get the values of p and q, i.e., the orders of the AR and MA models to be used.
Plotting the auto-correlation function and partial auto-correlation function to get p and q values for AR, MA, ARMA, and ARIMA models
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plt.figure(figsize = (16,8))
plot_acf(df_shift, lags = 12)
plt.show()
plot_pacf(df_shift, lags = 12)
plt.show()
Observations:
- From the above PACF plot we can see that the highest lag at which the plot extends beyond the statistically significant boundary is lag 1.
- This indicates that an AR Model of lag 1 (p=1) should be sufficient to fit the data.
- Similarly, from the ACF plot, we can infer that q=1.
AR Model
Question 3: Fit and predict the shifted series with the AR Model and calculate the RMSE. Also, visualize the time series and write your observations. (5 Marks)
#Importing AutoReg function to apply AR model
from statsmodels.tsa.ar_model import AutoReg
plt.figure(figsize=(16,8))
model_AR = _______________ #Use number of lags as 1 and apply AutoReg function on df_shift series
results_AR = ________________ #fit the model
plt.plot(df_shift)
predict = ______________________________ #predict the series
predict = predict.fillna(0) #Converting NaN values to 0
plt.plot(___________, color='red')
plt.title('AR Model - RMSE: %.4f'% mean_squared_error(predict,df_shift['close'], squared=False)) #Calculating rmse
plt.show()
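A minimal sketch of one possible completion is shown below. It assumes statsmodels 0.12.x, where in-sample AutoReg predictions before the first usable lag are returned as NaN (which is why the template converts NaN to 0):
plt.figure(figsize=(16, 8))
model_AR = AutoReg(df_shift, lags=1)                          # AR(1) on the differenced log series
results_AR = model_AR.fit()
plt.plot(df_shift)
predict = results_AR.predict(start=0, end=len(df_shift) - 1)  # in-sample predictions
predict = predict.fillna(0)                                   # first value has no lag available
plt.plot(predict, color='red')
plt.title('AR Model - RMSE: %.4f' % mean_squared_error(predict, df_shift['close'], squared=False))
plt.show()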
Observations:________________________
Let's check the AIC value of the model
results_AR.aic
-4.781419615400342
Now, let's build MA, ARMA, and ARIMA models as well, and see if we can get a better model
MA Model
We will be using an ARIMA model with p=0, d=0, and q=1 so that it works as an MA model.
Question 4: Fit and predict the shifted series with the MA Model and calculate the RMSE. Also, visualize the time series and write your observations. (2 Marks)
from statsmodels.tsa.arima_model import ARIMA
plt.figure(figsize=(16,8))
model_MA = _________________ #Using p=0, d=0, q=1 and apply ARIMA function on df_shift series
results_MA = _____________ #fit the model
plt.plot(________)
plt.plot(_________________, color='red')
plt.title('MA Model - RMSE: %.4f'% mean_squared_error(results_MA.fittedvalues,df_shift['close'], squared=False))
plt.show()
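A sketch of one possible completion, assuming the legacy statsmodels.tsa.arima_model.ARIMA imported above (deprecated and later removed in newer statsmodels releases):
plt.figure(figsize=(16, 8))
model_MA = ARIMA(df_shift, order=(0, 0, 1))   # p=0, d=0, q=1 -> a pure MA(1) model
results_MA = model_MA.fit()
plt.plot(df_shift)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('MA Model - RMSE: %.4f' % mean_squared_error(results_MA.fittedvalues, df_shift['close'], squared=False))
plt.show()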
Observations:________________________
Let's check the AIC value of the model
results_MA.aic
-229.09493835479742
- The MA model is giving a much lower AIC when compared to the AR model, implying that the MA model fits the training data better.
ARMA Model
We will be using an ARIMA model with p=1 and q=1 (as observed from the ACF and PACF plots) and d=0 so that it will work as an ARMA model.
Question 5: Fit and predict the shifted series with the ARMA Model and calculate the RMSE. Also, visualize the time series and write your observations. (2 Marks)
plt.figure(figsize=(16,8))
model_ARMA = _______________ #Using p=1, d=0, q=1 and apply ARIMA function on df_shift series
results_ARMA = _______________ #fit the model
plt.plot(_______)
plt.plot(___________, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues,df_shift['close'], squared=False))
plt.show()
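A sketch of one possible completion, again using the legacy ARIMA class with d=0 so that it behaves as an ARMA(1,1) model:
plt.figure(figsize=(16, 8))
model_ARMA = ARIMA(df_shift, order=(1, 0, 1))   # p=1, d=0, q=1 -> ARMA(1,1)
results_ARMA = model_ARMA.fit()
plt.plot(df_shift)
plt.plot(results_ARMA.fittedvalues, color='red')
plt.title('ARMA Model - RMSE: %.4f' % mean_squared_error(results_ARMA.fittedvalues, df_shift['close'], squared=False))
plt.show()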
Observations:
- ________________________
Let's check the AIC value of the model
results_ARMA.aic
-227.11129132564088
- The AIC value of the ARMA model is more or less similar to that of the MA model.
Let us try using the ARIMA Model.
ARIMA Model
We will be using an ARIMA Model with p=1, d=1, & q=1.
Question 6: Fit and predict the shifted series with the ARIMA Model and calculate the RMSE. Also, visualize the time series and write your observations. (2 Marks)
from statsmodels.tsa.arima_model import ARIMA
plt.figure(figsize=(16,8))
model_ARIMA = ________________ #Using p=1, d=1, q=1 and apply ARIMA function on df_log series
results_ARIMA = ___________________ #fit the model
plt.plot(___________)
plt.plot(________________, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues,df_shift['close'], squared=False))
plt.show()
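A sketch of one possible completion. With d=1 the model differences df_log internally, so its fitted values are on the differenced log scale and can be compared with df_shift directly:
plt.figure(figsize=(16, 8))
model_ARIMA = ARIMA(df_log, order=(1, 1, 1))       # p=1, d=1, q=1 on the log series
results_ARIMA = model_ARIMA.fit()
plt.plot(df_shift)
plt.plot(results_ARIMA.fittedvalues, color='red')  # fitted values are on the differenced log scale
plt.title('ARIMA Model - RMSE: %.4f' % mean_squared_error(results_ARIMA.fittedvalues, df_shift['close'], squared=False))
plt.show()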
Observations:________
Let's check the AIC value of the model
results_ARIMA.aic
-227.11129236959937
- The AIC value of the ARIMA model is the same as the ARMA model.
We can see that all the models return almost the same RMSE. There is not much difference in the AIC values across the models either, except for the AR model.
We can choose to predict the values using ARIMA as it takes into account more factors than AR, MA, ARMA models.
# Printing the fitted values
predictions=pd.Series(results_ARIMA.fittedvalues)
predictions
date
2006-02-01 0.022235
2006-03-01 -0.019667
2006-04-01 0.009184
2006-05-01 0.018985
2006-06-01 0.001615
...
2015-08-01 0.043234
2015-09-01 0.032286
2015-10-01 0.015696
2015-11-01 0.039276
2015-12-01 0.050567
Length: 119, dtype: float64
Inverse Transformation
Now that we have the fitted values from the ARIMA model, we will use the inverse transformation to get back the original values.
Question 7: Apply an inverse transformation on the predictions of the ARIMA Model. (5 Marks)
#First step - doing cumulative sum
predictions_cumsum = _______________ # use the .cumsum() function on the predictions
predictions_cumsum
#Second step - Adding the first value of the log series to the cumulative sum values
predictions_log = pd.Series(df_log['close'].iloc[0], index=df_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)
predictions_log
#Third step - applying exponential transformation
predictions_ARIMA = _________________ #use exponential function
predictions_ARIMA
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(________, color = 'c', label = 'Original Series') #plot the original train series
plt.plot(_____________, color = 'r', label = 'Predicted Series') #plot the predictions_ARIMA
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
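One possible way to fill in the three steps and the plot above (a sketch, not the official solution):
predictions_cumsum = predictions.cumsum()          # step 1: undo the differencing with a cumulative sum
predictions_log = pd.Series(df_log['close'].iloc[0], index=df_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)   # step 2: add back the first log value
predictions_ARIMA = np.exp(predictions_log)        # step 3: undo the log transformation
plt.figure(figsize=(16, 8))
plt.plot(df_train, color='c', label='Original Series')
plt.plot(predictions_ARIMA, color='r', label='Predicted Series')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()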
Observations:
- We can see that the predicted series is very similar to the original series, i.e., the model predicts the training data well, except for the dip in stock prices in 2015, which may have been due to external factors not captured by this model.
- Let us forecast the closing prices for the next 24 months.
Forecasting the values for the next 24 months and comparing them with the test data
To forecast the values for the next 24 months using the ARIMA model, we need to follow the steps below:
- Forecast the log-transformed fitted values for the next 24 months
- Make a list of these 24 months' (2016-2017) forecasted values
- Convert that list into a series so that we can work with pandas functions
- Make a dataframe where we have the dates starting from 2016-01-01 to 2017-12-01 as the index and the respective forecasted values
- Apply the inverse transformation and get the real forecasted values
Question 8: Forecast the stock prices for the next 24 months and perform the inverse transformation. (5 Marks)
#Forecasting the values for next 24 months
forecasted_ARIMA = _____________________ #forecast using the results_ARIMA for next 24 months. Keep steps=24
forecasted_ARIMA[0]
# Creating a list containing all the forecasted values
list1 = forecasted_ARIMA[0].tolist()
series1 = pd.Series(list1)
series1
#Making a new dataframe to get the additional dates from 2016-2018
index = pd.date_range('2016-01-1','2018-1-1' , freq='1M')- pd.offsets.MonthBegin(1)
df1 = pd.DataFrame()
df1['forecasted'] = series1
df1.index = index
df1
#Applying exponential transformation to the forecasted log values
forecasted_ARIMA = _________________ #use exponential function on forecasted data
forecasted_ARIMA
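A sketch of one possible completion, assuming the legacy ARIMA results object whose forecast() returns a (forecast, stderr, conf_int) tuple, consistent with the forecasted_ARIMA[0] indexing already used in the template:
forecasted_ARIMA = results_ARIMA.forecast(steps=24)     # point forecasts (log scale), std. errors, conf. intervals
list1 = forecasted_ARIMA[0].tolist()
series1 = pd.Series(list1)
index = pd.date_range('2016-01-1', '2018-1-1', freq='1M') - pd.offsets.MonthBegin(1)
df1 = pd.DataFrame()
df1['forecasted'] = series1
df1.index = index
forecasted_ARIMA = np.exp(df1['forecasted'])             # inverse transformation back to the price scale
forecasted_ARIMA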
Now, let's try to visualize the original data with the predicted values on the training data and the forecasted values.
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(df, color = 'c', label = 'Original Series')
plt.plot(__________, color = 'r', label = 'Prediction on Train data') #plot the predictions_ARIMA series
plt.plot(__________, label = 'Prediction on Test data', color='b') #plot the forecasted_ARIMA series
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
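One possible completion of the plotting cell above (sketch only):
plt.figure(figsize=(16, 8))
plt.plot(df, color='c', label='Original Series')
plt.plot(predictions_ARIMA, color='r', label='Prediction on Train data')
plt.plot(forecasted_ARIMA, label='Prediction on Test data', color='b')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()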
Observations:
- As observed earlier, most of the predicted values on the training data are very close to the actual values except for the dip in stock prices in the year 2015.
- On the test data, the model correctly captures the trend of the stock prices: the blue line stays close to the actual values (cyan) and both show an upward trend. However, the test predictions are not able to capture the volatile variations in the stock prices over the last 2 years.
Let's test the RMSE of the transformed predictions and the original value on the training and testing data to check whether the model is giving a generalized performance or not.
Question 9: Check the RMSE on the original train and test data and write your conclusion from the above analysis. (4 Marks)
from sklearn.metrics import mean_squared_error
error = ____________________________ #calculate RMSE using the predictions_ARIMA and df_train
error
from sklearn.metrics import mean_squared_error
error = _________________________ #calculate RMSE using the forecasted_ARIMA and df_test
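A sketch of one possible completion, assuming scikit-learn 0.22 or later, where squared=False makes mean_squared_error return the RMSE (as in the earlier cells):
error = mean_squared_error(df_train['close'], predictions_ARIMA, squared=False)   # RMSE on the training data
print(error)
error = mean_squared_error(df_test['close'], forecasted_ARIMA, squared=False)     # RMSE on the test data
print(error)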
Conclusion
Write your conclusion here