Welcome to “Practice with Eskwelabs!” By the end of this blog post, you’ll have completed a data project of your own! But don’t worry, you won’t be going through it alone. Every “Practice with Eskwelabs” blog post features a friendly data scientist BFF who will give you a step by step tutorial and introduction to the concepts being used.

Practice with Basty

Hi, everyone! My name is Basty, and I’m a data scientist and educator from Metro Manila. If there’s one thing I value deeply, it’s education. It has allowed me not only to dive into the amazing world of code and data, but also to encourage and inspire others to do the same. Read more about me here.

Welcome back, my friend! Today’s article is a continuation of the previous one, where we discussed the basics of Time Series. This time, we’ll be learning about Time Series Forecasting, specifically with SARIMA, and afterwards I’ll take you through the task of implementing a Time Series Forecasting model on the Manila Sea Level Data. Let’s not waste any more time (no pun intended), and let’s start forecasting!

What is ARIMA?

Let’s first recall what Time Series Forecasting means: it is the process of analyzing and modeling time series data in order to predict future values and support decisions, hence forecasting! In the previous article, we also learned that time series forecasting is applied to weather forecasting, sales forecasting, business forecasting, stock price forecasting, and a lot more.

I’ve already mentioned that we’ll be focusing on time series forecasting this time, specifically with SARIMA, but first, it’s important that we understand what ARIMA is. In time series, there is a concept known as stationarity: a property of a time series whose statistical characteristics, such as the mean and variance, do not change over time. And so, we are met with the most commonly used model for stationary time series: ARIMA.

ARIMA stands for Autoregressive Integrated Moving Average, and you guessed it, it’s used to forecast or predict future points in a time series. ARIMA models can even be applied when the data is non-stationary, because the “I” (Integrated) part of ARIMA handles the transformation of non-stationary data into stationary data through differencing.

ARIMA models take three parameters and are generally denoted as ARIMA(p, d, q):

  • p: the number of lagged observations included in the autoregressive part of the model.
    * take note that a lag refers to a fixed amount of passing time (one step of the series)
  • d: the number of times the data needs to be differenced to produce a stationary signal. If the data is already stationary, d should be 0; if it is non-stationary (for example, trending or seasonal), d is typically 1.
  • q: the number of lagged forecast errors included in the moving average part of the model.
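To make d concrete, here is a tiny sketch (with made-up numbers) of what one round of differencing does in pandas: a steadily trending series becomes a constant one after a single difference, which is exactly why d = 1 is often enough:

```python
import pandas as pd

# A series with a steady upward trend - its mean changes over time
s = pd.Series([10, 12, 14, 16, 18, 20])

# d = 1: take the first difference once (each value minus the previous one)
diff1 = s.diff().dropna()
print(diff1.tolist())  # → [2.0, 2.0, 2.0, 2.0, 2.0], the trend is gone
```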
Now you might be wondering how to get the values of p and q. There are standard methods to determine their optimal values: analyzing the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots. Here are two examples of what they look like:

Source: Towards Data Science


Source: Towards Data Science

We won’t be diving deep into these concepts, but generally, the blue bands represent the confidence intervals, and the lags start at 0, which is the correlation of the time series with itself and therefore always equals 1. Here is an additional resource you can check to understand these concepts in depth.

So if the data is stationary (or can be made stationary), we can use ARIMA to forecast future values. That also means, however, that ARIMA models won’t perform well on seasonal time series data. What model should we use if the data is seasonal? Enter SARIMA!

What is SARIMA?

Seasonal Autoregressive Integrated Moving Average, better known as SARIMA, combines simpler models into one that can handle time series data exhibiting both non-stationary behavior and seasonality. It is very similar to ARIMA, but it also takes the seasonality of the dataset into account.

In SARIMA, we deal with the same parameters as in ARIMA, p (Autoregression), d (Order of Integration), and q (Moving Average), plus a seasonal component, usually written (P, D, Q, s), where s is the length of the season. As with ARIMA, the p and q values of the model can be obtained from the ACF and PACF plots.

In addition to Autoregressive (AR) and Moving Average (MA), I’ve mentioned two other parameters that we use for SARIMA. The third is the order of integration (d). As in ARIMA, this parameter represents the number of differences required to make the series stationary: 0 if the data is already stationary, and typically 1 otherwise. Lastly, we have the seasonality (s), which is the season’s length.

Time Series Forecasting with SARIMA

That was quite a lot of theory to take into mind! Let’s now apply the SARIMA model in your first time series project! For this step-by-step tutorial, we’ll be using Permanent Service for Mean Sea Level’s Annual Tide Gauge Data.

You can grab the dataset here!

You can also download the Jupyter notebook of the code described in this tutorial here.

Let’s first import the necessary libraries, and read the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data2.csv", delimiter=";")

df.head()

Data Cleaning

Now we need to make sure that our date column is in datetime format. After that, we’ll set it as our index.
df['date'] = pd.to_datetime(df['Date'], format='%Y')

# set the date column into index
df = df.set_index('date')

# drop the date column
df.drop(columns=['Date'], inplace=True)

df.head()

Make sure to also check if there are any negative values that could mess up our analysis.

(df < 0).values.any()

# Removing any negative values that may affect the plot
df = df[df.select_dtypes(include=[np.number]).ge(0).all(1)]

Plotting Time Series

Now let’s visualize the rising sea level of Manila before moving forward.

plt.plot(df.index, df['Sea_level'])

Looking at the graph, there is definitely an upward trend since the 1970s, and the increase becomes especially rapid in the 2000s. One likely reason for this is human-caused global warming: as the planet warms, ice sheets melt and seawater expands, both of which add to sea level.

Another reason sea level rises rapidly in Manila is the extraction of groundwater driven by population growth and urbanization, which contributes a rise of about 13.24 millimeters per year. If you’d like to read more on this, here is an article published by Philstar.

The next thing we need to do is forecast our data; before choosing a model, however, we need to identify whether there is seasonality in the data. To check whether our data is stationary or seasonal, we can use seasonal decomposition, which splits the data into trend, seasonal, and residual components for a better understanding of the time series.


from statsmodels.tsa.seasonal import seasonal_decompose

# note: in statsmodels >= 0.12 the 'freq' argument has been renamed 'period'
result = seasonal_decompose(df['Sea_level'], model='multiplicative', period=30)

fig = plt.figure()
fig = result.plot()
fig.set_size_inches(15,10)
plt.suptitle("Manila Yearly Sea Level Rise", y=1.01)

So our data is seasonal, and that is why we are using the SARIMA model to forecast it. Like I’ve mentioned, we need to find the p, d, and q values in order to use the model. We can find p using the PACF, and q using the ACF. For the value of d, we will use 1 since the data is seasonal.

Now let’s plot our ACF and PACF. We can use the plot_acf() and plot_pacf() functions available in the statsmodels library. The values of p and q are based on the last lag that falls outside the confidence intervals (light blue) in the PACF and ACF plots, respectively. In our plots, the values are p = 8 and q = 1.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(df['Sea_level'], lags=12)
plt.show()
plot_pacf(df['Sea_level'], lags=12)
plt.show()

Now we’re ready to build our SARIMA model! We can use the SARIMAX class, also provided by the statsmodels library. We will then fit our model and get the predictions through the get_prediction() function, and the confidence intervals through the conf_int() function.


from statsmodels.tsa.statespace.sarimax import SARIMAX

ts = df['Sea_level']

p = 8
d = 1
q = 1
# a seasonal component can also be passed via seasonal_order=(P, D, Q, s)
model = SARIMAX(ts, order=(p, d, q))
model_fit = model.fit(disp=1, method='powell')

fcast = model_fit.get_prediction(start=1, end=len(ts))
ts_p = fcast.predicted_mean
ts_ci = fcast.conf_int()

Before we plot the results, there is some necessary preprocessing we need to do with the results data that we have.

# the 11 forecast dates: January 1st of 2021 through 2031
next_years = pd.date_range('2021-01-01', periods=11, freq='YS')
forecast_years = pd.DataFrame({'date': next_years})


df2 = pd.read_csv('data2.csv')
df2['date'] = pd.to_datetime(df2['Date'], format='%Y')
df2 = df2[df2.select_dtypes(include=[np.number]).ge(0).all(1)]
df2.drop(columns=['Date', 'Sea_level'], inplace=True)
df2.reset_index(drop=True, inplace=True)
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df2 = pd.concat([df2, forecast_years], ignore_index=True)
df2.head()

# converting the predictions series to a dataframe
ts_pred = ts_p.reset_index(drop=True).to_frame(name='pred')
ts_pred.head()

# Creating predictions columns
df2['Sea_level'] = ts_pred['pred']
df2.set_index('date', inplace=True)
df2.head()

# Creating a new copy of the dataframe for the confidence intervals
df3 = pd.read_csv('data2.csv')
df3['date'] = pd.to_datetime(df3['Date'], format='%Y')
df3 = df3[df3.select_dtypes(include=[np.number]).ge(0).all(1)]
df3.drop(columns=['Date', 'Sea_level'], inplace=True)
df3.reset_index(drop=True, inplace=True)
df3 = pd.concat([df3, forecast_years], ignore_index=True)

# cleaning the confidence interval dataframe
ts_ci.reset_index(drop=True, inplace=True)

# creating confidence intervals columns into df
df3['lower Sea_level'] = ts_ci['lower Sea_level']
df3['upper Sea_level'] = ts_ci['upper Sea_level']
df3.set_index('date', inplace=True)
df3.head()

Finally, let’s plot the predictions!

plt.plot(df['Sea_level'], color='red', label='actual')
plt.plot(df2['Sea_level'], color='blue', label='forecast')
plt.fill_between(df3.index[1:],
                 df3['lower Sea_level'][1:],
                 df3['upper Sea_level'][1:], color='k', alpha=.2)

plt.ylabel('Sea Level Mean')
plt.legend()
plt.tight_layout()
plt.grid()
plt.show()

There we have it, we just forecasted the rise of sea level in Manila using SARIMA! Now the question is, what does this graph tell us? Clearly, sea levels are projected to keep rising, quite rapidly, in the coming years. The gray area shows the confidence interval around each prediction, a reminder that there is uncertainty in any forecast. As I’ve already mentioned, a possible (and obvious) reason for this is global warming, which melts ice sheets and expands seawater.

Despite this being an obvious cause, it is worth noting that the rise might not be entirely due to global warming. As I’ve also mentioned, the rapid rise of sea level in Manila is partly due to the rapid extraction of groundwater. Here are some other factors that may have contributed to rising sea levels:

  • Excessive Reclamation - land reclamation has also increased the tidal range and the risk of flooding in several areas.
  • River Discharge - Sea level rises as the river discharge increases.
  • Subsidence - In coastal areas, sinking land, known as subsidence, leads to higher sea level and increased flood risk.

With the data at hand, and the rapid rise of sea level, what should our actionable steps be? In Manila, there are already people working to prioritize climate action, including strategies to curb greenhouse gas emissions.

Summary

Whew, we sure did go through a lot! Pat yourself on the back, because you did a great job. Let’s summarize what we did, shall we? First, we learned about ARIMA, which stands for Autoregressive Integrated Moving Average, and about SARIMA, which stands for Seasonal Autoregressive Integrated Moving Average. These are algorithms we use for forecasting time series data.

It is also important to remember that if the data is stationary, we use ARIMA, and if the data is seasonal, we use SARIMA. Along the way, we also tackled several time series concepts such as stationarity, ACF, and PACF. And then finally, you did your very first time series project by forecasting the rise of sea level in Manila.

I hope you liked this article about time series forecasting with SARIMA using Python. If you’re interested in learning more about time series, make sure to check out the 12-week Data Science Fellowship, where you’ll be applying time series analysis on the music industry!

Never stop learning!

From the notebook of Basty Vergara | Connect with Basty via LinkedIn and Notion


RECOMMENDED NEXT STEPS

Updated for Data Science Fellowship Cohort 10 | Classes for Cohort 10 start on September 12, 2022.

  • If you’re ready to dive in
    • Enroll in the Data Science Fellowship via the sign up link here and take the assessment exam.
      • Note: The assessment exam is a key part of your application. The deadline for the assessment is on August 21, 2022.
  • If you want to know more
