Statsmodels in Data Science

Statsmodels is one of the most important libraries in Python's data science ecosystem. Inspired by the R statistical programming language, it is among the most capable Python libraries for working with time series data, and it is well suited to running advanced statistical tests and estimating statistical models. Statsmodels works alongside Pandas and NumPy (both data science libraries in Python) and comes pre-installed with the Anaconda distribution, so it is available in Anaconda's Jupyter Notebook out of the box.

How to import and use Statsmodels

It is advisable to import Statsmodels together with NumPy and Pandas:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

Importing statsmodels.api loads most of the public parts of Statsmodels, which makes most functions and classes conveniently available within one or two levels. statsmodels.formula.api, imported in addition to the usual statsmodels.api, exposes many of the same models under lowercase names (e.g. ols for OLS); these lowercase versions accept an R-style formula string together with a DataFrame.

Statsmodels can be used to perform advanced statistical calculations such as:

Regression and Linear Models

  • Linear Regression
  • Generalized Linear Models
  • Generalized Estimating Equations
  • Generalized Additive Models (GAM)
  • Robust Linear Models
  • Linear Mixed Effects Models
  • Regression with a Discrete Dependent Variable
  • Generalized Linear Mixed Effects Models
  • ANOVA

Time Series Analysis

  • Time Series Analysis by State Space Methods
  • Vector Autoregressions

Other Models

  • Methods for Survival and Duration Analysis
  • Nonparametric Methods

For more, see the user guide:

statsmodels.org/stable/user-guide.html

Performing a simple linear regression with Statsmodels

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt  # used to plot the variables

# Load the data using the Pandas library
data = pd.read_csv('Simple linear regression data.csv')

y = data['dependent-variable']
x1 = data['independent-variable']
# Add a constant. Essentially, we are adding a new column (equal in length to x1), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model, according to the OLS (ordinary least squares) method, with a dependent variable y and an independent variable x
results = sm.OLS(y, x).fit()
# Print a nice summary of the regression. The summaries are one of the strong points of statsmodels
print(results.summary())
# Create a scatter plot
plt.scatter(x1, y)
# Define the regression equation, so we can plot it
yhat = 0.0017*x1 + 0.275  # assumed coefficients for this post
# Plot the regression line against the independent variable
fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
# Label the axes
plt.xlabel('independent-variable', fontsize=20)
plt.ylabel('dependent-variable', fontsize=20)
plt.show()

Performing multiple linear regression

y = data['dependent-variable']
x1 = data[['independent-variable-1', 'independent-variable-2']]
# Add a constant. Essentially, we are adding a new column (equal in length to x1), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model, according to the OLS (ordinary least squares) method, with a dependent variable y and the independent variables in x
results = sm.OLS(y, x).fit()
print(results.summary())

This is just a brief summary of what the statsmodels library can do. In later posts we hope to explore more analyses with statsmodels, such as time series analysis, ANOVA, and other tests like autocorrelation and partial autocorrelation, VIF (a test for multicollinearity), the Durbin-Watson test and more.