Stephen Joel

Feb 6, 2019

Your first Machine Learning Program using statsmodels and scikit-learn!

Are you excited yet? 😍
You are about to write your first machine learning program using Python!

If you haven’t already installed the Python packages for machine learning, you can find how to do it here! 🙌🙌

Machine learning in today’s world is all about building models using algorithms that uncover connections, systems, and organizations to make better decisions without human intervention. It is primarily concerned with the accuracy and effectiveness of the computer system. Machine learning is considered to be a subset of artificial intelligence.

Getting that out of the way, let’s write our first program!

1. Hydrocarbon Escape:

Download hydrocarbon_data, as this contains the data we’ll be using for this program.

//Understanding the problem
Hydrocarbons are organic compounds found in crude oil and natural gas. Hydrocarbons are made of hydrogen and carbon.

To keep hydrocarbons from escaping from a car, a catalytic converter completes the combustion of unburned hydrocarbons. Hydrocarbons also come from the bacterial decomposition of organic matter, forest fires, and vegetation. Incomplete combustion of gasoline-fueled vehicles and industrial emissions account for 1/6th of all atmospheric hydrocarbons.

While hydrocarbons cause no harmful effects, they can undergo chemical reactions in the presence of sunlight and nitrogen oxides. They can contribute to smog and the formation of toxic formaldehyde.

Today, we will use linear regression to model the escape of hydrocarbons when fuel is pumped into a car.

Our data attributes will include:
Tank Temperature (degrees Fahrenheit)
Petrol Temperature (degrees Fahrenheit)
Initial Tank Pressure (pounds/square inch)
Petrol Pressure (pounds/square inch)
Hydrocarbons Escaping (grams)

We’ll write this program using statsmodels and scikit-learn.

The program begins here!

1. Open a terminal (run as administrator) and start IPython.

2. First, we need to import NumPy, Pandas, and PyPlot from Matplotlib. Then, we display the first ten rows of the hydrocarbon_data file.

NumPy: (According to Wikipedia) NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with an extensive collection of high-level mathematical functions to operate on these arrays.

Pandas: (According to Wikipedia) In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Matplotlib: (According to Wikipedia) Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.
PyPlot provides a MATLAB-like plotting framework.

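import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Read the CSV into a DataFrame; replace the placeholder with your own path
df = pd.read_csv("<Your file location>/hydrocarbon_data.csv")
# Display the first 10 rows
df.head(n=10)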


We import the essential packages and read hydrocarbon_data into a DataFrame (df). We then display the first 10 rows using df.head(n=10).

If you call df.head() without an argument, it shows the first 5 rows (the default value of n is 5).
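If you don't have the CSV at hand yet, here is a minimal sketch of how head() behaves. The DataFrame below is a made-up stand-in that only borrows the column names used in this article; the values are random, not real measurements.

```python
import numpy as np
import pandas as pd

# Assumption: a toy stand-in for hydrocarbon_data.csv with the article's
# column names. The values are random and carry no meaning.
rng = np.random.default_rng(0)
n = 32
toy = pd.DataFrame({
    "index": np.arange(n),
    "ttemp": rng.uniform(30, 95, n),
    "ptemp": rng.uniform(30, 95, n),
    "tpres": rng.uniform(3, 7.5, n),
    "ppres": rng.uniform(3, 7.5, n),
    "hesc": rng.uniform(15, 55, n),
})

first_five = toy.head()      # default: n=5
first_ten = toy.head(n=10)   # explicit: first 10 rows

print(first_five.shape)
print(first_ten.shape)
```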

3. Plot the data as independent versus dependent variables.

fig, axs = plt.subplots(1, 4, sharey=True)
df.plot(kind='scatter', x='ttemp', y='hesc', ax=axs[0], figsize=(16, 2))
df.plot(kind='scatter', x='ptemp', y='hesc', ax=axs[1])
df.plot(kind='scatter', x='tpres', y='hesc', ax=axs[2])
df.plot(kind='scatter', x='ppres', y='hesc', ax=axs[3])


fig, axs = plt.subplots(1, 4, sharey=True)
plt.subplots returns a tuple (fig, axs), which is unpacked into two variables: the figure and an array of its axes.
The arguments to subplots describe the output figure:
(number of rows = 1, number of columns = 4, sharey=True means all plots share the same y-axis).

df.plot(kind='scatter', x='ptemp', y='hesc', ax=axs[1])
It draws a scatter plot (kind='scatter') on the axes in column 2 of row 1 of the figure (ax=axs[1]), with ptemp on the x-axis and hesc on the y-axis (x='ptemp', y='hesc').
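To see what subplots hands back, here is a small sketch you can run on its own. It uses the off-screen "Agg" backend (my choice, so it also runs without a display) and a few made-up points instead of the hydrocarbon data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend so no window is needed
from matplotlib import pyplot as plt

# One row, four columns, shared y-axis
fig, axs = plt.subplots(1, 4, sharey=True)

# axs is an array holding the four axes, indexed from 0
print(type(axs).__name__, axs.shape)

# Draw on the second panel (column 2 of row 1) with made-up points
axs[1].scatter([35, 60, 90], [20, 30, 45])
axs[1].set_xlabel("ptemp")

# Because sharey=True, setting the y-limits on one panel applies to all
axs[0].set_ylim(0, 10)
print(axs[3].get_ylim())
```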

Difference between plt.show() and plt.draw()

plt.show(): It displays the current figure that you are working on.

plt.draw(): It re-draws the figure. This allows you to work in interactive mode and, should you have changed your data or formatting, allow the graph itself to change.
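Here is a rough sketch of the re-draw idea, with made-up data. It uses the off-screen "Agg" backend (an assumption on my part, so it runs anywhere), which is why plt.show() is left as a comment.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend, no window opens
from matplotlib import pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4])

# Change the data after the line has been plotted
line.set_ydata([0, 2, 8])
ax.relim()            # recompute the data limits
ax.autoscale_view()   # rescale the axes to the new limits

plt.draw()            # ask matplotlib to re-render the current figure

# In an interactive session you would call plt.show() to display the window.
```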

Your scatterplot graph should look something like this.

4. Now we’ll utilize statsmodels, so let’s begin by importing it.

Statsmodels: statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

import statsmodels.formula.api as smf
lm = smf.ols(formula='hesc ~ ttemp', data=df).fit()


lm = smf.ols(formula='hesc ~ ttemp', data=df).fit()
The formula is written in patsy syntax: hesc ~ ttemp means hesc is modeled as a function of ttemp. data is the DataFrame to be used.
smf.ols takes the formula string and the DataFrame, and returns an OLS object representing the model. The name ols stands for “ordinary least squares.”
The fit() method fits the model to the data and returns an object containing the regression results.
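A quick way to convince yourself that the fit works is to try it on data where you already know the answer. This is a minimal sketch on made-up data: I generate hesc from a line with intercept 8 and slope 0.4 plus some noise (numbers chosen by me, not taken from the dataset), and the fit should recover values close to those.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: toy data generated from a known line, not the real dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({"ttemp": np.linspace(30, 90, 40)})
toy["hesc"] = 8.0 + 0.4 * toy["ttemp"] + rng.normal(0, 1.0, len(toy))

# Same call pattern as in the article: formula string + DataFrame, then fit()
lm_toy = smf.ols(formula="hesc ~ ttemp", data=toy).fit()

# params is a pandas Series indexed by "Intercept" and the column name
print(lm_toy.params)
```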

5. Now, copy paste the below code line by line.


A 1-unit increase in ttemp is associated with a 0.396703 increase in hesc.
y = 8.153444 + 0.396703 * x
(29) 21.244643 = 8.153444 + 0.396703 * 33

Null hypothesis: there is no relationship between ttemp and hesc. Result: reject.
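You can check the arithmetic of the line equation yourself. This sketch uses only the two coefficients reported above; the function name predict_hesc is made up for illustration.

```python
# Coefficients as reported in the article for hesc ~ ttemp
intercept = 8.153444
slope = 0.396703

def predict_hesc(ttemp):
    """Hypothetical helper: evaluate y = intercept + slope * x."""
    return intercept + slope * ttemp

prediction = predict_hesc(33)
print(round(prediction, 6))  # 21.244643
```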

6. You can get the summary by typing,

Program Summary
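The exact command is not shown here, so treat the following as a sketch rather than the author's own lines. In statsmodels, a fitted results object has a summary() method, along with params, pvalues, and rsquared attributes. The data below is made up, so the numbers will not match the article's.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: toy data, not the real hydrocarbon dataset
rng = np.random.default_rng(1)
toy = pd.DataFrame({"ttemp": np.linspace(30, 90, 40)})
toy["hesc"] = 8.0 + 0.4 * toy["ttemp"] + rng.normal(0, 1.0, len(toy))

lm_toy = smf.ols(formula="hesc ~ ttemp", data=toy).fit()

print(lm_toy.summary())    # full regression table
print(lm_toy.params)       # intercept and slope
print(lm_toy.pvalues)      # a small p-value for ttemp argues against the null hypothesis
print(lm_toy.rsquared)     # R-squared of the fit
```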

7. Predict hesc from ttemp, then plot the predictions on the graph.

xpred = df['ttemp']
preds = lm.predict(xpred)
plt.plot(df['index'], preds, c='red', linewidth=2)

8. Predicting more data.

lm2 = smf.ols(formula='hesc ~ ttemp + tpres', data=df).fit()
xpred2 = df[['ttemp','tpres']]
preds2 = lm2.predict(xpred2)
lm4 = smf.ols(formula='hesc ~ ttemp + ptemp + tpres + ppres', data=df).fit()
xpred4 = df[['ttemp','ptemp', 'tpres', 'ppres']]
preds4 = lm4.predict(xpred4)

9. Plot and display all the predictions.

plt.plot(df['index'], preds, c='red', linewidth=2)
plt.plot(df['index'], preds2, c='blue', linewidth=2)
plt.plot(df['index'], preds4, c='green', linewidth=2)

10. Another prediction.

feature_cols = ['ttemp','ptemp','tpres','ppres']
xpred5 = df[feature_cols]
ypred5 = df.hesc

11. Now we import scikit-learn.

from sklearn.linear_model import LinearRegression
# Create the model, then fit it on the features and the target
lm5 = LinearRegression()
lm5.fit(xpred5, ypred5)

12. Copy-paste the below lines one by one.
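The lines for this step are not shown, so here is a sketch of what you would typically inspect on a fitted scikit-learn LinearRegression: intercept_, coef_, and score() (which returns R-squared). This may not match the author's exact commands, and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumption: toy data with the article's column names, random values
rng = np.random.default_rng(2)
n = 32
toy = pd.DataFrame({
    "ttemp": rng.uniform(30, 95, n),
    "ptemp": rng.uniform(30, 95, n),
    "tpres": rng.uniform(3, 7.5, n),
    "ppres": rng.uniform(3, 7.5, n),
})
toy["hesc"] = 8 + 0.4 * toy["ttemp"] + 2.0 * toy["tpres"] + rng.normal(0, 2, n)

feature_cols = ["ttemp", "ptemp", "tpres", "ppres"]
X = toy[feature_cols]
y = toy["hesc"]

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)                       # fitted intercept
print(dict(zip(feature_cols, model.coef_)))   # one coefficient per attribute
r2 = model.score(X, y)                        # R-squared on the training data
print(r2)
```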


13. Using scikit-learn predictions to plot a graph.

preds5 = lm5.predict(xpred5)
df.plot(kind='scatter',x='index',y='hesc')
plt.plot(df['index'], preds, c='red', linewidth=2)
plt.plot(df['index'], preds2, c='blue', linewidth=2)
plt.plot(df['index'], preds4, c='green', linewidth=2)
plt.plot(df['index'], preds5, c='orange', linewidth=2)
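The orange line overwrites the green line, because both models are ordinary least squares fits on the same four attributes and so produce the same predictions.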
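You can confirm numerically that the two libraries agree. This is a minimal sketch on made-up data (random values under the article's column names): it fits the four-attribute model with both statsmodels and scikit-learn and compares the predictions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Assumption: toy data, not the real hydrocarbon dataset
rng = np.random.default_rng(3)
n = 32
toy = pd.DataFrame({
    "ttemp": rng.uniform(30, 95, n),
    "ptemp": rng.uniform(30, 95, n),
    "tpres": rng.uniform(3, 7.5, n),
    "ppres": rng.uniform(3, 7.5, n),
})
toy["hesc"] = 8 + 0.4 * toy["ttemp"] + 2.0 * toy["tpres"] + rng.normal(0, 2, n)

feature_cols = ["ttemp", "ptemp", "tpres", "ppres"]

# Same model, two libraries
sm_fit = smf.ols(formula="hesc ~ ttemp + ptemp + tpres + ppres", data=toy).fit()
sk_fit = LinearRegression().fit(toy[feature_cols], toy["hesc"])

sm_preds = np.asarray(sm_fit.predict(toy[feature_cols]))
sk_preds = sk_fit.predict(toy[feature_cols])

# True when the predictions match up to floating-point tolerance
same = np.allclose(sm_preds, sk_preds)
print(same)
```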

14. Go ahead and experiment on your own with linear regression and see if you can get a better line model (try one of the other attributes, then 2, 3, and 4 different attributes). Watch the R-squared value as you change models. For each experiment, note the attributes used, the equation of the line, the R-squared value, and how well the model predicts.


Which line model predicted the best?
How many independent attributes are good to use when trying to predict the dependent attribute?
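One way to organize the experiment is a small loop over formulas. The sketch below runs on made-up data, so the numbers are only illustrative. Keep in mind that plain R-squared never goes down when you add an attribute to an ordinary least squares model, so it helps to look at adjusted R-squared (rsquared_adj in statsmodels) as well.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: toy data with the article's column names, random values
rng = np.random.default_rng(42)
n = 32
toy = pd.DataFrame({
    "ttemp": rng.uniform(30, 95, n),
    "ptemp": rng.uniform(30, 95, n),
    "tpres": rng.uniform(3, 7.5, n),
    "ppres": rng.uniform(3, 7.5, n),
})
toy["hesc"] = 8 + 0.4 * toy["ttemp"] + 2.0 * toy["tpres"] + rng.normal(0, 2, n)

# Each formula adds one attribute to the previous one
formulas = [
    "hesc ~ ttemp",
    "hesc ~ ttemp + tpres",
    "hesc ~ ttemp + ptemp + tpres",
    "hesc ~ ttemp + ptemp + tpres + ppres",
]

r2_values = []
for formula in formulas:
    fit = smf.ols(formula=formula, data=toy).fit()
    r2_values.append(fit.rsquared)
    print(f"{formula:40s} R2={fit.rsquared:.3f}  adj R2={fit.rsquared_adj:.3f}")
```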

By doing this I hope you have a basic idea of what machine learning is like.

You have successfully written your first machine learning program!