Feb 6, 2019

# Your first Machine Learning Program using Statsmodels and Scikit-Learn!

Are you excited yet? 😍

You are about to write your first machine learning program using Python!

*If you haven’t already installed the Python packages for machine learning, you can find out how to do it **here**! 🙌🙌*

Machine learning in today’s world is all about building models using algorithms that uncover connections, systems, and organizations in data, so that better decisions can be made without human intervention. It is primarily concerned with the accuracy and effectiveness of the computer system. Machine learning is considered a subset of artificial intelligence.

*Getting that out of the way, let’s write our first program!*

# 1. Hydrocarbon Escape:

Download *hydrocarbon_data*, as this contains the data we’ll be using for this program.

*//Understanding the problem*

Hydrocarbons are organic compounds found in crude oil and natural gas. They are made of hydrogen and carbon.

To keep hydrocarbons from escaping from a car, a catalytic converter completes the combustion of unburned hydrocarbons. Hydrocarbons also come from the bacterial decomposition of organic matter, forest fires, and vegetation. Incomplete combustion in gasoline-fueled vehicles and industrial emissions account for 1/6th of all atmospheric hydrocarbons.

While hydrocarbons cause no harmful effects, they can undergo chemical reactions in the presence of sunlight and nitrogen oxides. They can contribute to smog and the formation of toxic formaldehyde.

Today, we will use linear regression to model the escape of hydrocarbons when fuel is pumped into a car.

*Our data attributes will include:*

- Tank Temperature, ttemp (degrees Fahrenheit)
- Petrol Temperature, ptemp (degrees Fahrenheit)
- Initial Tank Pressure, tpres (pounds/square inch)
- Petrol Pressure, ppres (pounds/square inch)
- Hydrocarbons Escaping, hesc (grams)


*We’ll write this program using **Statsmodels** and **Scikit-Learn**.*

## The program begins here!

- Open terminal (run as administrator) and open ipython.

*ipython*

**2.** First, we need to import *NumPy, Pandas,* and *PyPlot from Matplotlib*. Then, we display the first ten elements from the hydrocarbon_data file.

**NumPy:** (*According to Wikipedia) NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with an extensive collection of high-level mathematical functions to operate on these arrays.*

**Pandas:** (*According to Wikipedia) In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.*

**Matplotlib:** (*According to Wikipedia) Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. PyPlot provides a MATLAB-like plotting framework.*

import numpy as np

import pandas as pd

from matplotlib import pyplot as plt

df = pd.read_csv("<Your file location>/hydrocarbon_data.csv")

df.head(n=10)

## Explanation:

*We import the essential packages, then read hydrocarbon_data into a DataFrame (df). We then display the first 10 rows using df.head(n=10).*

*(Default value of n = 5)*
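If you don't have the CSV at hand yet, here is a minimal sketch of the same read-and-preview step. It assumes a tiny made-up table: the values in `csv_text` are invented for illustration, and only the column names follow the ones used in this post's code. It reads the text from memory instead of from disk.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real hydrocarbon_data.csv has its own values.
csv_text = """index,ttemp,ptemp,tpres,ppres,hesc
0,30,50,3.0,3.1,25
1,35,55,3.3,3.4,27
2,40,60,3.6,3.7,30
3,45,62,3.9,4.0,32
4,50,65,4.2,4.3,35
5,55,70,4.5,4.6,38
6,60,72,4.8,4.9,41
"""

# read_csv accepts a file path or any file-like object, such as StringIO
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())       # no argument: the first 5 rows
print(df_demo.head(n=3))    # explicit n: the first 3 rows
print(df_demo.shape)        # (number_of_rows, number_of_columns)
```

Swapping `io.StringIO(csv_text)` for a file path gives you the same call used in the step above.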

**3.** Plot the data as independent versus dependent variables.

fig, axs = plt.subplots(1, 4, sharey=True)

df.plot(kind='scatter', x='ttemp', y='hesc', ax=axs[0], figsize=(16, 2))

df.plot(kind='scatter', x='ptemp', y='hesc', ax=axs[1])

df.plot(kind='scatter', x='tpres', y='hesc', ax=axs[2])

df.plot(kind='scatter', x='ppres', y='hesc', ax=axs[3])

plt.show()

## Explanation:

*fig, axs = plt.subplots(1, 4, sharey=True)*

It returns a tuple, which is unpacked into the two variables *fig* and *axs* using tuple-unpacking notation.

The arguments in subplots communicate the following about the output figure:

- *number_of_rows_in_figure = 1*
- *number_of_columns_in_figure = 4*
- *Should the y-axis be the same throughout the figure? Yes (sharey=True)*

*df.plot(kind='scatter', x='ptemp', y='hesc', ax=axs[1])*

It plots a scatter graph (*kind='scatter'*) on the graph in column 2 of row 1 of the figure (*ax=axs[1]*), with ptemp on the x axis and hesc on the y axis (*x='ptemp', y='hesc'*).
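To see what `plt.subplots` hands back, here is a small sketch you can run on its own. It assumes the non-interactive "Agg" backend so that no window is needed, and it uses three made-up points instead of the hydrocarbon data.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is required
from matplotlib import pyplot as plt

# One row, four columns, all sharing the same y-axis
fig, axs = plt.subplots(1, 4, sharey=True)

print(type(fig).__name__)   # Figure
print(axs.shape)            # (4,)

# axs[1] is the second column of the single row (indexing starts at 0)
axs[1].scatter([1, 2, 3], [10, 20, 30])   # made-up points
axs[1].set_title("axs[1]: row 1, column 2")

# With sharey=True the panels are expected to report the same y-limits
print(axs[0].get_ylim(), axs[3].get_ylim())

plt.close(fig)
```

Because there is only one row, `axs` is a flat array of four axes, which is why the post can index it as `axs[0]` through `axs[3]`.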

## Difference between plt.show() and plt.draw()

**plt.show():** It displays the current figure that you are working on.

**plt.draw():** It re-draws the figure. This allows you to work in interactive mode and, should you have changed your data or formatting, allows the graph itself to change.

4. Now we’ll utilize statsmodels, so let’s begin by importing it.

**Statsmodels**: *is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.*

import statsmodels.formula.api as smf

lm = smf.ols(formula='hesc ~ ttemp', data=df).fit()

## Explanation:

**lm = smf.ols(formula='hesc ~ ttemp', data=df).fit()**

The formula is written with *patsy* syntax: the dependent variable goes to the left of the ~ and the independent variable(s) to the right. Data is the DataFrame to be used.

**smf.ols** takes the formula string and the DataFrame, and returns an OLS object representing the model. The name ols stands for “ordinary least squares.”

The **fit()** method fits the model to the data and returns an object containing the regression results.
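Here is a hedged sketch of those two stages, building the model and then fitting it, kept as separate steps. The data is synthetic: I am assuming a made-up relationship of hesc = 8 + 0.4 * ttemp plus a little noise, just so there is a known answer to compare against. It is not the hydrocarbon dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): a straight line with slope 0.4 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(30, 90, 40)
demo = pd.DataFrame({
    "ttemp": x,
    "hesc": 8.0 + 0.4 * x + rng.normal(0, 1.0, x.size),
})

# Stage 1: describe the model with a formula (left of ~ is what we predict)
model = smf.ols(formula="hesc ~ ttemp", data=demo)
print(type(model).__name__)   # OLS

# Stage 2: fit it; the result object holds params, p-values, R-squared, ...
results = model.fit()
print(results.params)         # one entry for Intercept, one for ttemp
```

The fitted slope should land close to the 0.4 used to generate the data, which is a quick sanity check that the formula means what you think it means.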

5. Now, copy paste the below code line by line.

lm.params

lm.conf_int()

lm.pvalues

lm.rsquared

## Hence,

A 1-unit increase in ttemp is associated with a 0.396703 increase in hesc.

With x as ttemp and y as the predicted hesc, the fitted line is:

y = 8.153444 + 0.396703 * x

(29) 21.244643 = 8.153444 + 0.396703 * 33
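The arithmetic above can be checked in a couple of lines. This sketch simply hard-codes the intercept and slope quoted in this post; `predict_hesc` is a hypothetical helper name, not something from statsmodels.

```python
# Coefficients as reported above
intercept = 8.153444
slope = 0.396703


def predict_hesc(ttemp):
    """Evaluate the fitted line y = intercept + slope * x for one ttemp value."""
    return intercept + slope * ttemp


print(round(predict_hesc(33), 6))                      # 21.244643
# Moving ttemp up by one unit changes the prediction by exactly the slope
print(round(predict_hesc(34) - predict_hesc(33), 6))   # 0.396703
```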

6. You can get the summary by typing,

*lm.summary()*

7. Predict hesc from ttemp, then plot the predictions over the data.

xpred = df['ttemp']

preds = lm.predict(xpred)

df.plot(kind='scatter',x='index',y='hesc')

plt.plot(df['index'], preds, c='red', linewidth=2)

plt.show()

8. Predicting with more attributes.

lm2 = smf.ols(formula='hesc ~ ttemp + tpres', data=df).fit()

xpred2 = df[['ttemp','tpres']]

preds2 = lm2.predict(xpred2)

lm4 = smf.ols(formula='hesc ~ ttemp + ptemp + tpres + ppres', data=df).fit()

xpred4 = df[['ttemp','ptemp', 'tpres', 'ppres']]

preds4 = lm4.predict(xpred4)

9. Plot and display all the predictions.

df.plot(kind='scatter',x='index',y='hesc')

plt.plot(df['index'], preds, c='red', linewidth=2)

plt.plot(df['index'], preds2, c='blue', linewidth=2)

plt.plot(df['index'], preds4, c='green', linewidth=2)

plt.show()

10. Another prediction.

feature_cols = ['ttemp','ptemp','tpres','ppres']

xpred5 = df[feature_cols]

ypred5 = df.hesc

11. Now we import **Scikit-Learn.**

from sklearn.linear_model import LinearRegression

lm5 = LinearRegression()

lm5.fit(xpred5,ypred5)

12. Copy-paste the below lines one by one.

lm5.intercept_

lm5.coef_

lm5.score(xpred5,ypred5,sample_weight=None)
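Both libraries solve the same ordinary least squares problem, so on the same data their numbers should agree. Here is a minimal sketch of that check on synthetic data (the two-attribute relationship below is an assumption made up for the demo, not taken from the hydrocarbon file).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): hesc depends linearly on ttemp and tpres, plus noise
rng = np.random.default_rng(1)
n = 50
demo = pd.DataFrame({
    "ttemp": rng.uniform(30, 90, n),
    "tpres": rng.uniform(3, 7, n),
})
demo["hesc"] = 5 + 0.3 * demo["ttemp"] + 2.0 * demo["tpres"] + rng.normal(0, 1, n)

features = ["ttemp", "tpres"]

# Same model, fitted once with each library
sm_fit = smf.ols("hesc ~ ttemp + tpres", data=demo).fit()
sk_fit = LinearRegression().fit(demo[features], demo["hesc"])

same_intercept = np.isclose(sk_fit.intercept_, sm_fit.params["Intercept"])
same_coefs = np.allclose(sk_fit.coef_, sm_fit.params[features].to_numpy())
same_r2 = np.isclose(sk_fit.score(demo[features], demo["hesc"]), sm_fit.rsquared)

print(same_intercept, same_coefs, same_r2)
```

In other words, `intercept_` and `coef_` play the role of `lm.params`, and `score(...)` plays the role of `lm.rsquared`. The main practical difference is that statsmodels also gives you p-values and confidence intervals.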

13. Using scikit to plot a graph.

preds5 = lm5.predict(xpred5)

df.plot(kind='scatter',x='index',y='hesc')

plt.plot(df['index'], preds, c='red', linewidth=2)

plt.plot(df['index'], preds2, c='blue', linewidth=2)

plt.plot(df['index'], preds4, c='green', linewidth=2)

plt.plot(df['index'], preds5, c='orange', linewidth=2)

plt.show()

14. *Go ahead and experiment on your own with linear regression and see if you can get a better line model (try with one of the other attributes, 2 different attributes, 3 attributes, and 4 attributes). Watch the R-squared value as you change models. For each experiment, give the attributes used, the equation of the line, the R-squared value, and how well the model predicts.*

## Experiment:

*Which line model predicted the best? How many independent attributes are good to use when trying to predict the dependent attribute?*
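One thing to keep in mind while experimenting: for ordinary least squares, plain R-squared never goes down when you add another attribute to a model, even a useless one. Adjusted R-squared penalizes extra attributes, so it is often a fairer way to compare. The sketch below illustrates this on synthetic data, where I am assuming that only ttemp and tpres actually influence hesc. Your results on the real file will differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): ptemp and ppres are pure noise with no effect on hesc
rng = np.random.default_rng(42)
n = 32
demo = pd.DataFrame({
    "ttemp": rng.uniform(30, 90, n),
    "ptemp": rng.uniform(30, 90, n),
    "tpres": rng.uniform(3, 7, n),
    "ppres": rng.uniform(3, 7, n),
})
demo["hesc"] = 5 + 0.3 * demo["ttemp"] + 2.0 * demo["tpres"] + rng.normal(0, 2, n)

# Each formula adds one more attribute to the previous one
formulas = [
    "hesc ~ ttemp",
    "hesc ~ ttemp + tpres",
    "hesc ~ ttemp + tpres + ptemp",
    "hesc ~ ttemp + tpres + ptemp + ppres",
]

r2_values = []
for formula in formulas:
    fit = smf.ols(formula=formula, data=demo).fit()
    r2_values.append(fit.rsquared)
    print(f"{formula:40s} R2={fit.rsquared:.4f}  adj R2={fit.rsquared_adj:.4f}")
```

If adjusted R-squared stops improving (or drops) when you add an attribute, that is a hint the extra attribute is not pulling its weight.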

By doing this I hope you have a basic idea of what machine learning is like.

**You have successfully written your first machine learning program! Follow me and give me some claps if this helped you 👏👏👏 !!**