6. AI AND MACHINE LEARNING VTU LAB

MACHINE LEARNING VTU LAB – Naive Bayesian Classifier (using API)

Program 6. ASSUMING A SET OF DOCUMENTS THAT NEED TO BE CLASSIFIED, USE THE NAÏVE BAYESIAN CLASSIFIER MODEL TO PERFORM THIS TASK. BUILT-IN JAVA CLASSES/API CAN BE USED TO WRITE THE PROGRAM. CALCULATE THE ACCURACY, PRECISION, AND RECALL FOR YOUR DATA SET.

Program Code – lab6.py

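import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the documents; names= supplies the column names, so the file is read without a header row
msg = pd.read_csv('document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])

# Encode the class labels as integers: pos -> 1, neg -> 0
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})

X = msg.message
y = msg.labelnum

# Random split (25% of the rows are held out by default), so results vary between runs
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

# Learn the vocabulary from the training documents only, then reuse it for the test documents
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)

# Show the first rows of the document-term matrix
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])

# Train the classifier on word counts and predict the held-out documents
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)

# pred holds the predictions for Xtest, so pair each test document with its prediction
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))

print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))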
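
The same steps can also be chained in a scikit-learn Pipeline, so that vectorizing and classifying happen in a single fit/predict call. The sketch below is only an illustration: the eight sentences and their labels are made-up stand-ins for document.csv, and test_size, random_state, stratify and zero_division are choices made here so the tiny example is repeatable. None of this is part of the lab program itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data (NOT the contents of document.csv)
docs = [
    "I love this sandwich",
    "This is an awesome place",
    "What a great holiday",
    "I feel very good today",
    "I do not like this restaurant",
    "I am tired of this stuff",
    "This juice tastes bad",
    "My boss is horrible",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = pos, 0 = neg

# Repeatable split that keeps one document of each class in the test set
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# CountVectorizer and MultinomialNB are fitted together on the training documents
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
pred = model.predict(X_test)

# zero_division=0 avoids a warning if no document is predicted positive
print("Accuracy: ", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, zero_division=0))
print("Recall:   ", recall_score(y_test, pred, zero_division=0))
```

With only two test documents the scores are not meaningful; the point is that the pipeline applies the vocabulary learned from the training documents to the test documents automatically.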

MACHINE LEARNING Program Execution – lab6.ipynb

Jupyter Notebook program execution.

import pandas as pd
msg = pd.read_csv('document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})

Total Instances of Dataset: 18

X = msg.message
y = msg.labelnum
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
from sklearn.feature_extraction.text import CountVectorizer

count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
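
train_test_split shuffles the rows at random and, by default, holds out 25% of them, so the 18 documents are divided into 13 training and 5 test documents and the metrics change from run to run. A minimal sketch of a repeatable, class-balanced split follows; the placeholder documents, the 9/9 label balance and the random_state value are assumptions made purely for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical placeholders standing in for the 18 rows of document.csv
docs = [f"document {i}" for i in range(18)]
labels = [1] * 9 + [0] * 9

# random_state fixes the shuffle; stratify keeps both classes represented in each part
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))  # 13 5
print(Counter(y_test))            # class counts in the test set
```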

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])

   about  am  an  and  awesome  bad  beers  best  boss  can  ...  tired  to  \
0      0   1   0    1        0    0      0     0     0    0  ...      1   0
1      0   0   0    0        0    0      0     0     0    0  ...      0   0
2      0   0   0    0        0    0      0     0     0    0  ...      0   0
3      0   0   0    0        0    0      0     0     0    1  ...      0   0
4      0   0   0    0        0    0      0     0     0    0  ...      0   0

   today  tomorrow  very  we  went  will  with  work
0      0         0     0   0     0     0     0     0
1      0         0     0   0     0     0     0     0
2      0         0     0   0     0     0     0     0
3      0         0     0   0     0     0     1     0
4      0         0     0   0     0     0     0     0

[5 rows x 49 columns]

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
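
MultinomialNB models each class by how often every word occurs in that class's training documents and assigns a new document to the class with the highest posterior probability. The following is a simplified NumPy sketch of that standard model with add-one (Laplace) smoothing, which corresponds to the default alpha=1.0. It is not scikit-learn's actual code, and the three-word vocabulary, the counts and the function names are invented for illustration.

```python
import numpy as np


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from a count matrix."""
    classes = np.unique(y)
    # log P(c): fraction of training documents in each class
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of every word per class, plus alpha so no probability is zero
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    # log P(w | c): each row is normalised to sum to 1 before taking the log
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood


def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    # Score per class: log P(c) + sum over words of count(w, d) * log P(w | c)
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]


# Hypothetical counts for the vocabulary [good, bad, food]
X_train = np.array([[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]])
y_train = np.array([1, 1, 0, 0])          # 1 = pos, 0 = neg
X_new = np.array([[1, 0, 0], [0, 1, 0]])  # "good" and "bad"

params = fit_multinomial_nb(X_train, y_train)
nb_pred = predict_multinomial_nb(X_new, *params)
print(nb_pred)  # [1 0]
```

Working in log space turns the product of many small word probabilities into a sum, which avoids numerical underflow on long documents.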

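# pred holds the predictions for Xtest, so pair each test document with its prediction
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))

Output of the recorded run. That run looped over Xtrain instead of Xtest, so the sentences below are the first five training documents, while the labels next to them are the predictions for the five test documents:

I am sick and tired of this place -> pos
I do not like the taste of this juice -> neg
I love this sandwich -> neg
I can’t deal with this -> pos
I do not like this restaurant -> neg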

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))

Accuracy Metrics:

Accuracy: 0.6
Recall: 0.5
Precision: 1.0
Confusion Matrix:
[[1 0]
 [2 2]]
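
The three scores follow directly from the confusion matrix. In scikit-learn's layout the rows are the actual classes and the columns are the predicted classes, ordered 0 (neg) then 1 (pos). As a check, the arithmetic for the run above can be sketched in plain Python:

```python
# Confusion matrix printed by the run above
cm = [[1, 0],
      [2, 2]]

# Row 0 = actual neg, row 1 = actual pos; column 0 = predicted neg, column 1 = predicted pos
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # predicted pos that really are pos
recall = tp / (tp + fn)                      # actual pos that were found

print(accuracy, precision, recall)  # 0.6 1.0 0.5
```

Precision is 1.0 because no negative document was labelled positive, while recall is only 0.5 because two of the four positive documents were missed.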

Alternative – alt lab6.ipynb

from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import numpy as np

categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories,shuffle=True)
twenty_test = fetch_20newsgroups(subset='test',categories=categories,shuffle=True)

print(len(twenty_train.data))
print(len(twenty_test.data))
print(twenty_train.target_names)
print("\n".join(twenty_train.data[0].split("\n")))
print(twenty_train.target[0])

2257
1502
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format. We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance. Michael.

Michael Collier (Programmer) The Computer Unit,
Email: M.P.Collier@uk.ac.city The City University,
Tel: 071 477-8000 x3769 London,
Fax: 071 477-8565 EC1V 0HB.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_tf = count_vect.fit_transform(twenty_train.data)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_tf)
X_train_tfidf.shape

(2257, 35788)
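
TfidfTransformer rescales the raw counts so that words occurring in many documents carry less weight than words specific to a few. scikit-learn also provides TfidfVectorizer, which combines the counting and reweighting steps in one object. The sketch below checks, on three made-up sentences (an assumption, not the newsgroup data), that both routes produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical mini-corpus used only to compare the two approaches
docs = [
    "the printer driver converts images",
    "the doctor prescribed medicine",
    "the images show the printer",
]

# Two steps: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(docs)

print(two_step.shape)                                        # (3, 9)
print(np.allclose(two_step.toarray(), one_step.toarray()))   # True
```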

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn import metrics
mod = MultinomialNB()
mod.fit(X_train_tfidf, twenty_train.target)
X_test_tf = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_tf)
predicted = mod.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(twenty_test.target, predicted))
print(classification_report(twenty_test.target,predicted,target_names=twenty_test.target_names))
print("confusion matrix is \n",metrics.confusion_matrix(twenty_test.target, predicted))

Accuracy: 0.8348868175765646
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502

confusion matrix is
[[192   2   6 119]
 [  2 347   4  36]
 [  2  11 322  61]
 [  2   2   1 393]]
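
With more than two classes, precision and recall are computed per class from the same matrix: the diagonal holds the correctly classified documents, each column sum is the number of documents predicted as that class, and each row sum is the number that actually belong to it. A short NumPy sketch using the matrix printed above reproduces the per-class figures of the classification report:

```python
import numpy as np

# Confusion matrix from the run above (rows = actual class, columns = predicted class)
cm = np.array([
    [192,   2,   6, 119],
    [  2, 347,   4,  36],
    [  2,  11, 322,  61],
    [  2,   2,   1, 393],
])
names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

true_pos = np.diag(cm)                    # correctly classified documents per class
precision = true_pos / cm.sum(axis=0)     # divide by column sums (predicted per class)
recall = true_pos / cm.sum(axis=1)        # divide by row sums (actual per class)
accuracy = true_pos.sum() / cm.sum()      # all correct / all documents

for name, p, r in zip(names, precision, recall):
    print(f"{name:>22}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The 119 alt.atheism posts that were labelled soc.religion.christian explain both the low recall of the first class and the low precision of the last one.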
