Tag: classification

Kaggle – How to Use Logistic Regression on the Iris Dataset

kanakorn.h

July 19, 2018
The Iris dataset is often used as a starting point for learning data science tools, especially classification, because it is simple: 4 fields serve as features and 1 field is the class (with 3 categories).
1. Start with New Kernel
2. Here, choose Notebook
3. Then choose Add Dataset, either from the datasets provided or by uploading your own
4. The data will then be available under ../input/; in this case the file is ../input/iris.data
  For the starter code provided, press Shift+Enter or click the Run icon on the left to execute it and see its output
5. Next, write the code, starting by importing the packages needed
```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```
6. Create the variable iris by reading the data from the file
```
iris = pd.read_csv('../input/iris.data')
```
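One thing worth checking at this step: the copy of iris.data published by the UCI repository has no header row, while the later steps refer to columns such as species and sepal_length. If your file is the raw UCI one, a sketch like the following supplies the names explicitly. The column names are taken from the later steps of this post, and the three inline sample rows are only a stand-in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Column names used later in this post; the raw UCI file does not contain them.
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Stand-in for '../input/iris.data' so the example is self-contained.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# header=None stops pandas from treating the first data row as column names.
iris = pd.read_csv(sample, header=None, names=columns)
print(iris.shape)
print(iris['species'].unique())
```

If your file already has a header row with these names, the plain pd.read_csv call is enough.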
7. Explore the data
  iris.head()
  iris.info()
  iris.describe()
8. Try some basic data visualization with pairplot, coloring the points by species
```
sns.pairplot(iris, hue='species')
```
  Or view it as a scatter plot
```
plt.scatter(iris['sepal_length'], iris['sepal_width'], marker='.', color='r')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
```
9. Next, split the data into two parts, one for Train and one for Test
```
from sklearn.model_selection import train_test_split
X = iris.drop(['species'], axis=1)
Y = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.3)
```
10. Then train the model
```
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
```
11. Then check the model's accuracy (Model Evaluation)
```
prediction = model.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
```
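The evaluation step imports confusion_matrix, classification_report and accuracy_score, and the whole flow could be sketched end to end roughly as follows. Several things here are assumptions rather than part of the original notebook: the data comes from sklearn's bundled copy of iris instead of ../input/iris.data, and random_state=42, stratify=Y and max_iter=1000 are added only to make the run repeatable and to avoid convergence warnings.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Build a DataFrame with the same column names used in this post (assumed layout).
raw = load_iris()
iris = pd.DataFrame(
    raw.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
iris['species'] = raw.target_names[raw.target]

X = iris.drop(['species'], axis=1)
Y = iris['species']

# 70% train / 30% test; stratify keeps the three species balanced in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
prediction = model.predict(X_test)

# Rows of the confusion matrix are true species, columns are predicted species.
cm = confusion_matrix(y_test, prediction)
acc = accuracy_score(y_test, prediction)
print(cm)
print(classification_report(y_test, prediction))
print(acc)
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy is simply the diagonal sum divided by the number of test rows.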
The steps are not difficult. Which model to choose and what to do with it depends on the details, which we will look at later.
Using a Google Datalab Notebook on Dataproc to Build a Basic Machine Learning Model

kanakorn.h

July 16, 2018
This continues from the post Creating a Hadoop and Spark Cluster for Data Science with Google Cloud Dataproc + Datalab
1. In Google Cloud Datalab, click Notebook, then name it Demo01
  You can choose either Python2 or Python3; here we choose Python3
2. Check the version of Spark in use with the command
```
spark.version
```
  Then press Shift+Enter to run it
3. You can send commands to the shell, which is Linux, by prefixing them with !
  Here we download the iris dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data onto the mycluster-m machine with the command
```
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
```
  Then put it into HDFS with the command
```
! hdfs dfs -put iris.data /
```
  The output looks roughly like this
4. The earlier post Machine Learning #01 – Python with iris dataset used sklearn; here we switch to Spark MLlib so that the Spark cluster can be put to use. Start by importing the required libraries:
```
# Import libraries
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
```
5. Next, create a Spark DataFrame (the concept is similar to pandas, but with more detail involved)
```
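# Read the headerless CSV from HDFS; Spark names the columns _c0 .. _c4
csvFile = spark.read.csv('/iris.data', inferSchema=True)
# Map species names to numeric strings in column _c4 (replacements come from the dict)
diz = {"Iris-setosa":"1", "Iris-versicolor":"2", "Iris-virginica":"3" }
df = csvFile.na.replace(diz,1,"_c4")
# Cast the label to an integer and give the feature columns readable names
df2 = df.withColumn("label",df["_c4"].cast(IntegerType())) \
    .withColumnRenamed("_c0","sepal_length") \
    .withColumnRenamed("_c1","sepal_width") \
    .withColumnRenamed("_c2","petal_length") \
    .withColumnRenamed("_c3","petal_width")
# Randomly split the rows into roughly 75% train and 25% test
train,test = df2.randomSplit([0.75,0.25])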
```
  First, the spark session (spark) reads the CSV file from HDFS at /iris.data, with data types inferred automatically (inferSchema=True). This file has no header.
  
  The dataset has 5 columns. When Spark reads it, the columns are named _c0, _c1, _c2, _c3, _c4, where _c4 is the label for the type of iris, stored as a String: Iris-setosa, Iris-versicolor, Iris-virginica. The Logistic Regression step that follows cannot take String data as input, so “Iris-setosa” is replaced with “1”, and then “1”, which is still a String, is converted to an Integer with the cast function and stored in a column named “label”.
  
  Next, the columns _c0, _c1, _c2, _c3 are renamed as desired.
  
  Finally, randomSplit([0.75, 0.25]) splits the data into roughly 75% for train and 25% for test.
6. Display the schema
```
df2.printSchema()
```
  The result is as follows
  
  Use this command to view the data
```
df2.show()
```
  The result looks roughly like this
7. Spark 2.x introduces the concept of a Pipeline, which makes it possible to design experiments, tune the model's hyperparameters, and work more systematically (no values are tuned at this step)
```
# Model
assembler = VectorAssembler(
inputCols=["sepal_length","sepal_width","petal_length","petal_width"],
outputCol="features")
lr = LogisticRegression()
paramGrid = ParamGridBuilder().build()

#Pipeline
pipeline = Pipeline(stages=[assembler, lr])
```
  To use Logistic Regression, a features field must be defined. Here it is built from the columns sepal_length, sepal_width, petal_length, petal_width; the label was already defined in the previous step.
  
  Then lr is created as an instance of LogisticRegression.
  
  Parameters to tune would go into ParamGridBuilder, which is not covered at this stage.
  
  Finally, assembler and lr are added as stages. This approach makes it more convenient to repeat the steps in a Pipeline (the benefit only becomes apparent with a more complex workflow).
8. The key step: with the pipeline ready, build the model by training it on the “train” dataset
```
model = pipeline.fit(train)
predictions = model.transform(train)
```
  The resulting model is then used to make predictions with transform() on the train data; the output is what the model predicts
9. Next, check how well the model performs. Here we use MulticlassClassificationEvaluator because the label has more than 2 classes
```
evaluator=MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label")
```
  First, score the predictions that the model made on the train data
```
evaluator.evaluate(predictions)
```
  then check how well it does on the test data
```
evaluator.evaluate(model.transform(test))
```
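10. The result looks roughly like this
  The score is 0.9521…, or about 95.21%. Note that MulticlassClassificationEvaluator reports the F1 score unless metricName is set, so this figure is an F1 score rather than plain accuracy.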
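When reading the number from evaluate(), it helps to know which metric it is. Spark's MulticlassClassificationEvaluator defaults to metricName="f1", a support-weighted F1 score, and accuracy is available by passing metricName="accuracy". The plain-Python sketch below shows how the two can differ. The confusion matrix is made up for illustration and is not the output of the model above, and the helper functions are hypothetical, written only for this example.

```python
# Hypothetical 3-class confusion matrix: rows = true label, columns = predicted label.
confusion = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]


def accuracy(cm):
    """Fraction of all rows that were predicted correctly."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def weighted_f1(cm):
    """Per-class F1, averaged with each class weighted by its true count."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # rows whose true label is k
        predicted = sum(cm[i][k] for i in range(n))  # rows predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score


print(accuracy(confusion))
print(weighted_f1(confusion))
```

On this made-up matrix the two values are close but not equal, which is why it is worth stating which metric a reported score refers to.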
Machine Learning #01 – Python with iris dataset

kanakorn.h

September 26, 2017
This article shows how to build a machine learning workflow in Python using the iris dataset, from loading the data and building a model through cross validation and measuring accuracy to putting the model to use.

To make learning easier, we will use Anaconda, a Python data science platform that bundles the tools and libraries needed for development. Download the version that matches your operating system from https://www.anaconda.com/download/

All of the examples mentioned in this article can be cloned from the repository at https://github.com/nagarindkx/pythonml

Using jupyter-notebook is recommended for convenience while learning.

This article uses the notebook: 01 – SVM with iris dataset.ipynb

Start by importing the “iris” dataset from SciKit

It is a good example dataset for building a system that predicts the type of flower from the widths and lengths of the parts of an Iris flower (see https://en.wikipedia.org/wiki/Iris_flower_data_set for details). It consists of measurements of the width and length of the sepals and petals of “iris” flowers (sepal width, sepal length, petal width, petal length) across 3 species.

Image Source: https://en.wikipedia.org/wiki/Iris_flower_data_set

The iris dataset is often used to start learning how to build machine learning for classification. This example uses a Support Vector Machine (SVM). Once the model has been built and trained, it can be used to classify the species: given the widths and lengths described above, the system answers with the species.

To begin, we use the iris dataset that ships with SciKit (sklearn), a machine learning package for Python (already installed as part of Anaconda).

Load the data
```
from sklearn import datasets
iris = datasets.load_iris()
```
Explore the data
```
print(iris.data)
print(iris.target)
print(iris.data.shape)
print(iris.target.shape)
```
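Besides the raw arrays, the object returned by load_iris also carries the column and class names, which makes the printed numbers easier to read. A small sketch of that, where the per-class count with np.bincount is an addition for illustration:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measurement columns and of the three classes.
print(iris.feature_names)
print(iris.target_names)

# Number of samples per class; target holds the integer codes 0, 1, 2.
counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, counts)))
```

The integer codes in iris.target index into iris.target_names, which is what the later step relies on when it shows predictions by name.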
Using SVM (Support Vector Machine)

Create an SVC (Support Vector Classification) and train it with the fit method, passing in data and target
```
from sklearn import svm
clf = svm.SVC()
clf.fit(iris.data, iris.target)
```
The result is
```
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
```
Try making a prediction

with the predict method, passing in an array of data
```
print(clf.predict([[ 6.3 , 2.5, 5., 1.9]]))
```
The system answers with
```
[2]
```
To show the result as the name of the target

change the fit step as follows
```
clf.fit(iris.data, iris.target_names[ iris.target])
print(clf.predict([[ 6.3 , 2.5, 5., 1.9]]))
```
The result is
```
['virginica']
```
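An alternative, if you would rather keep training on the integer targets, is to translate the predictions afterwards through iris.target_names. A minimal sketch, assuming gamma='auto' to match the settings printed earlier (newer sklearn releases default to gamma='scale'):

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# Train on the integer class codes, as in the first fit call.
clf = svm.SVC(gamma='auto')
clf.fit(iris.data, iris.target)

# predict returns integer codes; indexing target_names turns them into names.
pred = clf.predict([[6.3, 2.5, 5.0, 1.9]])
names = iris.target_names[pred]
print(pred, names)
```

Keeping integer labels in the model and mapping to names only for display can be convenient when other code expects numeric classes.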
Cross Validation

Split the data into two parts, Train and Test, for both X and Y (a simple hold-out split), then use the “fit” function to train
```
from sklearn.model_selection import train_test_split
x_train,x_test,y_train, y_test = train_test_split( iris.data, iris.target, test_size=0.4 , random_state=0)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

clf.fit(x_train, y_train)
```
The result is

(90, 4) (60, 4) (90,) (60,)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

Test the accuracy

by running the Test portion of the data through the model with the “score” function

print(clf.score(x_test, y_test))

The result is

0.95
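The score above comes from a single train/test split, so it depends on which rows happened to land in the test part. k-fold cross-validation repeats the split several times and averages the results, and sklearn offers it through cross_val_score. A minimal sketch, where cv=5 and gamma='auto' are choices made for this illustration rather than settings from the notebook:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(gamma='auto')

# Five folds: each fold is held out once while the other four are used to train.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())
```

The mean of the fold scores is usually a steadier estimate than one hold-out number, at the cost of training the model once per fold.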

Put the trained model to further use

using pickle, that is, serialization

import pickle

pickle.dump(clf, open("myiris.pickle","wb"))

This produces the file “myiris.pickle”, which can then be used elsewhere.
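Using the saved file later means loading it back with pickle.load and calling predict on the restored object. A sketch of the round trip, written to a temporary directory so that it cleans up after itself; the temporary path and gamma='auto' are assumptions made for this example:

```python
import os
import pickle
import tempfile

from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(gamma='auto').fit(iris.data, iris.target)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myiris.pickle")

    # Serialize the trained classifier to disk.
    with open(path, "wb") as f:
        pickle.dump(clf, f)

    # Load it back, as another script or service would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model predicts without being trained again.
print(restored.predict([[6.3, 2.5, 5.0, 1.9]]))
```

Pickle files can execute code when loaded, so only load files from a source you trust, and load them with an sklearn version compatible with the one that wrote them.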
The next article will cover using this model through django REST Framework