Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Case study: Diabetes classification
Number of features = 8
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function value
- Age: Age (years)
- Outcome: Target variable, 0 for no diabetes and 1 for diabetes.
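The script reads two files, diabetes-train.csv and diabetes-test.csv, and the post does not say how they were produced. One possible way to create them is sketched here, assuming the Kaggle download is a single CSV with Outcome as the last column. The function name `split_diabetes`, the file name diabetes.csv, the 80/20 ratio, and the seed are all assumptions for illustration, not the split actually used in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_diabetes(df, test_size=0.2, seed=0):
    """Return (train, test) DataFrames, stratified on the last column (Outcome)."""
    target = df.iloc[:, -1]
    train, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=target
    )
    return train, test


# Hypothetical usage, assuming the downloaded file is saved as diabetes.csv:
# df = pd.read_csv("diabetes.csv")
# train, test = split_diabetes(df)
# train.to_csv("diabetes-train.csv", index=False)
# test.to_csv("diabetes-test.csv", index=False)
```

Stratifying on the target keeps the share of diabetic and non-diabetic rows roughly the same in both files, which makes the test accuracy easier to interpret.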
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
# Load dataset
train_data = pd.read_csv("diabetes-train.csv")
test_data = pd.read_csv("diabetes-test.csv")
# Separate attributes and targets
X_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, -1].values
X_test = test_data.iloc[:, :-1].values
y_test = test_data.iloc[:, -1].values
# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Apply PCA
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Print 2 new features from PCA
#print("New feature 1 (PC1):", X_test[:, 0])
#print("New feature 2 (PC2):", X_test[:, 1])
# Print variance ratio of each component
print(pca.explained_variance_ratio_)
# Train SVM
svm = SVC(kernel='linear', random_state=0)
svm.fit(X_train, y_train)
# Make predictions on the test data
y_pred = svm.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Display the confusion matrix
print("Confusion matrix:")
print(cm)
# Print coefficients of the hyperplane
print("Coefficients of the hyperplane: ", svm.coef_)
# Plot decision boundary
x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, s=20, edgecolor='k')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('SVM Decision Boundary')
plt.show()
# Print actual and predicted values for each test data point
#print("{:<10} {:<15} {}".format('Index', 'Actual Value', 'Predicted Value'))
#for i in range(len(y_test)):
# print("{:<10} {:<15} {}".format(i+1, y_test[i], y_pred[i]))
# Save actual and predicted values to txt file
with open('actual_pred.txt', 'w') as f:
    f.write("{:<10} {:<15} {}\n".format('Index', 'Actual Value', 'Predicted Value'))
    for i in range(len(y_test)):
        f.write("{:<10} {:<15} {}\n".format(i+1, y_test[i], y_pred[i]))
print("Actual and predicted values saved to actual_pred.txt file")
# Save 2 new features to excel file
new_features = pd.DataFrame({'New feature 1 (PC1)': X_test[:, 0], 'New feature 2 (PC2)': X_test[:, 1]})
new_features.to_excel('new-features.xlsx', index=False)
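The script prints the confusion matrix but does not break it down. For a binary problem, scikit-learn puts true labels on the rows and predicted labels on the columns, so the four cells can be unpacked and turned into precision and recall. The sketch here uses toy labels invented for illustration only; they are not results from the diabetes data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only; not results from the diabetes data.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted positives, how many are real
recall = tp / (tp + fn)           # of real positives, how many were found

print(cm)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

In a medical setting like this one, recall on the diabetic class is often worth reporting next to accuracy, since a missed positive is usually the costlier error.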
Program Output
Figure: Two features produced by PCA
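To make clearer where the two PCA features come from: each one is a linear combination of the standardized original features, with weights stored in `pca.components_`. The sketch here checks this on synthetic random data, a stand-in assumed only for illustration and not the diabetes dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 columns; not the real diabetes features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# What the library returns for PC1 and PC2.
scores = pca.transform(X_std)

# The same projection done by hand: center, then multiply by the component weights.
manual = (X_std - pca.mean_) @ pca.components_.T

print(np.allclose(scores, manual))
print(pca.explained_variance_ratio_)
```

The values in `explained_variance_ratio_` show how much of the total variance each new feature keeps, so their sum indicates how much information is lost by reducing 8 features to 2.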
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Figure: Standardized feature values
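As a rough illustration of what the three lines above do, `StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation, both learned from the training data only. The test data is then scaled with those same training statistics. The small arrays here are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration; two features, three training rows.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test_demo = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(X_train_demo)

# Manual version: z = (x - mean) / std, using training statistics.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (X_test_demo - mean) / std

print(scaler.transform(X_test_demo))
print(manual_test)
```

Calling `fit_transform` on the training set and only `transform` on the test set, as the script does, avoids leaking test-set statistics into the model.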