This document summarizes a machine learning project that predicts which Titanic passengers survived. It describes the data used for modeling, including passenger details such as name, sex, age, and class. Several machine learning models, including XGBoost, SVC, RandomForest, and LogisticRegression, are evaluated on training and test sets. After max_depth is tuned to alleviate overfitting, the best model predicts survival with 87.4% accuracy on the test set. The document closes with lessons learned about using Kaggle for machine learning projects and about combining artificial and human intelligence.
3. 3
Train (example record)
PassengerId  195
Survived     1
Pclass       1
Name*        Brown, Mrs. James Joseph (Margaret Tobin)
Sex          female
Age          44
SibSp        0
ParCh        0
Ticket**     PC 17610
Fare         27.7208
Cabin***     B4
Embarked     C

Test (example record)
PassengerId  972
Survived     (need to predict)
Pclass       3
Name*        Boulos, Master. Akar
Sex          male
Age          6
SibSp        1
ParCh        1
Ticket**     2678
Fare         15.2458
Cabin***     (missing)
Embarked     C

Goal
PassengerId  Survived
892          1/0
893          1/0
...          ...
1309         1/0
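The Goal layout above is a two-column file with one row per test passenger. A minimal sketch of writing such a file with the standard library follows. The function name `write_submission` and the three example predictions are hypothetical placeholders, not results from the project.

```python
import csv
import io


def write_submission(predictions, out):
    """Write PassengerId/Survived rows, sorted by PassengerId.

    `predictions` maps PassengerId -> 0 or 1 (hypothetical input shape).
    `out` is any writable text stream, e.g. an open file.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in sorted(predictions.items()):
        if survived not in (0, 1):
            raise ValueError(f"Survived must be 0 or 1, got {survived!r}")
        writer.writerow([passenger_id, survived])


# Placeholder predictions for illustration only (not real model output).
example_predictions = {893: 1, 892: 0, 894: 0}
buffer = io.StringIO()
write_submission(example_predictions, buffer)
submission_text = buffer.getvalue()
```

To produce a file on disk, the same function can be passed a stream opened with `open(path, "w", newline="")`.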
Photo: Margaret Brown, played by Kathy Bates in the movie Titanic
*Title can be extracted from Name.
**Ticket is not informative and is not used.
***Cabin is mostly missing and is not used.
Missing Age and Fare values are replaced with the median; missing Embarked values are replaced with the mode.
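The two preprocessing notes above (extracting a title from Name, and filling missing values with the median or mode) can be sketched with the standard library alone. This is an illustration under assumptions, not the project's actual code: the helper names are hypothetical, missing values are assumed to be represented as `None`, and the title is assumed to be the text between the comma and the first period.

```python
import re
from statistics import median, mode

# Assumption: names look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title found after the comma, or None if the pattern is absent."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


def impute_median(values):
    """Replace None entries with the median of the known numeric values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None entries with the most common known value."""
    known = [v for v in values if v is not None]
    fill = mode(known)
    return [fill if v is None else v for v in values]


# Names taken from the two example records; the other lists are toy inputs.
titles = [extract_title(n) for n in (
    "Brown, Mrs. James Joseph (Margaret Tobin)",
    "Boulos, Master. Akar",
)]
ages = impute_median([44, 6, None])
ports = impute_mode(["C", "C", None, "S"])
```

With a DataFrame-based workflow the same steps would typically be expressed as column operations, but the logic is the same: compute the fill value from the known entries, then substitute it for the gaps.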
4. 4
Embarked: port of embarkation (S: Southampton, C: Cherbourg, Q: Queenstown)
SibSp: number of siblings or spouses aboard
ParCh: number of parents or children aboard
Family Size = SibSp + ParCh + 1
Is Alone = 1 if Family Size = 1
           0 if Family Size > 1
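The two derived features above can be written as small functions. This is a minimal sketch; the function names are illustrative, and the inputs are the SibSp and ParCh counts of a single passenger.

```python
def family_size(sibsp, parch):
    """Family Size counts siblings/spouses, parents/children, and the passenger."""
    return sibsp + parch + 1


def is_alone(sibsp, parch):
    """Return 1 when the passenger travels without family, else 0."""
    return 1 if family_size(sibsp, parch) == 1 else 0


# The two example records: PassengerId 195 (SibSp 0, ParCh 0)
# and PassengerId 972 (SibSp 1, ParCh 1).
brown = (family_size(0, 0), is_alone(0, 0))
boulos = (family_size(1, 1), is_alone(1, 1))
```

For the example records, the train passenger has a family size of 1 and is flagged as alone, while the test passenger has a family size of 3 and is not.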
10. 10
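Before Tuning:
Training Score = 89.5%
Test Score = 82.05%
After Tuning:
(Best max_depth = 4)
Training Score = 89.4%
Test Score = 87.4%
Tuning max_depth alleviates the overfitting: the gap between training and test scores narrows from 7.45 to 2.0 percentage points.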
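The search for the best max_depth can be sketched as a loop over candidate depths that keeps the one with the highest held-out score. This is a generic illustration, not the project's code: `tune_max_depth` and the `evaluate` callable are hypothetical, and `evaluate(depth)` is assumed to train a model with that depth and return a `(training_score, held_out_score)` pair.

```python
def tune_max_depth(candidate_depths, evaluate):
    """Try each depth and return (best_result, all_results).

    Each result records the training score, the held-out score, and the
    gap between them; a large gap suggests overfitting.
    Ties on the held-out score go to the first candidate listed.
    """
    results = []
    for depth in candidate_depths:
        train_score, held_out_score = evaluate(depth)
        results.append({
            "max_depth": depth,
            "train": train_score,
            "held_out": held_out_score,
            "gap": train_score - held_out_score,
        })
    best = max(results, key=lambda r: r["held_out"])
    return best, results
```

In practice `evaluate` would wrap the model fit, for example constructing the classifier with the given max_depth, fitting it on the training split, and scoring both splits. Scoring on a validation split or with cross-validation, rather than on the final test set, keeps the reported test score an honest estimate.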
11. 11
• Kaggle is a convenient platform to study and practice machine learning.
• Python code can be executed directly on the host server from the browser.
• Numerous datasets are provided on the site, including training and test data.
• Once the prediction file is submitted, a score is returned to evaluate your model.
• Many developers share runnable code with detailed explanations.
• Applying artificial intelligence blindly without human intelligence is dangerous.
• Some ML models can be too complicated, leading to overfitting.
• Some ML models can perform worse than a simple hand-made model.
• Combining AI and human logic can make the analytical process enjoyable and reliable.
Python code for the project on Kaggle: https://www.kaggle.com/dingli/titanic-survivor-prediction-machine-learning