import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Introduction
Welcome to the Titanic Survival Prediction project, a classic challenge that serves as the perfect starting point for anyone looking to dive into the world of Machine Learning and data competitions on Kaggle. This project involves building a predictive model to determine which passengers survived the infamous Titanic disaster, a task that will guide you through essential steps in data science, from feature engineering to model development and evaluation.
This competition is designed to help you get comfortable with the Kaggle platform and machine learning workflows. You’ll be using Python to explore the data, perform feature engineering, visualize key trends, and develop a predictive model that can accurately classify survivors. For more details, visit the competition page and check out Kaggle’s YouTube video for a comprehensive introduction.
- Import Data: Load the Titanic dataset to begin the exploration and analysis.
- Feature Engineering: Transform raw data into meaningful features that improve model performance.
- Data Visualization: Analyze and visualize the data to uncover patterns and insights.
- Model Development: Build and train machine learning models to predict passenger survival.
- Model Testing: Evaluate model accuracy and fine-tune parameters to optimize results.
- Prediction and Submission: Generate survival predictions and submit them to the Kaggle leaderboard.
Embark on this journey to not only enhance your data science skills but also understand the power of predictive modeling in real-world scenarios.
A. Import Data
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
B. Dataset exploration:
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
B.1. Types of Variables
# Find categorical variables
categorical = [var for var in train.columns if train[var].dtype=='O']
print('There are {} categorical variables'.format(len(categorical)))
There are 5 categorical variables
# Find numerical variables
numerical = [var for var in train.columns if train[var].dtype!='O']
print('There are {} numerical variables'.format(len(numerical)))
There are 7 numerical variables
Viewing the Categorical terms :
data = [train,test]
for dataset in data:
    #Filter categorical variables
    categorical_columns = [x for x in dataset.dtypes.index if dataset.dtypes[x]=='object']
    # Exclude ID cols and source:
    categorical_columns = [x for x in categorical_columns if x not in ['PassengerId','Ticket','Name','Cabin']]
#Print frequency of categories
for col in categorical_columns:
    print ('\nFrequency of Categories for variable %s'%col)
    print (train[col].value_counts())
Frequency of Categories for variable Sex
Sex
male 577
female 314
Name: count, dtype: int64
Frequency of Categories for variable Embarked
Embarked
S 644
C 168
Q 77
Name: count, dtype: int64
B.2. Detecting Missing Values
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
train.isnull().mean()
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.198653
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 0.771044
Embarked 0.002245
dtype: float64
Missing Data Overview
The train dataset has 12 features, with missing values observed in the following features:
- Age: Missing in 19.87% of the records
- Cabin: Missing in 77.10% of the records
- Embarked: Missing in 0.22% of the records
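The counts and proportions above can be combined into a single summary table. This is a minimal sketch on a small stand-in frame named demo (hypothetical, not the real Titanic data); in the notebook the same calls would apply to train.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training frame; not the real Titanic data.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# One row per column: missing count and missing percentage.
missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),
    "pct_missing": (demo.isnull().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```

Sorting by percentage puts the most incomplete features at the top, which is usually where imputation decisions start.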
Analysis and Assumptions About Missing Data
Cabin
The Cabin feature has the highest proportion of missing values (77.10%). This substantial amount of missing data might suggest that:
- For many individuals who did not survive, the cabin information was not recorded or available.
- Survivors, on the other hand, may have been able to provide this information.
The missingness here could be due to the nature of the records or circumstances surrounding the individuals who did not survive, making this data likely to fall into the Missing Not At Random (MNAR) category. This means the missingness is related to the unobserved value itself or other factors not accounted for.
Age
The Age feature has missing values in about 20% of the records (19.87%). This could be due to:
- Missing age information for individuals who did not survive.
- Survivors possibly being able to provide their age when asked.
This type of missing data might also be categorized as Missing Not At Random (MNAR) if the likelihood of missing data is related to whether the individual survived or other unobserved factors.
Embarked
The Embarked feature has a very small proportion of missing values (0.22%). This is a very minor amount and is likely due to random occurrences.
Such a small percentage of missing data is often considered Missing Completely At Random (MCAR), meaning the missingness is unrelated to the observed or unobserved data.
Summary
- Cabin and Age features likely fall into the MNAR category due to possible relationships between missingness and other factors like survival status.
- The Embarked feature’s missing values are likely MCAR, as the missingness appears random and does not correlate with other data aspects.
B.3. Outliers detection
plt.figure(figsize=(8,6))
plt.subplot(1, 2, 1)
fig = train.boxplot(column='Age')
fig.set_title('')
fig.set_ylabel('Age')

plt.subplot(1, 2, 2)
fig = train.boxplot(column='Fare')
fig.set_title('')
fig.set_ylabel('Fare')
Text(0, 0.5, 'Fare')
plt.figure(figsize=(8,6))

plt.subplot(1, 2, 1)
fig = train.Age.hist(bins=20)
fig.set_ylabel('Number of passengers')
fig.set_xlabel('Age')

plt.subplot(1, 2, 2)
fig = train.Fare.hist(bins=20)
fig.set_ylabel('Number of passengers')
fig.set_xlabel('Fare')
Text(0.5, 0, 'Fare')
B.3. Analyzing the Embarked feature
train[train.Embarked.isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
The Embarked feature, which records the port of embarkation, is missing for only two passengers (0.22% of the records). Both rows share the same ticket (113572), cabin (B28), and fare (80.0), so the two women were evidently travelling together.
Possible Reasons for Missing Values
- Consistency Among Passengers: Because the two passengers share the same ticket, cabin, and fare, it is unlikely that the missing Embarked data is due to discrepancies between their records. Passengers with identical ticket and cabin information would typically have the same embarkation port.
- Data Generation During Dataset Construction: The missing Embarked values could have resulted from the data entry or construction process. For example, if the data was entered manually, some records might have been left incomplete through errors or omissions during data preparation.
Nature of Missing Data
Given that the missing values in the Embarked feature are minimal and appear to be random rather than systematic, we can categorize this missing data as:
- Missing Completely At Random (MCAR): The missingness of the Embarked data is likely unrelated to both the values of the Embarked feature itself and any other features in the dataset. With only two affected rows there is no discernible pattern, and the omissions are most plausibly random errors in data entry or processing.
In summary, the missing values in the Embarked feature appear to be random and not indicative of any underlying pattern in the rest of the data, which supports classifying them as MCAR.
B.4. Analyzing Cabin feature
train['cabin_null'] = np.where(train.Cabin.isnull(),1,0)
train.groupby(['Survived'])['cabin_null'].mean()
Survived
0 0.876138
1 0.602339
Name: cabin_null, dtype: float64
The figures above show that Cabin is missing more often for passengers who did not survive (87.6%) than for those who survived (60.2%).
There is a systematic loss of data: people who did not survive tend to have more information missing. Presumably, the way the information was gathered contributes to these missing values.
B.5. Analyzing the Age feature
train['age_null'] = np.where(train.Age.isnull(),1,0)
train.groupby(['Survived'])['age_null'].mean()
Survived
0 0.227687
1 0.152047
Name: age_null, dtype: float64
The same pattern holds for Age, though less strongly: it is missing for 22.8% of non-survivors versus 15.2% of survivors. Again, the way the information was gathered presumably contributes to these missing values.
B.6. Analyzing the Fare feature
The distribution of Fare is skewed, so in principle we shouldn't flag outliers using the mean plus or minus 3 standard deviations, a rule that assumes the data are normally distributed.
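For a skewed variable, one common alternative is the interquartile-range (IQR) rule. The sketch below illustrates it on made-up fares and is not the notebook's method; the helper name iqr_bounds and the multiplier of 1.5 are assumptions (a multiplier of 3 is sometimes used to flag only extreme outliers).

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up, right-skewed fares for illustration only.
fares = pd.Series([7.25, 7.92, 8.05, 13.0, 26.0, 31.0, 53.1, 71.28, 512.33])

lower, upper = iqr_bounds(fares)

# Anything outside the fences is flagged as a candidate outlier.
outliers = fares[(fares < lower) | (fares > upper)]
print(lower, upper)
print(outliers)
```

Because the fences are built from quartiles, a handful of very large fares does not stretch them the way it would stretch a mean and standard deviation.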
total_passengers = float(train.shape[0])

print('Total number of passengers: {}'.format(train.shape[0]))
print('Passengers that paid more than 65: {:.2f}%'.format(
    (train[train.Fare > 65].shape[0] / total_passengers) * 100))
print('passengers that paid more than 100: {} %'.format((
    train[train.Fare > 100].shape[0]/ total_passengers)*100))
Total number of passengers: 891
Passengers that paid more than 65: 13.02%
passengers that paid more than 100: 5.948372615039282 %
Some unusually high Fare values are observed. Inspecting the most extreme ones suggests a reason:
# look at the most extreme outliers
train[train.Fare>300]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cabin_null | age_null |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C | 1 | 0 |
679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | 0 | 0 |
737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C | 0 | 0 |
These three people have the same ticket number, indicating that they were travelling together. The Fare of 512 in this case is most likely the combined price of the three tickets rather than the price of one, which is why it is unusually high.
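If a fare covers everyone on a shared ticket, a per-person fare can be estimated by dividing it by the number of passengers holding that ticket. This is a minimal sketch on a hypothetical frame; the column names ticket_size and fare_per_person are illustrative, and counting within the training split alone may miss companions who appear only in the test split.

```python
import pandas as pd

# Hypothetical rows mirroring the shape of the data above.
demo = pd.DataFrame({
    "Ticket": ["PC 17755", "PC 17755", "PC 17755", "113803"],
    "Fare": [512.3292, 512.3292, 512.3292, 53.1],
})

# Number of passengers sharing each ticket, broadcast back to every row.
demo["ticket_size"] = demo.groupby("Ticket")["Ticket"].transform("count")

# Rough per-person fare under the assumption that Fare is a ticket total.
demo["fare_per_person"] = demo["Fare"] / demo["ticket_size"]
print(demo)
```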
B.7. Categorical Values :
print('Number of categories in the variable Name: {}'.format(
len(train.Name.unique())))
print('Number of categories in the variable Gender: {}'.format(
len(train.Sex.unique())))
print('Number of categories in the variable Ticket: {}'.format(
len(train.Ticket.unique())))
print('Number of categories in the variable Cabin: {}'.format(
len(train.Cabin.unique())))
print('Number of categories in the variable Embarked: {}'.format(
len(train.Embarked.unique())))
print('Total number of passengers in the Titanic: {}'.format(len(train)))
Number of categories in the variable Name: 891
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 681
Number of categories in the variable Cabin: 148
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 891
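Note that unique() counts NaN as a value, so columns with missing entries (Cabin and Embarked here) report one more category than they actually contain. A small sketch of the difference, using an illustrative Series rather than the real column:

```python
import numpy as np
import pandas as pd

# Illustrative column with three ports and one missing entry.
embarked = pd.Series(["S", "C", "Q", "S", np.nan])

with_nan = len(embarked.unique())   # NaN is included by unique()
without_nan = embarked.nunique()    # nunique() drops NaN by default

print(with_nan, without_nan)
```

This is consistent with the count of 4 reported for Embarked above, which has three ports plus the missing value.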
drop_column = ['cabin_null','age_null']
train.drop(drop_column , axis=1 ,inplace = True )
C. Feature Scaling and Engineering
Feature scaling is a technique used to standardize the range of independent variables or features of data. In machine learning and data analysis, scaling is important because it helps improve the performance and training stability of models.
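As a concrete illustration of standardization, the sketch below rescales two made-up columns to zero mean and unit variance with scikit-learn's StandardScaler. The array values are invented; in practice the scaler would typically be fit on the training split only and then applied to the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented features on very different scales (think age and fare).
X_demo = np.array([[22.0, 7.25],
                   [38.0, 71.28],
                   [26.0, 7.92],
                   [35.0, 53.10]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)  # (x - mean) / std, per column

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

Distance-based and margin-based models such as K-Nearest Neighbours and SVC tend to be the most sensitive to differences in feature scale.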
C.1. Handling the Missing Values:
The dataset contains missing values in several features. To address these, we apply different strategies based on the nature of each feature:
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
test.isnull().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
data_cleaner = [test , train]
for dataset in data_cleaner:
    #completing missing age with median
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    #completing embarked with mode
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    #completing missing fare with median
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())

#drop the Ticket feature from both datasets
train.drop(['Ticket'], axis=1, inplace = True)
test.drop(['Ticket'] , axis=1 , inplace = True)
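The loop above fills each dataset with its own median and mode. A common alternative is to compute the fill values on the training set and reuse them for the test set, so that both are imputed consistently. This is a minimal sketch on hypothetical frames; train_demo, test_demo, and fill_values are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames.
train_demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", np.nan, "S"],
})
test_demo = pd.DataFrame({
    "Age": [np.nan, 30.0],
    "Fare": [np.nan, 10.5],
    "Embarked": ["Q", "S"],
})

# Fill values are learned from the training frame only.
fill_values = {
    "Age": train_demo["Age"].median(),
    "Fare": train_demo["Fare"].median(),
    "Embarked": train_demo["Embarked"].mode()[0],
}

# DataFrame.fillna accepts a dict mapping column name -> fill value.
train_demo = train_demo.fillna(fill_values)
test_demo = test_demo.fillna(fill_values)
print(test_demo)
```

Assigning the result of fillna back, rather than calling it with inplace=True on a selected column, also avoids the chained-assignment pitfalls in recent pandas versions.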
C.2. Encoding
Encoding is a crucial step in data preprocessing, especially for machine learning and statistical modeling. It involves converting categorical variables (features that represent categories or groups) into numerical values that can be processed by machine learning algorithms.
C.2.1. Cabin Feature
drop_column = ['Cabin']
train.drop(drop_column , axis=1 ,inplace = True )
test.drop(drop_column , axis=1 ,inplace = True )
The Cabin feature has been dropped from the dataset due to its high proportion of missing values (77.10%), which makes it less informative.
C.2.2. Fare Feature
full_data = [train,test]
for dataset in full_data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare_Band'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare_Band'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare_Band'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare_Band'] = 3
    dataset['Fare_Band'] = dataset['Fare_Band'].astype(int)
    dataset.drop(['Fare' ], axis = 1 , inplace =True)
The Fare feature has been transformed into discrete fare bands. This transformation categorizes fare amounts into bins, which can simplify the modeling process and potentially reveal patterns.
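The same banding can be expressed in one call with pd.cut. The sketch below uses the same edges as the loop above on a few illustrative fares; it is an equivalent formulation under the assumption that there are no missing fares, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Illustrative fares, including values that sit exactly on a bin edge.
fares = pd.Series([7.25, 7.91, 8.05, 14.454, 31.0, 31.01, 512.3292])

# Right-closed intervals: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, inf)
edges = [-np.inf, 7.91, 14.454, 31, np.inf]

# labels=False returns the integer bin index instead of Interval objects.
fare_band = pd.cut(fares, bins=edges, labels=False)
print(fare_band.tolist())
```

Keeping the edges in a single list makes it harder for the conditions to drift out of sync than when each threshold is repeated across several lines.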
C.2.3. Age Feature
full_data = [test , train]
for dataset in full_data:
    dataset.loc[ dataset['Age'] <= 10, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 10) & (dataset['Age'] <= 15), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 15) & (dataset['Age'] <= 20), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 20) & (dataset['Age'] <= 25), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 25) & (dataset['Age'] <= 30), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 30) & (dataset['Age'] <= 45), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 45) & (dataset['Age'] <= 60), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 60, 'Age'] = 7
    dataset['Age'] = dataset['Age'].astype(int)
The Age feature has been converted into age bins, categorizing age into discrete intervals. This transformation simplifies the feature and can help capture age-related patterns more effectively.
C.2.4. Sex and Embarked Feature
full_data = [test , train]
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
The categorical features Embarked and Sex have been encoded into numeric values. This encoding converts categorical variables into a format suitable for machine learning models.
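Mapping Embarked to 0, 1, 2 imposes an order on ports that have no natural ranking, which can matter for linear and distance-based models. One-hot encoding is a common alternative; the sketch below uses pd.get_dummies on an illustrative frame and is not what the notebook does.

```python
import pandas as pd

# Illustrative rows; not the real Titanic data.
demo = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary feature: a single 0/1 column is enough.
demo["Sex"] = demo["Sex"].map({"female": 0, "male": 1})

# Nominal feature: one indicator column per port.
demo = pd.get_dummies(demo, columns=["Embarked"], dtype=int)
print(demo)
```

Tree-based models are usually less affected by the choice between the two encodings than logistic regression or K-Nearest Neighbours.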
C.2.5. Dropping the Name feature
train.drop(['Name'],axis = 1, inplace = True)
test.drop(['Name'],axis = 1, inplace = True )
The Name feature, which does not provide useful information for modeling, has been removed from both the training and testing datasets.
C.2.6. Family Size
train['family_size'] = train['SibSp'] + train['Parch'] + 1
test['family_size'] = test['SibSp'] + test['Parch'] + 1
test['IsAlone'] = 1
train['IsAlone'] = 1
train.loc[train['family_size'] > 1, 'IsAlone'] = 0
test.loc[test['family_size'] > 1, 'IsAlone'] = 0
test.drop(['SibSp' , 'Parch'], axis = 1 , inplace =True)
train.drop(['SibSp','Parch' ], axis = 1 , inplace =True)
A new feature, family_size, is created by adding SibSp (siblings/spouses aboard) and Parch (parents/children aboard), plus one for the passenger. This feature gives the size of the family travelling together. A second feature, IsAlone, flags passengers travelling with no family (family_size of 1).
test.isnull().sum()
PassengerId 0
Pclass 0
Sex 0
Age 0
Embarked 0
Fare_Band 0
family_size 0
IsAlone 0
dtype: int64
D. Visualizations
Let’s get some insights !
g = sns.FacetGrid(train, col="Survived", row="Sex", hue="Embarked", height=3)
g.map(plt.hist, "Pclass", edgecolor="w").add_legend()
- Observations
- From the graph above we observe that more females survived than males. Most female survivors were in first class, and third-class males account for the most deaths.
- Third-class passengers were the most affected, with the lowest survival, whereas first-class passengers survived at the highest rate.
- Second class has almost equal numbers of survivors and non-survivors. We also notice that most passengers embarked from “S”.
plt.figure(figsize = [8,5])
sns.violinplot(x="Fare_Band", y="Age", data=train, hue='Survived',palette='coolwarm')
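Across the fare bands, the violins are widest at age group “4” (ages 25 to 30), for survivors as well as non-survivors. This is partly expected, since missing ages were filled with the median, which concentrates passengers in a single age group.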
train[['family_size', 'Survived']].groupby(['family_size'], as_index=False).mean()
| | family_size | Survived |
---|---|---|
0 | 1 | 0.303538 |
1 | 2 | 0.552795 |
2 | 3 | 0.578431 |
3 | 4 | 0.724138 |
4 | 5 | 0.200000 |
5 | 6 | 0.136364 |
6 | 7 | 0.333333 |
7 | 8 | 0.000000 |
8 | 11 | 0.000000 |
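axes = sns.catplot(x='family_size', y='Survived', hue='Sex', data=train, aspect=3, kind='point')
Survival does not fall steadily with family size: it rises from about 30% for passengers travelling alone to about 72% for families of four, then drops sharply for families of five or more.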
plt.figure(figsize=(10,10))
sns.heatmap(train.drop('PassengerId',axis=1).corr(), square=True, annot=True)
- Understanding the correlation matrix:
- Fare_Band and Pclass are strongly negatively correlated (-0.63), followed by Fare_Band and IsAlone (-0.57).
- Sex and Survived also show a substantial negative correlation (-0.54).
- IsAlone and family_size have the largest negative correlation (-0.69), which is expected because IsAlone is derived from family_size: it is 1 only when family_size is 1.
E. Model Training and Predicting
Splitting the data into train and test sets
X = train.drop('Survived' , axis = 1 )
y = train['Survived']

from sklearn.model_selection import train_test_split
X_train ,X_test , y_train , y_test = train_test_split(X , y , test_size = 0.3 , random_state =102)
We also need to remove the passenger Id before fitting the models:
X_train=X_train.drop(['PassengerId'],axis=1)
X_test= X_test.drop(['PassengerId'],axis=1)
We import the models from scikit-learn. The objective is to classify each passenger into one of two classes, 0 (did not survive) or 1 (survived), so this is a binary classification problem for which we will use classifiers. The rest of this notebook compares the models and picks the best-performing ones based on accuracy.
#Importing all models
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
E.1. Logistic Regression
logmodel = LogisticRegression()

logmodel.fit(X_train , y_train)
pred_l = logmodel.predict(X_test)
acc_l = accuracy_score(y_test , pred_l)*100
acc_l
79.1044776119403
E.2. Random Forest
random_forest = RandomForestClassifier(n_estimators= 100)

random_forest.fit(X_train, y_train)
pred_rf = random_forest.predict(X_test)
acc_rf = accuracy_score(y_test , pred_rf)*100
acc_rf
83.2089552238806
E.3. K-Nearest Neighbours
knn = KNeighborsClassifier(n_neighbors = 3)

knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)

acc_knn = accuracy_score(y_test , pred_knn)*100
acc_knn
79.8507462686567
E.4. Gaussian Naive Bayes Classifier
gaussian = GaussianNB()

gaussian.fit(X_train, y_train)
pred_gb = gaussian.predict(X_test)

acc_gb = accuracy_score(y_test , pred_gb)*100
acc_gb
77.98507462686567
E.5. C-Support Vector Classifier
svc = SVC()

svc.fit(X_train, y_train)
pred_svc = svc.predict(X_test)

acc_svc = accuracy_score(y_test , pred_svc)*100
acc_svc
84.70149253731343
E.6. Decision Tree
decision_tree = DecisionTreeClassifier()

decision_tree.fit(X_train, y_train)
pred_dt = decision_tree.predict(X_test)

acc_dt = accuracy_score(y_test , pred_dt)*100
acc_dt
81.34328358208955
E.7. Linear classifiers with SGD training.
sgd = SGDClassifier()

sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)

acc_sgd = accuracy_score(y_test , pred_sgd)*100
acc_sgd
73.13432835820896
## Arranging the Accuracy results
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forrest','K- Nearest Neighbour' ,
              'Naive Bayes' , 'C-Support Vector Classifier' , 'Decision Tree' , 'Stochastic Gradient Descent'],
    'Score': [acc_l , acc_rf , acc_knn , acc_gb , acc_svc ,
              acc_dt , acc_sgd]})
models.sort_values(by='Score', ascending=False)
| | Model | Score |
---|---|---|
4 | C-Support Vector Classifier | 84.701493 |
1 | Random Forrest | 83.208955 |
5 | Decision Tree | 81.343284 |
2 | K- Nearest Neighbour | 79.850746 |
0 | Logistic Regression | 79.104478 |
3 | Naive Bayes | 77.985075 |
6 | Stochastic Gradient Descent | 73.134328 |
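The scores above come from a single 70/30 split, so the ranking can shift with a different random_state. A k-fold cross-validated accuracy is one way to get a steadier comparison. The sketch below runs on synthetic data from make_classification as a stand-in for X and y, and compares only two of the models; the names X_demo, y_demo, candidates, and cv_scores are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds is less sensitive to one lucky or unlucky split.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean() * 100
    print(f"{name}: {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f})")
```

Reporting the spread across folds alongside the mean helps show whether a gap of a few points between two models is meaningful.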
Ensemble Learning
df_test = test.drop(['PassengerId'],axis=1)

p_l = logmodel.predict(df_test)
p_svc = svc.predict(df_test)
p_rf = random_forest.predict(df_test)
p_dt = decision_tree.predict(df_test)

predict_combine = np.zeros((df_test.shape[0]))
for i in range(0, test.shape[0]):
    temp = p_rf[i]+p_svc[i]+p_l[i]+p_dt[i]
    if temp>=2:
        predict_combine[i] = 1
predict_combine = predict_combine.astype('int')
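The loop marks a passenger as a survivor when at least two of the four models predict 1, so a 2-2 tie counts as survival. The same rule can be written without a loop; the sketch below uses made-up prediction arrays, and the demo_ names are hypothetical.

```python
import numpy as np

# Made-up 0/1 predictions from four models for five passengers.
demo_l   = np.array([0, 1, 1, 0, 1])
demo_svc = np.array([0, 1, 0, 0, 1])
demo_rf  = np.array([0, 0, 1, 1, 1])
demo_dt  = np.array([0, 0, 0, 0, 1])

# Count the votes for class 1 per passenger.
votes = demo_l + demo_svc + demo_rf + demo_dt

# Same threshold as the loop: two or more votes -> predict 1.
combined = (votes >= 2).astype(int)
print(combined)
```

With an even number of voters the tie-breaking direction is a design choice; using an odd number of models, or a threshold of 3, would change how ties are resolved.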
Submission
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predict_combine
})
submission.to_csv("submission.csv", encoding='utf-8', index=False)
submission.head()
| | PassengerId | Survived |
---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
Thank you!
- Kaggle: Notebook
- Resources: Udemy Course, Scikit-Learn docs