Tuesday 20 August 2024

Case Study on Titanic Survival Data Using Machine Learning

 Example Use Case: Predicting Titanic Survivors Using Machine Learning

A well-known example in the machine learning community is predicting the survival of passengers on the Titanic. The Titanic dataset contains details about the passengers aboard the ill-fated ship, such as their age, sex, class, and whether they survived or not. This data is often used as a beginner's project to demonstrate classification algorithms in ML.

 Dataset Overview:

The Titanic dataset consists of the following columns (features):
- PassengerId: Unique ID of the passenger.
- Pclass: The class of the passenger (1st, 2nd, or 3rd class).
- Name: The name of the passenger.
- Sex: The gender of the passenger (male or female).
- Age: The age of the passenger.
- SibSp: The number of siblings or spouses aboard the Titanic.
- Parch: The number of parents or children aboard.
- Ticket: The ticket number.
- Fare: The fare the passenger paid for the ticket.
- Cabin: The cabin where the passenger stayed (often missing).
- Embarked: The port at which the passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton).
- Survived: The target variable (1 = survived, 0 = did not survive).
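
As a rough sketch of how these columns look once loaded, the snippet below builds a tiny pandas DataFrame with the same layout, separates the target from the features, and counts missing values per column. The three rows are made up for illustration and are not records from the real dataset; the file name in the comment is also an assumption.

```python
import pandas as pd

# Made-up rows in the dataset's column layout; these are NOT real passenger records.
rows = [
    {"PassengerId": 1, "Survived": 0, "Pclass": 3, "Name": "Doe, Mr. John",
     "Sex": "male", "Age": 30.0, "SibSp": 0, "Parch": 0, "Ticket": "T-001",
     "Fare": 8.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Survived": 1, "Pclass": 1, "Name": "Roe, Mrs. Jane",
     "Sex": "female", "Age": None, "SibSp": 1, "Parch": 0, "Ticket": "T-002",
     "Fare": 70.0, "Cabin": "C10", "Embarked": "C"},
    {"PassengerId": 3, "Survived": 1, "Pclass": 2, "Name": "Poe, Miss. Ann",
     "Sex": "female", "Age": 19.0, "SibSp": 0, "Parch": 1, "Ticket": "T-003",
     "Fare": 15.0, "Cabin": None, "Embarked": "Q"},
]
df = pd.DataFrame(rows)
# With a real file you would load it instead, e.g. pd.read_csv("titanic.csv")
# (the file name is an assumption).

# Separate the target variable from the candidate features.
y = df["Survived"]
X = df.drop(columns=["Survived"])

# Count missing values per column to see where cleaning is needed.
missing_per_column = df.isna().sum()
print(df.dtypes)
print(missing_per_column[missing_per_column > 0])
```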

 Objective:
The goal is to predict whether a passenger survived based on these features. This is a binary classification problem: the target, Survived, takes one of two values (0 or 1).

---

 Step-by-Step Example: Titanic Survival Prediction Using ML

 Step 1: Data Preprocessing

- Handle Missing Data: Some features, such as Age, Cabin, and Embarked, might have missing values. Common options are to fill numerical gaps with the median, fill categorical gaps with the most frequent value, or remove rows that have too many missing values.
- Feature Engineering: Create new features that could be useful, such as:
  - Family Size: Combine "SibSp" and "Parch" to get the total family size aboard.
  - Title: Extract titles from the Name field (Mr., Mrs., etc.) to understand social status or age group.
  - Age Group: Convert age into categories (e.g., child, adult, elderly) if this is more predictive.
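
The preprocessing ideas above could be sketched with pandas roughly as follows. The rows are made up, the "+ 1" in the family size (counting the passenger) is a common convention rather than something the dataset dictates, and the age-group bin edges are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows in the dataset's column layout (not real passenger records).
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Low, Master. Tim"],
    "Age": [35.0, None, 19.0, 4.0],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 2],
    "Embarked": ["S", None, "S", "Q"],
})

# Numerical column: fill gaps with the median of the observed values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill gaps with the most frequent value (the mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Family Size: siblings/spouses + parents/children + the passenger (assumed convention).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Title: the text between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Age Group: the bin edges below are illustrative assumptions.
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 12, 60, 120],
                        labels=["child", "adult", "elderly"])

print(df[["Age", "Embarked", "FamilySize", "Title", "AgeGroup"]])
```

In a real project the median and mode would normally be computed on the training portion only and then reused for the test portion, so that no information leaks from the held-out data.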

 Step 2: Feature Selection

- Select the most important features for training. For example, gender (Sex) is often a crucial feature in predicting survival, as women were more likely to survive. Pclass, Age, and Fare can also be important features.
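
A minimal sketch of this step, assuming the four features named above are the ones kept: select the columns and turn Sex into a number so a model can use it. The rows are made up, and the 0/1 encoding is one possible choice among several.

```python
import pandas as pd

# Made-up rows (not real passenger records).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Low, Mr. Tim"],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 2],
    "Age": [30.0, 41.0, 19.0, 52.0],
    "Fare": [8.0, 70.0, 7.5, 13.0],
    "Ticket": ["T-001", "T-002", "T-003", "T-004"],
    "Survived": [0, 1, 1, 0],
})

# Keep only the columns chosen as model inputs; identifiers such as
# PassengerId, Name, and Ticket are left out.
selected = ["Sex", "Pclass", "Age", "Fare"]
X = df[selected].copy()

# Encode Sex as a number (0 = male, 1 = female; an arbitrary but common choice).
X["Sex"] = X["Sex"].map({"male": 0, "female": 1})

y = df["Survived"]
print(X)
```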

 Step 3: Model Selection

- Choose an Algorithm: You could use a variety of ML models for this task, such as:
  - Logistic Regression: A simple model for binary classification.
  - Decision Trees: A tree-like model that splits data based on the most important features.
  - Random Forests: An ensemble of decision trees whose combined votes reduce the overfitting of a single tree.
  - Support Vector Machines (SVM): A powerful classifier that works well for high-dimensional data.
  - Neural Networks: A more complex model, though often overkill for smaller datasets like this.
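
One way to compare several of these candidates is to score each with cross-validation, as in the scikit-learn sketch below. It runs on synthetic data from `make_classification` as a stand-in for a prepared Titanic feature matrix, so the printed scores say nothing about the real dataset. The neural network is left out for brevity, and the settings shown (such as `max_depth=4`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared feature matrix; this is not Titanic data.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

candidates = {
    # Scaling helps logistic regression and SVMs; trees do not need it.
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```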

 Step 4: Model Training

- Train the model on one portion of the data (the training set) and hold out a separate portion (the test set) to check it. Techniques like cross-validation on the training portion help you detect overfitting and give a more stable estimate of how the model will perform on unseen data.
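
A minimal sketch of this workflow with scikit-learn, again on synthetic stand-in data rather than the Titanic file. The 80/20 split, the five folds, and the choice of logistic regression are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a prepared feature matrix; this is not Titanic data.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

# Hold out 20% as a test set; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training portion, then score once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
print(f"Test accuracy: {test_accuracy:.3f}")
```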

 Step 5: Model Evaluation

- Evaluate the performance of the model using metrics such as:
  - Accuracy: The percentage of correct predictions.
  - Precision: The proportion of true positives (passengers correctly predicted to survive) among all positive predictions.
  - Recall: The proportion of true positives among all actual positives.
  - F1 Score: The harmonic mean of precision and recall, useful when the dataset is imbalanced.
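
To make the four metrics concrete, the sketch below computes them with `sklearn.metrics` on a small pair of hand-written label lists. The labels are invented for the example; with them there are 3 true positives, 2 false positives, 1 false negative, and 2 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels for illustration: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 5 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 2 / 3

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```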

 Step 6: Model Tuning

- Fine-tune the model's hyperparameters (e.g., regularization strength or tree depth) to improve performance. This can be done using grid search or random search for hyperparameter optimization.
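
Grid search could look roughly like this with scikit-learn's `GridSearchCV`, here tuning a decision tree on synthetic stand-in data. The parameter values in the grid are arbitrary examples, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared feature matrix; this is not Titanic data.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

# Every combination in the grid is evaluated: 4 depths x 3 leaf sizes = 12 settings.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for each setting
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

For random search, `RandomizedSearchCV` from the same module samples a fixed number of settings instead of trying them all, which tends to scale better when the grid is large.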

 Step 7: Predictions

- Once the model is trained and evaluated, you can use it to predict survival for passengers whose outcome it has not seen (e.g., records held out from training, or hypothetical passengers described by the same features).
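
As an illustration of the prediction step, the sketch below fits a logistic regression on a handful of made-up rows and then scores two hypothetical passengers. None of the rows are real records, the feature set and the 0/1 encoding of Sex are assumptions, and a model trained on eight invented rows is only useful for showing the API calls.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training rows (Sex encoded 0 = male, 1 = female); not real records.
train = pd.DataFrame({
    "Sex":      [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 1],
    "Age":      [22, 38, 26, 35, 54, 4, 30, 58],
    "Fare":     [7.0, 71.0, 13.0, 8.0, 52.0, 17.0, 13.0, 27.0],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})
features = ["Sex", "Pclass", "Age", "Fare"]

model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["Survived"])

# Hypothetical passengers described by the same features.
new_passengers = pd.DataFrame({
    "Sex": [1, 0],
    "Pclass": [1, 3],
    "Age": [29, 40],
    "Fare": [80.0, 7.5],
})

# Hard 0/1 predictions and the estimated probability of class 1 (survived).
predictions = model.predict(new_passengers[features])
probabilities = model.predict_proba(new_passengers[features])[:, 1]

for pred, prob in zip(predictions, probabilities):
    print(f"predicted={pred} probability_survived={prob:.2f}")
```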

---

Example: Key Insights from Titanic Prediction

After running the machine learning model on the Titanic dataset, you might find several insights that are both informative and actionable, such as:

1. Gender is often the strongest factor: The model is likely to show that women had a significantly higher chance of survival than men. This aligns with historical records where women and children were prioritized during the evacuation.
2. Pclass matters: Passengers in higher classes (1st class) had a much better chance of survival than those in 3rd class, likely due to the location of their cabins and their proximity to the lifeboats.
3. Age and Family Size: Children and passengers traveling with families might have had higher survival rates, as they were often prioritized for lifeboats.
4. Fare: Wealthier passengers (who paid higher fares) were more likely to survive, again reflecting the social inequalities of the time.
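
A quick way to check patterns like these before or alongside a model is to compare survival rates across groups. The sketch below shows the mechanics with pandas `groupby`; the six rows are invented, so the resulting rates only demonstrate the calculation and say nothing about actual passengers.

```python
import pandas as pd

# Invented rows to demonstrate the calculation; not real passenger records.
df = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 2, 3, 3, 1, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```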

---

 Potential Impact of ML in This Case

Machine learning models can help researchers, historians, or analysts extract patterns from historical datasets that were previously hard to quantify. In the Titanic example, using machine learning can reveal biases and social factors (such as class and gender) that influenced survival chances in ways that could be overlooked in manual analysis.

Moreover, ML can also be extended to more complex datasets, such as modern disaster survival analysis, helping authorities and organizations optimize evacuation procedures or make better-informed decisions during critical situations.

Palium Skills offers courses on Artificial Intelligence and Generative AI for college students and working professionals. The courses are completely hands-on, with guidance, demos, and practice.