Name: Emad Takla

DAND P2

Dataset chosen: Titanic

Setup

Headers to be used throughout the project:

In [2]:
import unicodecsv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from IPython.display import HTML
from IPython.display import display_html

Special Commands

In [3]:
#Plot all figures within the html page
%pylab inline

#Make display better, like printing dataframes as tables
pd.set_option('display.notebook_repr_html', True)
Populating the interactive namespace from numpy and matplotlib

Importing Data

In [4]:
titanic_all_data = pd.read_csv('./titanic_data.csv')

Supporting Functions:

A function to map age into generation buckets: Child (0 to 16 years old), Adult (17 to 49 years old) and elderly (50 years old and over)

In [5]:
def age_to_generation(age):
    #Check if the input is negative, return an error
    if age < 0:
        return 'Invalid age'
    #If younger than 17, then it is classified as a child
    if age < 17:
        return 'Child'
    #else if it is between 18 and 49 (both inclusive), the passed age is for an adult
    if age < 50:
        return 'Adult'
    #Else if, the person is an elderly
    if age >= 50:
        return 'Elderly'
    #Else, the input age was unspecified (Blank cell, NaN)
    return 'Unspecified age'

A function to strip the last name from the column 'Name' The assumption is that the format is as follows: 'last_name, title. first_name (Alias\Maiden Name)'

In [6]:
def get_last_name(full_name):
    return full_name.split(',')[0]

A function to plot a population pyramid, a horizontal histogram bar chart plotted back to back

In [7]:
# Plotting code coming from http://stackoverflow.com/questions/27694221/using-python-libraries-to-plot-two-horizontal-bar-charts-sharing-same-y-axis
#Binning Code coming from http://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values

def plot_population_pyramid(bins, left_title, left_data, right_title, right_data, largest_x_value):
    fig, axes = plt.subplots(ncols=2, sharey=True)
    
    #largest_x_value:
    #this variable will be used to preserve scale. If not, the two graphs will extend to the maximum value of the dataset,
    #and will be visually misleading
    #largest_x_value = max(left_data.max(), right_data.max())
    
    axes[0].barh(bins, left_data, align='center')
    axes[0].set(title=left_title)
    axes[0].set_xlim(0, largest_x_value)
    
    axes[1].barh(bins, right_data, align='center')
    axes[1].set(title=right_title)
    axes[1].set_xlim(0, largest_x_value)
    
    axes[0].invert_xaxis()
    axes[0].set(yticks=bins)
    axes[0].yaxis.tick_right()

Two functions to count the number of survivors\victims within a passed GroupBy object. The Function Accepts A GroupBy Object, Not a DataFrame!

The need for the function came when groupby can get an input that had by chance all victims\survivors, and there was no other group for the missing opposite value. This created index errors when I programmatically looped over the groups, and I had to check if the group was present in the data structure before using it

In [8]:
def get_surviving_count(group_obj):
    #The values in Survived are either 1 or 0, with 1 indicating survivors
    if 1 in group_obj.groups.keys():
        return group_obj.get_group(1)['PassengerId'].count()
    else:
        return 0
In [9]:
def get_victim_count(group_obj):
    #The values in Survived are either 1 or 0, with 0 indicating victims
    if 0 in group_obj.groups.keys():
        return group_obj.get_group(0)['PassengerId'].count()
    else:
        return 0

Graphing Helper Functions and Variables.

In [10]:
gender_colors = ['hotpink','dodgerblue']
survival_colors = ['red', 'limegreen']

The following function accepts a groupby object, that must group its original input only by Survived (ie only at maximum two groups, group 0 (victim) and group 1 (survived)

In [11]:
def plot_survival_pie_chart(group_obj, label):
    graph_label = []
    survival_colors = []
    
    if 0 in group_obj.groups.keys():
        graph_label.append('Victim')
        survival_colors.append('red')
    if 1 in group_obj.groups.keys():
        graph_label.append('Survived')
        survival_colors.append('limegreen')
        
    group_obj['PassengerId'].count().plot.pie(label = label, autopct='%1.1f%%', colors=survival_colors,labels=graph_label)

A function to get the count of passengers in a passed dataframe

In [12]:
def get_count(df):
    return df['PassengerId'].count()

A function to make subplots. It will be overloaded to have a version where we can pass the labels parameter

In [13]:
def draw_pie_subplot(groupby_data, subplot_position, graph_label,  graph_type):
    graph_colors = []
    if(graph_type == 'SURVIVAL_GRAPH'):
        graph_colors = survival_colors
    elif(graph_type == 'GENDER_GRAPH'):
        graph_colors = gender_colors

    fig.add_subplot(subplot_position)
    
    #If it's a survival graph, use the already created function to plot it
    if (graph_type == 'SURVIVAL_GRAPH'):
        plot_survival_pie_chart(groupby_data ,graph_label )
        
    #Default colors. Do not pass the colors parameter to the plotting function
    elif graph_type == 'DEFAULT':
        get_count(groupby_data).plot.pie(label = graph_label, autopct='%1.1f%%')
        
    else:
        get_count(groupby_data).plot.pie(label = graph_label, \
                                     autopct='%1.1f%%',\
                                     colors = graph_colors)
    plt.axis('equal')
    

Questions

Q1: What is the effect of Traveling With First Degree Relatives Over the Survival of a Passenger ?

http://www.durhamcollege.ca/wp-content/uploads/STAT_nullalternate_hypothesis.pdf

  • The question that will be investigated in this analysis is: if a passenger was traveling without a family (ie both SibSp and Parch were equal to zero), did he have a higher\lower chance of survival ? Was having a family on board an advantage, disadvantage or irrelevant for the survival of a passenger ?

Null Hypothesis: There is NO difference in the survival rate of passengers traveling with their immediate family and that of passengers traveling alone.

Alternate Hypothesis: There IS a difference in the survival rate of passengers traveling with their immediate family and that of passengers traveling alone.

**α: 0.05**

Q2: Is It Possible From The Provided Data To Identify Same Family Members?

This is more of a data investigation question, not a statistical inference one. The data provides the last names, ages, traveling class and companionship (Spouses, Sibling - Parents, Children). Are these enough to make (partial) educated guesses about the family members? And if there is a success to do that, can we infer if having a bigger family improved the chances of survival ? (Possibly, since the rest of the family would pressure the crew to allow their left-behind family member on board of the life boats)

Data Wrangling

New Fields to the Data

An interesting feature to show, is the generation to which the passenger belongs. There shall be three categories for that parameter (Child, Adult and Elderly) based on their age range. The breakdown is as follows

  • Child: 16 years old >= Age >= 0 years old
  • Adult: 49 years old >= Age >= 17 years old
  • Elderly: Age >= 50 years old

Any other values like NaN, negative numbers, blank fields, non-numerical values..etc will be noted as "Unspecified age"

The 'isSolo' field will be used to see if a passenger is traveling with his\her family or not. Here are the criterias used:

  • A solo traveler is a traveler whose SibSp\Parch fields are equal to zero.
  • A child cannot be a solo traveler, even if the SibSp\Parch fields are equal to zero.
  • Just a word of caution when interpreting the isSolo field: isSolo does not mean that the traveler was traveling totally alone, they can be traveling with friends for example.

The total number of companions is the sum of the fields Parch and SibSp. It will be used for creating some descriptive statitistics about companionship

The 'LastName' field simply extracts the last name of the passenger from the full name provided. This will be useful in answering the second question.

In [14]:
#Adding the generation data field to the table. The generation can have three values: Child, Adult or Elderly
#The buckets ranges were arbitrarily chosen 
titanic_all_data['Generation']  = titanic_all_data.loc[:,'Age'].apply(age_to_generation)

"""
If the traveler is a child (Under 17 years old), then automatically they are not a solo traveler (Can be traveling with a 
nanny or a close family-friend..etc). Or, if the traveler has a non-zero value in any of SibSp and Parch, then they are not 
solo Otherwise, they are a solo passenger, traveling alone
"""
#This one was tough, using the 'and' operator raised a ValueError, and it took me a while to find out that
#I should substitute it with bitwise &
titanic_all_data['isSolo'] = (titanic_all_data['SibSp'] + titanic_all_data['Parch'] == 0) & \
                             (titanic_all_data['Generation'] != 'Child')

titanic_all_data['TotalCompanions'] = titanic_all_data['Parch'] + titanic_all_data['SibSp']


#Stripping the last name of the passengers. This will be used to determine which passengers probably belong to the same family
titanic_all_data['LastName'] = titanic_all_data.loc[:,'Name'].apply(get_last_name)

Data Slicing

Extracting General Data About the Passengers

In [15]:
#Passengers by Gender
male_passengers = titanic_all_data[titanic_all_data['Sex'] == 'male'] 
female_passengers = titanic_all_data[titanic_all_data['Sex'] == 'female'] 

#Passengers by Generation
children_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Child' ]
adult_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Adult' ]
elderly_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Elderly' ]

#Passengers by Survival
surviving_passengers = titanic_all_data[ titanic_all_data['Survived'] == 1 ]
victim_passengers = titanic_all_data[ titanic_all_data['Survived'] == 0 ]

Data Groups:

In my opinion, groups are very straight forward when it comes to plot; but slicing dataframes is better for calculations. Just easier for me.

In [16]:
passengers_by_gender = titanic_all_data.groupby('Sex')
passengers_by_generation = titanic_all_data.groupby('Generation')
passengers_by_class = titanic_all_data.groupby('Pclass')
passengers_by_survival = titanic_all_data.groupby('Survived')
passengers_by_companionship = titanic_all_data.groupby('isSolo')

passengers_by_class_and_gender = titanic_all_data.groupby(['Pclass', 'Sex'])
passengers_by_class_and_generation = titanic_all_data.groupby(['Pclass', 'Generation'])
passengers_by_class_and_survival = titanic_all_data.groupby(['Pclass', 'Survived'])
passengers_by_generation_and_gender = titanic_all_data.groupby(['Generation', 'Sex'])

Slicing The Passengers' Parameters According to their Traveling Class

I have followed the instructions in the previous review by using groupby instead of slicing. However, I still cannot see the advantage in that. In fact, I think that the first way was more readable. For example: first_class_passengers[ first_class_passengers['Survived'] == 0 ] vs passengers_by_class_and_survival.get_group( (1,0) )

In [17]:
#Passengers by class
########################################################################################################################
#First Class Data Splitting:
first_class_passengers = passengers_by_class.get_group(1)

#Gender                  
first_class_male_passengers = passengers_by_class_and_gender.get_group( (1,'male') )
first_class_female_passengers = passengers_by_class_and_gender.get_group( (1,'female') ) 
#Children
first_class_children_passengers = passengers_by_class_and_generation.get_group( (1, 'Child') )
first_class_children_male_passengers = first_class_children_passengers.groupby('Sex').get_group('male')
first_class_children_female_passengers = first_class_children_passengers.groupby('Sex').get_group('female')
#Adults
first_class_adult_passengers = passengers_by_class_and_generation.get_group( (1, 'Adult') )
first_class_adult_male_passengers = first_class_adult_passengers.groupby('Sex').get_group('male')
first_class_adult_female_passengers = first_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
first_class_elderly_passengers = passengers_by_class_and_generation.get_group( (1, 'Elderly') )
first_class_elderly_male_passengers = first_class_elderly_passengers.groupby('Sex').get_group('male')
first_class_elderly_female_passengers = first_class_elderly_passengers.groupby('Sex').get_group('female')

#First Class Survival
first_class_survivors = passengers_by_class_and_survival.get_group( (1,1) )
first_class_victims = passengers_by_class_and_survival.get_group( (1,0) )

########################################################################################################################
#Second Class Data Splitting:
second_class_passengers = passengers_by_class.get_group(2)

#Gender                  
second_class_male_passengers = passengers_by_class_and_gender.get_group( (2,'male') )
second_class_female_passengers = passengers_by_class_and_gender.get_group( (2,'female') ) 
#Children
second_class_children_passengers = passengers_by_class_and_generation.get_group( (2, 'Child') )
second_class_children_male_passengers = second_class_children_passengers.groupby('Sex').get_group('male')
second_class_children_female_passengers = second_class_children_passengers.groupby('Sex').get_group('female')
#Adults
second_class_adult_passengers = passengers_by_class_and_generation.get_group( (2, 'Adult') )
second_class_adult_male_passengers = second_class_adult_passengers.groupby('Sex').get_group('male')
second_class_adult_female_passengers = second_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
second_class_elderly_passengers = passengers_by_class_and_generation.get_group( (2, 'Elderly') )
second_class_elderly_male_passengers = second_class_elderly_passengers.groupby('Sex').get_group('male')
second_class_elderly_female_passengers = second_class_elderly_passengers.groupby('Sex').get_group('female')

#second Class Survival
second_class_survivors = passengers_by_class_and_survival.get_group( (2,1) )
second_class_victims = passengers_by_class_and_survival.get_group( (2,0) )

########################################################################################################################
#Third Class Data Splitting:
third_class_passengers = passengers_by_class.get_group(3)

#Gender                  
third_class_male_passengers = passengers_by_class_and_gender.get_group( (3,'male') )
third_class_female_passengers = passengers_by_class_and_gender.get_group( (3,'female') ) 
#Children
third_class_children_passengers = passengers_by_class_and_generation.get_group( (3, 'Child') )
third_class_children_male_passengers = third_class_children_passengers.groupby('Sex').get_group('male')
third_class_children_female_passengers = third_class_children_passengers.groupby('Sex').get_group('female')
#Adults
third_class_adult_passengers = passengers_by_class_and_generation.get_group( (3, 'Adult') )
third_class_adult_male_passengers = third_class_adult_passengers.groupby('Sex').get_group('male')
third_class_adult_female_passengers = third_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
third_class_elderly_passengers = passengers_by_class_and_generation.get_group( (1, 'Elderly') )
third_class_elderly_male_passengers = third_class_elderly_passengers.groupby('Sex').get_group('male')
third_class_elderly_female_passengers = third_class_elderly_passengers.groupby('Sex').get_group('female')

#third Class Survival
third_class_survivors = passengers_by_class_and_survival.get_group( (3,1) )
third_class_victims = passengers_by_class_and_survival.get_group( (3,0) )

Data Exploration and Analysis

Miscellaneous Data Count Computation

In [18]:
sample_size = get_count(titanic_all_data)
male_count = get_count(passengers_by_gender.get_group('male'))
female_count = get_count(passengers_by_gender.get_group('female'))

children_count = get_count(children_passengers)
children_male_count = get_count(passengers_by_generation_and_gender.get_group(('Child', 'male')) )
children_female_count = get_count(passengers_by_generation_and_gender.get_group(('Child', 'female')) )

adult_count = get_count(adult_passengers)
adult_male_count = get_count( passengers_by_generation_and_gender.get_group(('Adult', 'male')) )
adult_female_count = get_count( passengers_by_generation_and_gender.get_group(('Child', 'female')) )

elderly_count = get_count(elderly_passengers)
elderly_male_count = get_count( passengers_by_generation_and_gender.get_group(('Elderly', 'male')) )
elderly_female_count = get_count( passengers_by_generation_and_gender.get_group(('Elderly', 'female')) )

first_class_count = get_count(first_class_passengers)
first_class_male_count = get_count(first_class_male_passengers)
first_class_female_count = get_count(first_class_female_passengers)
first_class_children_count = get_count(first_class_children_passengers)
first_class_children_male_count = get_count(first_class_children_male_passengers)
first_class_children_female_count = get_count(first_class_children_female_passengers)
first_class_adult_count = get_count(first_class_adult_passengers)
first_class_adult_male_count = get_count(first_class_adult_male_passengers)
first_class_adult_female_count = get_count(first_class_adult_female_passengers)
first_class_elderly_count = get_count(first_class_elderly_passengers)
first_class_elderly_male_count = get_count(first_class_elderly_male_passengers)
first_class_elderly_female_count = get_count(first_class_elderly_female_passengers)

second_class_count = get_count(second_class_passengers)
second_class_male_count = get_count(second_class_male_passengers)
second_class_female_count = get_count(second_class_female_passengers)
second_class_children_count = get_count(second_class_children_passengers)
second_class_children_male_count = get_count(second_class_children_male_passengers)
second_class_children_female_count = get_count(second_class_children_female_passengers)
second_class_adult_count = get_count(second_class_adult_passengers)
second_class_adult_male_count = get_count(second_class_adult_male_passengers)
second_class_adult_female_count = get_count(second_class_adult_female_passengers)
second_class_elderly_count = get_count(second_class_elderly_passengers)
second_class_elderly_male_count = get_count(second_class_elderly_male_passengers)
second_class_elderly_female_count = get_count(second_class_elderly_female_passengers)

third_class_count = get_count(third_class_passengers)
third_class_male_count = get_count(third_class_male_passengers)
third_class_female_count = get_count(third_class_female_passengers)
third_class_children_count = get_count(third_class_children_passengers)
third_class_children_male_count = get_count(third_class_children_male_passengers)
third_class_children_female_count = get_count(third_class_children_female_passengers)
third_class_adult_count = get_count(third_class_adult_passengers)
third_class_adult_male_count = get_count(third_class_adult_male_passengers)
third_class_adult_female_count = get_count(third_class_adult_female_passengers)
third_class_elderly_count = get_count(third_class_elderly_passengers)
third_class_elderly_male_count = get_count(third_class_elderly_male_passengers)
third_class_elderly_female_count = get_count(third_class_elderly_female_passengers)

Finding Out Missing Fields

In [19]:
#The PassengerId is present for all the passengers, but there are some columns that has some missing values. Which columns 
#are they?
for column_id in titanic_all_data.columns.values:
    column_non_NaN_count = titanic_all_data[column_id].count()
    if column_non_NaN_count != sample_size:
        print str(column_id) + ": missing " + str(sample_size - column_non_NaN_count) + " values"
Age: missing 177 values
Cabin: missing 687 values
Embarked: missing 2 values

I am going to leave the blank cells empty, until I need to fill them with a value - if needed. For the time being, I am not sure for example if I should fill the missing ages with the mean age or just treat them as zeros. Let us wait and see what need would arise.

A word of caution about the ages in this data set

Not all passengers have known ages. An assumption: that the passengers whose age is not know are equally distributed along all age range, so the effect of missing this piece of information is minimal. For curiosity, below is provided the percentages of survivors and victims whose ages are unknown:

In [20]:
#Get the count of both survivors and victims whose age is null
survivors_without_age_count = surviving_passengers['Age'].isnull().sum()
victims_without_age_count = victim_passengers['Age'].isnull().sum()

#Count the total of survivors and victims, so we can calculate the percentage of passengers with missing ages
total_survivors_count = get_count(surviving_passengers)
total_victims_count = get_count(victim_passengers)

print "Percentage of survivors with missing age: ", float(survivors_without_age_count)*100/total_survivors_count,"%"
print "Percentage of victims with missing age:   ", float(victims_without_age_count)*100/total_victims_count,"%"
Percentage of survivors with missing age:  15.2046783626 %
Percentage of victims with missing age:    22.7686703097 %

Demographics

Some default columns, will be used as a dataframe column. These dataframes will be used to display the data in an HTML table format. Although there are alternatives to this, using the dataframe as a way to display tables was the easiest one.

Quick Statistics, All The Sample

Sample's Gender Make Up

In [21]:
#Rows to be displayed
rows = [{male_count, female_count, sample_size}]

#create the data frame, in preparation for HTML table display
df = pd.DataFrame(rows, columns=['Male', 'Female',"Total"], index=["count"])

#Display the dataframe as an HTML table. I had to use this function since just writing "df" on a single line
#will not display the table when there is a plot coming afterwards in the same cell.
display_html(df)

#Plot the pie chart, all passengers by gender
passengers_by_gender.size().plot.pie(label="Gender Ratio of All Passengers", autopct='%1.1f%%', colors=gender_colors)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()
Male Female Total
count 577 314 891

Sample's Generation Make Up:

In [22]:
#I had a problem creating the dataframe in the same manner as the previous one, since the order was not preserved 
#(ie Children count was below column elderly for example)
unspecified_age_count = get_count(passengers_by_generation.get_group("Unspecified age"))

#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'Children':children_count, 'Adult':adult_count,'Elderly':elderly_count,'Unspecified Age':unspecified_age_count},\
                   index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by their generation
passengers_by_generation.size().plot.pie(label="Generation Ratio of All Passengers", autopct='%1.1f%%', startangle=90)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()
Adult Children Elderly Unspecified Age
count 540 100 74 177

Population Pyramid

The population pyramid of the passengers. Naturally, this excludes the passengers whose age was missing, so the total count here will not be equal to that of the whole sample.

In [23]:
#Age bins, from zero to 75 years old with 5 years increment
age_bins = range(0,75,5)

#Group both genders according to the age bin
grouped_female_ages = get_count(female_passengers.groupby( pd.cut( female_passengers["Age"], np.arange(0, 80, 5) ) ))
grouped_male_ages =   get_count(male_passengers.groupby( pd.cut( male_passengers["Age"], np.arange(0, 80, 5) ) ))

#Get the highest count. Required so that both plots would keep the same scale when displayed
largest_x_value = max(grouped_female_ages.max(), grouped_male_ages.max())

#A helper function defined in the beginning of the file. Plots the age pyramid
plot_population_pyramid(age_bins, "Male Population", grouped_male_ages, "Female Population", grouped_female_ages, largest_x_value)

A noticeable mode is seen on the 20 years old bucket, in both genders.

Sample's Class Make Up

In [24]:
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'First Class':first_class_count,\
                   'Second Class':second_class_count,\
                   'Third Class':third_class_count}, index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by class
passengers_by_class.size().plot.pie(label="Passengers, by Class", autopct='%1.1f%%', startangle=90)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()
First Class Second Class Third Class
count 216 184 491

Sample's Survival

In [25]:
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'Survivors': get_count(surviving_passengers) ,\
                   'Victims':   get_count(victim_passengers) }, index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by class
#I am not goin to use the custom survival pie chart because I want to make the start angle = 90 degrees
passengers_by_survival.size().plot.pie(label="Passengers, by Survival", \
                                       autopct='%1.1f%%', \
                                       labels=['Victim', 'Survived'], \
                                       startangle=90,\
                                       colors=survival_colors)

plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()
Survivors Victims
count 342 549

Sample's Companionship (Passengers Traveling Alone vs Passengers Traveling With First Degree Family Member(s) )

In [26]:
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'In Group':get_count(surviving_passengers),\
                   'Solo   ': get_count(victim_passengers) }, index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by class
get_count(titanic_all_data.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
                                                                   label = "Companionship", \
                                                                   autopct='%1.1f%%', \
                                                                   startangle=90)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()
In Group Solo
count 342 549

Exploring The Data From the Generation Point of View

Sample's Generation, Broken Down by Gender

In [27]:
#Compute the unspecified age parameters.
unspecified_age_by_gender = passengers_by_generation.get_group("Unspecified age").groupby('Sex')
unspecified_age_male_count = get_count(unspecified_age_by_gender.get_group('male'))
unspecified_age_female_count = get_count(unspecified_age_by_gender.get_group('female'))

#Create the dataframe of Children, Adult, Elderly and Unspecified age's gender make up.
df = pd.DataFrame({'Children':[children_male_count, children_female_count], \
                   'Adult':[adult_male_count, adult_female_count],\
                   'Elderly':[elderly_male_count, elderly_female_count],\
                   'Unspecified Age':[unspecified_age_male_count, unspecified_age_female_count]}, \
                  index=["Male Count", "Female Count"])

#Display the dataframe as an HTML table
display_html(df)

#STart a multiplot plot
fig = plt.figure(figsize=(8,12))

#Children gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Child").groupby('Sex'), 411, 'Children', 'GENDER_GRAPH')

#Adults gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Adult").groupby('Sex'), 412, 'Adults', 'GENDER_GRAPH')

#Elderly  gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Elderly").groupby('Sex'), 413, 'Elderly', 'GENDER_GRAPH')


#Unspecified age gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Unspecified age").groupby('Sex'), 414, 'Unspecified Age', 'GENDER_GRAPH')

plt.tight_layout()
Adult Children Elderly Unspecified Age
Male Count 350 51 52 124
Female Count 49 49 22 53

Only the children passengers seemed to have a balanced proportion of both gender, all the rest have a higher proportion of men.

Survival, Broken Down by Generation

In [28]:
#Group each generation by survival.
children_by_survival = children_passengers.groupby('Survived')
adult_by_survival = adult_passengers.groupby('Survived')
elderly_by_survival = elderly_passengers.groupby('Survived')
unspecified_age_by_survival = passengers_by_generation.get_group("Unspecified age").groupby('Survived')

#Get the count of each group's survival
children_survived = get_count(children_by_survival.get_group(1))
children_victim = get_count(children_by_survival.get_group(0))

adult_survived = get_count(adult_by_survival.get_group(1))
adult_victim = get_count(adult_by_survival.get_group(0))

elderly_survived = get_count(elderly_by_survival.get_group(1))
elderly_victim = get_count(elderly_by_survival.get_group(0))

unspecified_age_survived = get_count(unspecified_age_by_survival.get_group(1))
unspecified_age_victim = get_count(unspecified_age_by_survival.get_group(0))



#Create the dataframe of Children, Adult, Elderly and Unspecified age's survival make up.
df = pd.DataFrame({'Children':[children_survived, children_victim], \
                   'Adult':[adult_survived, adult_victim],\
                   'Elderly':[elderly_survived, elderly_victim],\
                   'Unspecified Age':[unspecified_age_survived, unspecified_age_victim]}, \
                  index=["Survivors Count", "Victims Count"])

#Display the dataframe as an HTML table
display_html(df)

#Start a multiplot plot
fig = plt.figure(figsize=(8,12))

#Children gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Child").groupby('Survived'), 411, 'Children', 'SURVIVAL_GRAPH')

#Adults gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Adult").groupby('Survived'), 412, 'Adults', 'SURVIVAL_GRAPH')

#Elderly  gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Elderly").groupby('Survived'), 413, 'Elderly', 'SURVIVAL_GRAPH')

#Unspecified age gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Unspecified age").groupby('Survived'), 414, 'Unspecified', 'SURVIVAL_GRAPH')

plt.tight_layout()
Adult Children Elderly Unspecified Age
Survivors Count 208 55 27 52
Victims Count 332 45 47 125

Generation Survival, Broken Down by Gender

In [29]:
#####################
#Children
children_male_passengers = children_passengers[ children_passengers['Sex'] == 'male' ]
children_female_passengers = children_passengers[ children_passengers['Sex'] == 'female' ]

children_male_by_survival = children_male_passengers.groupby('Survived')
children_male_survived = get_count(children_male_by_survival.get_group(1))
children_male_victim = get_count(children_male_by_survival.get_group(0))

children_female_by_survival = children_female_passengers.groupby('Survived')
children_female_survived = get_count(children_female_by_survival.get_group(1))
children_female_victim = get_count(children_female_by_survival.get_group(0))


#####################
#Adults
adults_male_passengers = adult_passengers[ adult_passengers['Sex'] == 'male' ]
adults_female_passengers = adult_passengers[ adult_passengers['Sex'] == 'female' ]

adults_male_by_survival = adults_male_passengers.groupby('Survived')
adults_male_survived = get_count(adults_male_by_survival.get_group(1))
adults_male_victim = get_count(adults_male_by_survival.get_group(0))

adults_female_by_survival = adults_female_passengers.groupby('Survived')
adults_female_survived = get_count(adults_female_by_survival.get_group(1))
adults_female_victim = get_count(adults_female_by_survival.get_group(0))

#####################
#Elderly
elderly_male_passengers = elderly_passengers[ elderly_passengers['Sex'] == 'male' ]
elderly_female_passengers = elderly_passengers[ elderly_passengers['Sex'] == 'female' ]

elderly_male_by_survival = elderly_male_passengers.groupby('Survived')
elderly_male_survived = get_count(elderly_male_by_survival.get_group(1))
elderly_male_victim = get_count(elderly_male_by_survival.get_group(0))

elderly_female_by_survival = elderly_female_passengers.groupby('Survived')
elderly_female_survived = get_count(elderly_female_by_survival.get_group(1))
elderly_female_victim = get_count(elderly_female_by_survival.get_group(0))



#Create the dataframe of Children, Adult, Elderly and Unspecified age's gender make up.
df = pd.DataFrame({'Male Children':   [children_male_survived, children_male_victim], \
                   'Female Children': [children_female_survived, children_female_victim],\
                   'Male Adults':     [adults_male_survived, adults_male_victim],\
                   'Female Adults':   [adults_female_survived, adults_female_victim],\
                   'Male Elderly':    [elderly_male_survived, elderly_male_victim],\
                   'Female Elderly':  [elderly_female_survived, elderly_female_victim] }, index = ["Survived", "Victim"])


#Display the dataframe as an HTML table
display_html(df)


#Start a multiplot plot
fig = plt.figure(figsize=(15,12))

#Children pie plot
## Male children
draw_pie_subplot( children_male_by_survival, 321, 'Children Male', 'SURVIVAL_GRAPH')
##Female Children
draw_pie_subplot( children_female_by_survival, 322, 'Children Female', 'SURVIVAL_GRAPH')


#Adults pie plot
##Male Adults
draw_pie_subplot( adults_male_by_survival, 323, 'Adults Male', 'SURVIVAL_GRAPH')
##Female Adults
draw_pie_subplot( adults_female_by_survival, 324, 'Adults Female', 'SURVIVAL_GRAPH')


#Elderly  pie plot
##Male Elderly
draw_pie_subplot( elderly_male_by_survival, 325, 'Elderly Males', 'SURVIVAL_GRAPH')
##Female Elderly
draw_pie_subplot( elderly_female_by_survival, 326, 'Elderly Females', 'SURVIVAL_GRAPH')

plt.tight_layout()
Female Adults Female Children Female Elderly Male Adults Male Children Male Elderly
Survived 144 33 20 64 22 7
Victim 46 16 2 286 29 45

I find it interesting that even male children had a noticeably lower survival rate than female children.

Exploring The Sample From The Class Point of View

Population Pyramid of The Classes

In [30]:
#The code here follows the same logic for drawing the population pyramid for all the population above.
#But here, we are going to draw three graphs, one for each class
age_bins = range(0,75,5)

first_class_grouped_female_ages = get_count(first_class_female_passengers.groupby( pd.cut( first_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
first_class_grouped_male_ages =   get_count(first_class_male_passengers.groupby( pd.cut( first_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))

second_class_grouped_female_ages = get_count(second_class_female_passengers.groupby( pd.cut( second_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
second_class_grouped_male_ages =   get_count(second_class_male_passengers.groupby( pd.cut( second_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))

third_class_grouped_female_ages = get_count(third_class_female_passengers.groupby( pd.cut( third_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
third_class_grouped_male_ages =   get_count(third_class_male_passengers.groupby( pd.cut( third_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))



fig = plt.figure(figsize=(7,7))

first_max_x = max(first_class_grouped_female_ages.max(), first_class_grouped_male_ages.max())
second_max_x = max(second_class_grouped_female_ages.max(), second_class_grouped_male_ages.max())
third_max_x = max(third_class_grouped_female_ages.max(), third_class_grouped_male_ages.max())

largest_x_value = max( [first_max_x, second_max_x, third_max_x] )

#Draw the population pyramid for the first class
fig.add_subplot(311)
plot_population_pyramid(age_bins, \
                        "1st classs Male Population", \
                        first_class_grouped_male_ages, \
                        "1st class Female Population", \
                        first_class_grouped_female_ages, \
                        largest_x_value)

#Draw the population pyramid for the second class
fig.add_subplot(312)
plot_population_pyramid(age_bins, \
                        "2nd class Male Population", \
                        second_class_grouped_male_ages, \
                        "2nd class Female Population", \
                        second_class_grouped_female_ages, \
                        largest_x_value)

#Draw the population pyramid for the third class
fig.add_subplot(313)
plot_population_pyramid(age_bins, \
                        "3rd class Male Population", \
                        third_class_grouped_male_ages, \
                        "3rd class Female Population", \
                        third_class_grouped_female_ages, \
                        largest_x_value)
  • I was not able to remove the first 3 emtpy plots, I think it is maybe a bug on PyPlot. More inverstigations will be needed.

The third class' population pyramid is very skewed towards the male, with the largest spikes occur in between 15 to 30 years old.

Classes, By Gender Ratio

In [31]:
#Create the dataframe of each class' gender make up.
df = pd.DataFrame({'First Class':  [first_class_male_count, first_class_female_count], \
                   'Second Class': [second_class_male_count, second_class_female_count],\
                   'Third Class':  [third_class_male_count, third_class_female_count]},\
                   index = ["Male", "Female"])


#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(10,7))
#Plot the first class
draw_pie_subplot( first_class_passengers.groupby('Sex'), 131, 'First Class', 'GENDER_GRAPH')

#Plot the second class
draw_pie_subplot( second_class_passengers.groupby('Sex'), 132, 'Second Class', 'GENDER_GRAPH')

#Plot the third class
draw_pie_subplot( third_class_passengers.groupby('Sex'), 133, 'Third Class', 'GENDER_GRAPH')

plt.tight_layout()
First Class Second Class Third Class
Male 122 108 347
Female 94 76 144

Classes, By Generation Make Up

In [32]:
first_class_unspecified_age = first_class_count - (first_class_children_count + first_class_adult_count + first_class_elderly_count)
second_class_unspecified_age = second_class_count - (second_class_children_count + second_class_adult_count + second_class_elderly_count)
third_class_unspecified_age = third_class_count - (third_class_children_count + third_class_adult_count + third_class_elderly_count)

#Create the dataframe of Children, Adult, Elderly and Unspecified age's gender make up, by class.
df = pd.DataFrame({'Children':       [first_class_children_count, second_class_children_count, third_class_children_count], \
                   'Adults':         [first_class_adult_count, second_class_adult_count, third_class_adult_count],\
                   'Elderly':        [first_class_elderly_count, second_class_elderly_count, third_class_elderly_count],\
                   'Unspecified Age':[first_class_unspecified_age, second_class_unspecified_age, third_class_unspecified_age]},\
                   index = ["First Class", "Second Class", "Third Class"])


#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(9,12))

#First Class
draw_pie_subplot(first_class_passengers.groupby('Generation'), 311, 'First Class by Generation', 'DEFAULT')

#Second Class
draw_pie_subplot(second_class_passengers.groupby('Generation'), 312, 'Second Class by Generation', 'DEFAULT')

#Third Class
draw_pie_subplot(third_class_passengers.groupby('Generation'), 313, 'Third Class by Generation', 'DEFAULT')


plt.tight_layout()
Adults Children Elderly Unspecified Age
First Class 133 9 44 30
Second Class 133 21 19 11
Third Class 274 70 44 103

It might be worth noting that the majority of the passengers of unspecified age come from the third class.

Classes, By Companionship

Class Survival

In [33]:
#Separate each class passengers by survival
first_class_by_survival = first_class_passengers.groupby("Survived")
second_class_by_survival = second_class_passengers.groupby("Survived")
third_class_by_survival = third_class_passengers.groupby("Survived")

#Get the count of both survivors and victims, by each class
first_class_surviving_count = get_count(first_class_by_survival.get_group(1))
first_class_victims_count = get_count(first_class_by_survival.get_group(0))

second_class_surviving_count = get_count(second_class_by_survival.get_group(1))
second_class_victims_count = get_count(second_class_by_survival.get_group(0))

third_class_surviving_count = get_count(third_class_by_survival.get_group(1))
third_class_victims_count = get_count(third_class_by_survival.get_group(0))


#Create the dataframe of Children, Adult, Elderly and Unspecified age's gender make up, by class.
df = pd.DataFrame({'Survived': [first_class_surviving_count, second_class_surviving_count, third_class_surviving_count], \
                   'Died':     [first_class_victims_count, second_class_victims_count, third_class_victims_count]},\
                   index = ["First Class", "Second Class", "Third Class"])

#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(9,12))

#First Class
draw_pie_subplot( first_class_passengers.groupby('Survived'), 311, 'First Class by Survival', 'SURVIVAL_GRAPH')

#Second Class
draw_pie_subplot( second_class_passengers.groupby('Survived') , 312, 'Second Class by Survival', 'SURVIVAL_GRAPH')

#Third Class
draw_pie_subplot( third_class_passengers.groupby('Survived') , 313, 'Third Class by Survival', 'SURVIVAL_GRAPH')

plt.tight_layout()
Died Survived
First Class 80 136
Second Class 97 87
Third Class 372 119

Just visually, one can see that the better the class was, the higher the chances of survival were. But this is just superficially, from the pie charts. Other factors may be involved as well, like the total number of passengers and the gender makeup of each class (The percentage of males in the third class was much higher than the other two classes).

Survival: From The Point of View of The Generation And Gender

Class' Children Survival, Broken down By Both The Total And The Gender

In [34]:
#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_children_by_survival = first_class_children_passengers.groupby("Survived")
first_class_children_male_by_survival = first_class_children_male_passengers.groupby("Survived")
first_class_children_female_by_survival = first_class_children_female_passengers.groupby("Survived")

#####Second Class
second_class_children_by_survival = second_class_children_passengers.groupby("Survived")
second_class_children_male_by_survival = second_class_children_male_passengers.groupby("Survived")
second_class_children_female_by_survival = second_class_children_female_passengers.groupby("Survived")

#####Third Class
third_class_children_by_survival = third_class_children_passengers.groupby("Survived")
third_class_children_male_by_survival = third_class_children_male_passengers.groupby("Survived")
third_class_children_female_by_survival = third_class_children_female_passengers.groupby("Survived")

#Get the count of each category created in the cell above
first_class_children_surviving_count = get_surviving_count( first_class_children_by_survival)
first_class_children_victims_count = get_victim_count(first_class_children_by_survival)
first_class_children_male_surviving_count = get_surviving_count( first_class_children_male_by_survival)
first_class_children_male_victims_count = get_victim_count(first_class_children_male_by_survival)
first_class_children_female_surviving_count = get_surviving_count(first_class_children_female_by_survival)
first_class_children_female_victims_count = get_victim_count(first_class_children_female_by_survival)

second_class_children_surviving_count = get_surviving_count( second_class_children_by_survival)
second_class_children_victims_count = get_victim_count(second_class_children_by_survival)
second_class_children_male_surviving_count = get_surviving_count( second_class_children_male_by_survival)
second_class_children_male_victims_count = get_victim_count(second_class_children_male_by_survival)
second_class_children_female_surviving_count = get_surviving_count(second_class_children_female_by_survival)
second_class_children_female_victims_count = get_victim_count(second_class_children_female_by_survival)

third_class_children_surviving_count = get_surviving_count( third_class_children_by_survival)
third_class_children_victims_count = get_victim_count(third_class_children_by_survival)
third_class_children_male_surviving_count = get_surviving_count( third_class_children_male_by_survival)
third_class_children_male_victims_count = get_victim_count(third_class_children_male_by_survival)
third_class_children_female_surviving_count = get_surviving_count(third_class_children_female_by_survival)
third_class_children_female_victims_count = get_victim_count(third_class_children_female_by_survival)

#Create the dataframe of the classes' children by survival
df = pd.DataFrame({'Survived': [first_class_children_surviving_count, second_class_children_surviving_count, third_class_children_surviving_count], \
                   'Died':     [first_class_children_victims_count, second_class_children_victims_count, third_class_children_victims_count]},\
                   index = ["First Class", "Second Class", "Third Class"])

#Display the dataframe as an HTML table
display_html(df)

#Create the dataframe of the classes' children by survival
df = pd.DataFrame({'First Class':  [first_class_children_male_surviving_count, \
                                    first_class_children_male_victims_count, \
                                    first_class_children_female_surviving_count, \
                                    first_class_children_female_victims_count], \
                   
                   'Second Class': [second_class_children_male_surviving_count, \
                                    second_class_children_male_victims_count, \
                                    second_class_children_female_surviving_count, \
                                    second_class_children_female_victims_count],\
                   
                   'Third Class':  [third_class_children_male_surviving_count, \
                                    third_class_children_male_victims_count, \
                                    third_class_children_female_surviving_count, \
                                    third_class_children_female_victims_count]},\
                  
                   index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )

#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(15,12))

#First Class
draw_pie_subplot( first_class_children_by_survival , 331, 'First Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_children_male_by_survival , 332, 'First Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_children_female_by_survival , 333, 'First Class Female Children', 'SURVIVAL_GRAPH')

#Second Class
draw_pie_subplot( second_class_children_by_survival , 334, 'Second Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_children_male_by_survival , 335, 'Second Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_children_female_by_survival , 336, 'Second Class Female Children', 'SURVIVAL_GRAPH')

#Third Class
draw_pie_subplot( third_class_children_by_survival , 337, 'Third Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_children_male_by_survival , 338, 'Third Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_children_female_by_survival , 339, 'Third Class Female Children', 'SURVIVAL_GRAPH')
plt.tight_layout()
Died Survived
First Class 1 8
Second Class 2 19
Third Class 42 28
First Class Second Class Third Class
Male Survivor 3 9 10
Male Victim 0 2 27
Female Survivor 5 10 18
Female Victim 1 0 15

Total Children Survival, By Class

In [35]:
children_by_class = children_passengers.groupby('Pclass')

children_survived_by_class = children_by_class.apply(lambda t : t[t['Survived'] == 1]).groupby('Pclass')
children_drowned_by_class = children_by_class.apply(lambda t : t[t['Survived'] == 0]).groupby('Pclass')

fig = plt.figure(figsize=(9,5))

draw_pie_subplot( children_survived_by_class, 121, 'Surviving Children By Class', 'DEFAULT')

draw_pie_subplot( children_drowned_by_class, 122, 'Victim Children By Class', 'DEFAULT')

plt.tight_layout()