Name: Emad Takla

DAND P2

Dataset chosen: Titanic

Setup¶

Headers to be used throughout the project:¶

import unicodecsv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from IPython.display import HTML
from IPython.display import display_html

Special Commands¶

#Plot all figures within the html page
%pylab inline

#Make display better, like printing dataframes as tables
pd.set_option('display.notebook_repr_html', True)

Populating the interactive namespace from numpy and matplotlib

Importing Data¶

titanic_all_data = pd.read_csv('./titanic_data.csv')

Supporting Functions:¶

A function to map age into generation buckets: Child (0 to 16 years old), Adult (17 to 49 years old) and elderly (50 years old and over)

def age_to_generation(age):
    #Check if the input is negative, return an error
    if age < 0:
        return 'Invalid age'
    #If younger than 17, then it is classified as a child
    if age < 17:
        return 'Child'
    #Otherwise, if it is between 17 and 49 (both inclusive), the passed age is for an adult
    if age < 50:
        return 'Adult'
    #Otherwise, at 50 years old or over, the person is elderly
    if age >= 50:
        return 'Elderly'
    #Else, the input age was unspecified (Blank cell, NaN)
    return 'Unspecified age'
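As a quick check of the bucket boundaries, the sketch below repeats the function and applies it to a few hand-picked ages (illustrative inputs, not values taken from the dataset). NaN reaches the final return because every comparison with NaN evaluates to False.

```python
import numpy as np

def age_to_generation(age):
    # Same logic as the function above, repeated so the sketch runs on its own
    if age < 0:
        return 'Invalid age'
    if age < 17:
        return 'Child'
    if age < 50:
        return 'Adult'
    if age >= 50:
        return 'Elderly'
    # All four comparisons are False for NaN, so NaN ends up here
    return 'Unspecified age'

# Hand-picked, hypothetical inputs that sit on each boundary
sample_ages = [-1, 0, 16, 17, 49, 50, np.nan]
labels = [age_to_generation(age) for age in sample_ages]
print(labels)
```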

A function to extract the last name from the 'Name' column. The assumption is that the format is as follows: 'last_name, title. first_name (Alias/Maiden Name)'

def get_last_name(full_name):
    return full_name.split(',')[0]

A function to plot a population pyramid, a horizontal histogram bar chart plotted back to back

# Plotting code coming from http://stackoverflow.com/questions/27694221/using-python-libraries-to-plot-two-horizontal-bar-charts-sharing-same-y-axis
#Binning Code coming from http://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values

def plot_population_pyramid(bins, left_title, left_data, right_title, right_data, largest_x_value):
    fig, axes = plt.subplots(ncols=2, sharey=True)
    
    #largest_x_value:
    #this variable will be used to preserve scale. If not, the two graphs will extend to the maximum value of the dataset,
    #and will be visually misleading
    #largest_x_value = max(left_data.max(), right_data.max())
    
    axes[0].barh(bins, left_data, align='center')
    axes[0].set(title=left_title)
    axes[0].set_xlim(0, largest_x_value)
    
    axes[1].barh(bins, right_data, align='center')
    axes[1].set(title=right_title)
    axes[1].set_xlim(0, largest_x_value)
    
    axes[0].invert_xaxis()
    axes[0].set(yticks=bins)
    axes[0].yaxis.tick_right()

Two functions to count the number of survivors/victims within a passed GroupBy object. The functions accept a GroupBy object, not a DataFrame!

The need for these functions arose because a groupby input can, by chance, contain only victims or only survivors, in which case the group for the opposite value does not exist. This caused lookup errors when I programmatically looped over the groups, so I had to check that the group was present in the data structure before using it.

def get_surviving_count(group_obj):
    #The values in Survived are either 1 or 0, with 1 indicating survivors
    if 1 in group_obj.groups.keys():
        return group_obj.get_group(1)['PassengerId'].count()
    else:
        return 0

def get_victim_count(group_obj):
    #The values in Survived are either 1 or 0, with 0 indicating victims
    if 0 in group_obj.groups.keys():
        return group_obj.get_group(0)['PassengerId'].count()
    else:
        return 0
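The failure mode described above can be reproduced on a tiny made-up frame: get_group raises a KeyError when the requested key is absent. One possible alternative, shown only as a sketch and not as the approach used in this notebook, is to count with size() and reindex over both outcomes with a fill value of zero.

```python
import pandas as pd

# Made-up data in which every passenger survived, so group 0 does not exist
toy = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [1, 1, 1]})
grouped = toy.groupby('Survived')

try:
    grouped.get_group(0)
    missing_group_raised = False
except KeyError:
    missing_group_raised = True

# size() counts rows per group; reindex adds the absent outcome with a count of 0
counts = grouped.size().reindex([0, 1], fill_value=0)
victim_count = int(counts.loc[0])
survivor_count = int(counts.loc[1])
print(missing_group_raised, victim_count, survivor_count)
```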

Graphing Helper Functions and Variables.

gender_colors = ['hotpink','dodgerblue']
survival_colors = ['red', 'limegreen']

The following function accepts a groupby object that must group its original input only by Survived (i.e. at most two groups: group 0 (victim) and group 1 (survived)).

def plot_survival_pie_chart(group_obj, label):
    graph_label = []
    survival_colors = []
    
    if 0 in group_obj.groups.keys():
        graph_label.append('Victim')
        survival_colors.append('red')
    if 1 in group_obj.groups.keys():
        graph_label.append('Survived')
        survival_colors.append('limegreen')
        
    group_obj['PassengerId'].count().plot.pie(label = label, autopct='%1.1f%%', colors=survival_colors,labels=graph_label)

A function to get the count of passengers in a passed dataframe

def get_count(df):
    return df['PassengerId'].count()

A function to draw one pie chart as a subplot. It relies on a figure named fig that the calling cell must create beforehand. It will be overloaded to have a version where we can pass the labels parameter.

def draw_pie_subplot(groupby_data, subplot_position, graph_label,  graph_type):
    graph_colors = []
    if(graph_type == 'SURVIVAL_GRAPH'):
        graph_colors = survival_colors
    elif(graph_type == 'GENDER_GRAPH'):
        graph_colors = gender_colors

    fig.add_subplot(subplot_position)
    
    #If it's a survival graph, use the already created function to plot it
    if (graph_type == 'SURVIVAL_GRAPH'):
        plot_survival_pie_chart(groupby_data ,graph_label )
        
    #Default colors. Do not pass the colors parameter to the plotting function
    elif graph_type == 'DEFAULT':
        get_count(groupby_data).plot.pie(label = graph_label, autopct='%1.1f%%')
        
    else:
        get_count(groupby_data).plot.pie(label = graph_label, \
                                     autopct='%1.1f%%',\
                                     colors = graph_colors)
    plt.axis('equal')

Questions¶

Q1: What is the Effect of Traveling With First Degree Relatives on the Survival of a Passenger?¶

http://www.durhamcollege.ca/wp-content/uploads/STAT_nullalternate_hypothesis.pdf

The question that will be investigated in this analysis is: if a passenger was traveling without family (i.e. both SibSp and Parch were equal to zero), did they have a higher or lower chance of survival? Was having family on board an advantage, a disadvantage, or irrelevant for the survival of a passenger?

Null Hypothesis: There is NO difference in the survival rate of passengers traveling with their immediate family and that of passengers traveling alone.

Alternate Hypothesis: There IS a difference in the survival rate of passengers traveling with their immediate family and that of passengers traveling alone.

**α: 0.05**
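The text does not say which test is used to compare the two survival rates. One common choice for two categorical variables is a chi-square test of independence, sketched below on made-up data. The choice of test and the toy values are assumptions; with counts this small the chi-square approximation is not trustworthy, so only the shape of the computation is illustrated.

```python
import pandas as pd
import scipy.stats

# Made-up passengers; the real analysis would use titanic_all_data instead
toy = pd.DataFrame({
    'isSolo':   [True, True, True, True, False, False, False, False, True, False],
    'Survived': [0,    0,    1,    0,    1,     1,     0,     1,     0,    1],
})

# Contingency table: rows are isSolo (False/True), columns are Survived (0/1)
observed = pd.crosstab(toy['isSolo'], toy['Survived'])

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

# Reject the null hypothesis only when the p-value falls below alpha
alpha = 0.05
reject_null = p_value < alpha
print(observed)
print(p_value, reject_null)
```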

Q2: Is It Possible From The Provided Data To Identify Same Family Members?¶

This is more of a data investigation question than a statistical inference one. The data provides the last names, ages, traveling class and companionship counts (SibSp: siblings and spouses; Parch: parents and children). Are these enough to make (partial) educated guesses about family members? And if that succeeds, can we infer whether having a bigger family improved the chances of survival? (Possibly, since the rest of the family would pressure the crew to allow a left-behind family member onto the lifeboats.)

Data Wrangling¶

New Fields to the Data¶

An interesting feature to show is the generation to which the passenger belongs. There shall be three categories for that parameter (Child, Adult and Elderly) based on the passenger's age range. The breakdown is as follows:

Child: 16 years old >= Age >= 0 years old
Adult: 49 years old >= Age >= 17 years old
Elderly: Age >= 50 years old

Missing values (NaN, blank cells) will be noted as "Unspecified age", and negative numbers as "Invalid age".

The 'isSolo' field will be used to see whether a passenger is traveling with family or not. Here are the criteria used:

A solo traveler is a traveler whose SibSp and Parch fields are both equal to zero.
A child cannot be a solo traveler, even if the SibSp and Parch fields are both equal to zero.
A word of caution when interpreting the isSolo field: it does not mean that the traveler was traveling totally alone; they could be traveling with friends, for example.

The total number of companions is the sum of the fields Parch and SibSp. It will be used for creating some descriptive statistics about companionship.

The 'LastName' field simply extracts the last name of the passenger from the full name provided. This will be useful in answering the second question.
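To illustrate how the last name could feed into the second question, here is a minimal sketch on made-up passengers. The heuristic (same last name, same class, at least one declared relative) is an assumption for illustration only, and shared surnames among unrelated passengers would produce false matches.

```python
import pandas as pd

def get_last_name(full_name):
    # Same helper as defined earlier, repeated so the sketch runs on its own
    return full_name.split(',')[0]

# Made-up passengers, not rows from the dataset
toy = pd.DataFrame({
    'Name':   ['Doe, Mr. John', 'Doe, Mrs. John (Jane Roe)', 'Doe, Mr. Sam', 'Roe, Miss. Ann'],
    'Pclass': [3, 3, 1, 2],
    'SibSp':  [1, 1, 0, 0],
    'Parch':  [0, 0, 0, 0],
})
toy['LastName'] = toy['Name'].apply(get_last_name)
toy['TotalCompanions'] = toy['SibSp'] + toy['Parch']

# Assumed heuristic: keep passengers with declared relatives, then look for
# groups that share both a last name and a traveling class
with_relatives = toy[toy['TotalCompanions'] > 0]
family_sizes = with_relatives.groupby(['LastName', 'Pclass']).size()
candidate_families = family_sizes[family_sizes > 1]
print(candidate_families)
```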

#Adding the generation data field to the table. The generation can have three values: Child, Adult or Elderly
#The buckets ranges were arbitrarily chosen 
titanic_all_data['Generation']  = titanic_all_data.loc[:,'Age'].apply(age_to_generation)

"""
If the traveler is a child (under 17 years old), then automatically they are not a solo traveler (they can be traveling with a 
nanny or a close family friend, etc.). Or, if the traveler has a non-zero value in any of SibSp and Parch, then they are not 
solo. Otherwise, they are a solo passenger, traveling alone.
"""
#This one was tough, using the 'and' operator raised a ValueError, and it took me a while to find out that
#I should substitute it with bitwise &
titanic_all_data['isSolo'] = (titanic_all_data['SibSp'] + titanic_all_data['Parch'] == 0) & \
                             (titanic_all_data['Generation'] != 'Child')

titanic_all_data['TotalCompanions'] = titanic_all_data['Parch'] + titanic_all_data['SibSp']


#Stripping the last name of the passengers. This will be used to determine which passengers probably belong to the same family
titanic_all_data['LastName'] = titanic_all_data.loc[:,'Name'].apply(get_last_name)
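The ValueError mentioned in the comment on the isSolo line can be reproduced on a small made-up frame. Python's and asks each Series for a single truth value, which pandas refuses to give, while & combines the two boolean Series element by element.

```python
import pandas as pd

# Made-up rows: a lone adult, an adult with a sibling/spouse, a lone child
toy = pd.DataFrame({
    'SibSp':      [0, 1, 0],
    'Parch':      [0, 0, 0],
    'Generation': ['Adult', 'Adult', 'Child'],
})

no_relatives = (toy['SibSp'] + toy['Parch'] == 0)
not_child = (toy['Generation'] != 'Child')

try:
    no_relatives and not_child   # the truth value of a Series is ambiguous
    and_raised = False
except ValueError:
    and_raised = True

# Element-wise logical AND; the parentheses around each comparison matter
# because & binds more tightly than == and !=
is_solo = no_relatives & not_child
print(and_raised, is_solo.tolist())
```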

Data Slicing¶

Extracting General Data About the Passengers¶

#Passengers by Gender
male_passengers = titanic_all_data[titanic_all_data['Sex'] == 'male'] 
female_passengers = titanic_all_data[titanic_all_data['Sex'] == 'female'] 

#Passengers by Generation
children_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Child' ]
adult_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Adult' ]
elderly_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Elderly' ]

#Passengers by Survival
surviving_passengers = titanic_all_data[ titanic_all_data['Survived'] == 1 ]
victim_passengers = titanic_all_data[ titanic_all_data['Survived'] == 0 ]

Data Groups:¶

In my opinion, groups are very straightforward when it comes to plotting, but slicing dataframes is better for calculations. It is just easier for me.

passengers_by_gender = titanic_all_data.groupby('Sex')
passengers_by_generation = titanic_all_data.groupby('Generation')
passengers_by_class = titanic_all_data.groupby('Pclass')
passengers_by_survival = titanic_all_data.groupby('Survived')
passengers_by_companionship = titanic_all_data.groupby('isSolo')

passengers_by_class_and_gender = titanic_all_data.groupby(['Pclass', 'Sex'])
passengers_by_class_and_generation = titanic_all_data.groupby(['Pclass', 'Generation'])
passengers_by_class_and_survival = titanic_all_data.groupby(['Pclass', 'Survived'])
passengers_by_generation_and_gender = titanic_all_data.groupby(['Generation', 'Sex'])

Slicing The Passengers' Parameters According to their Traveling Class¶

I have followed the instructions in the previous review by using groupby instead of slicing. However, I still cannot see the advantage in that. In fact, I think that the first way was more readable. For example: first_class_passengers[ first_class_passengers['Survived'] == 0 ] vs passengers_by_class_and_survival.get_group( (1,0) )
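For comparison, the sketch below runs both styles on made-up data and checks that they select the same passengers. It also shows one thing groupby offers that slicing does not: a single call that returns the count of every (class, survival) combination at once. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Made-up passengers
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Pclass':      [1, 1, 1, 2, 3],
    'Survived':    [0, 1, 0, 0, 1],
})

# Style 1: boolean-mask slicing
first_class = toy[toy['Pclass'] == 1]
victims_by_slicing = first_class[first_class['Survived'] == 0]

# Style 2: groupby with a tuple key (Pclass, Survived)
by_class_and_survival = toy.groupby(['Pclass', 'Survived'])
victims_by_groupby = by_class_and_survival.get_group((1, 0))

same_ids = list(victims_by_slicing['PassengerId']) == list(victims_by_groupby['PassengerId'])

# One call gives the size of every combination present in the data
all_counts = by_class_and_survival.size()
print(same_ids)
print(all_counts)
```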

#Passengers by class
########################################################################################################################
#First Class Data Splitting:
first_class_passengers = passengers_by_class.get_group(1)

#Gender                  
first_class_male_passengers = passengers_by_class_and_gender.get_group( (1,'male') )
first_class_female_passengers = passengers_by_class_and_gender.get_group( (1,'female') ) 
#Children
first_class_children_passengers = passengers_by_class_and_generation.get_group( (1, 'Child') )
first_class_children_male_passengers = first_class_children_passengers.groupby('Sex').get_group('male')
first_class_children_female_passengers = first_class_children_passengers.groupby('Sex').get_group('female')
#Adults
first_class_adult_passengers = passengers_by_class_and_generation.get_group( (1, 'Adult') )
first_class_adult_male_passengers = first_class_adult_passengers.groupby('Sex').get_group('male')
first_class_adult_female_passengers = first_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
first_class_elderly_passengers = passengers_by_class_and_generation.get_group( (1, 'Elderly') )
first_class_elderly_male_passengers = first_class_elderly_passengers.groupby('Sex').get_group('male')
first_class_elderly_female_passengers = first_class_elderly_passengers.groupby('Sex').get_group('female')

#First Class Survival
first_class_survivors = passengers_by_class_and_survival.get_group( (1,1) )
first_class_victims = passengers_by_class_and_survival.get_group( (1,0) )

########################################################################################################################
#Second Class Data Splitting:
second_class_passengers = passengers_by_class.get_group(2)

#Gender                  
second_class_male_passengers = passengers_by_class_and_gender.get_group( (2,'male') )
second_class_female_passengers = passengers_by_class_and_gender.get_group( (2,'female') ) 
#Children
second_class_children_passengers = passengers_by_class_and_generation.get_group( (2, 'Child') )
second_class_children_male_passengers = second_class_children_passengers.groupby('Sex').get_group('male')
second_class_children_female_passengers = second_class_children_passengers.groupby('Sex').get_group('female')
#Adults
second_class_adult_passengers = passengers_by_class_and_generation.get_group( (2, 'Adult') )
second_class_adult_male_passengers = second_class_adult_passengers.groupby('Sex').get_group('male')
second_class_adult_female_passengers = second_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
second_class_elderly_passengers = passengers_by_class_and_generation.get_group( (2, 'Elderly') )
second_class_elderly_male_passengers = second_class_elderly_passengers.groupby('Sex').get_group('male')
second_class_elderly_female_passengers = second_class_elderly_passengers.groupby('Sex').get_group('female')

#second Class Survival
second_class_survivors = passengers_by_class_and_survival.get_group( (2,1) )
second_class_victims = passengers_by_class_and_survival.get_group( (2,0) )

########################################################################################################################
#Third Class Data Splitting:
third_class_passengers = passengers_by_class.get_group(3)

#Gender                  
third_class_male_passengers = passengers_by_class_and_gender.get_group( (3,'male') )
third_class_female_passengers = passengers_by_class_and_gender.get_group( (3,'female') ) 
#Children
third_class_children_passengers = passengers_by_class_and_generation.get_group( (3, 'Child') )
third_class_children_male_passengers = third_class_children_passengers.groupby('Sex').get_group('male')
third_class_children_female_passengers = third_class_children_passengers.groupby('Sex').get_group('female')
#Adults
third_class_adult_passengers = passengers_by_class_and_generation.get_group( (3, 'Adult') )
third_class_adult_male_passengers = third_class_adult_passengers.groupby('Sex').get_group('male')
third_class_adult_female_passengers = third_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
third_class_elderly_passengers = passengers_by_class_and_generation.get_group( (3, 'Elderly') )
third_class_elderly_male_passengers = third_class_elderly_passengers.groupby('Sex').get_group('male')
third_class_elderly_female_passengers = third_class_elderly_passengers.groupby('Sex').get_group('female')

#third Class Survival
third_class_survivors = passengers_by_class_and_survival.get_group( (3,1) )
third_class_victims = passengers_by_class_and_survival.get_group( (3,0) )

Data Exploration and Analysis¶

Miscellaneous Data Count Computation¶

sample_size = get_count(titanic_all_data)
male_count = get_count(passengers_by_gender.get_group('male'))
female_count = get_count(passengers_by_gender.get_group('female'))

children_count = get_count(children_passengers)
children_male_count = get_count(passengers_by_generation_and_gender.get_group(('Child', 'male')) )
children_female_count = get_count(passengers_by_generation_and_gender.get_group(('Child', 'female')) )

adult_count = get_count(adult_passengers)
adult_male_count = get_count( passengers_by_generation_and_gender.get_group(('Adult', 'male')) )
adult_female_count = get_count( passengers_by_generation_and_gender.get_group(('Adult', 'female')) )

elderly_count = get_count(elderly_passengers)
elderly_male_count = get_count( passengers_by_generation_and_gender.get_group(('Elderly', 'male')) )
elderly_female_count = get_count( passengers_by_generation_and_gender.get_group(('Elderly', 'female')) )

first_class_count = get_count(first_class_passengers)
first_class_male_count = get_count(first_class_male_passengers)
first_class_female_count = get_count(first_class_female_passengers)
first_class_children_count = get_count(first_class_children_passengers)
first_class_children_male_count = get_count(first_class_children_male_passengers)
first_class_children_female_count = get_count(first_class_children_female_passengers)
first_class_adult_count = get_count(first_class_adult_passengers)
first_class_adult_male_count = get_count(first_class_adult_male_passengers)
first_class_adult_female_count = get_count(first_class_adult_female_passengers)
first_class_elderly_count = get_count(first_class_elderly_passengers)
first_class_elderly_male_count = get_count(first_class_elderly_male_passengers)
first_class_elderly_female_count = get_count(first_class_elderly_female_passengers)

second_class_count = get_count(second_class_passengers)
second_class_male_count = get_count(second_class_male_passengers)
second_class_female_count = get_count(second_class_female_passengers)
second_class_children_count = get_count(second_class_children_passengers)
second_class_children_male_count = get_count(second_class_children_male_passengers)
second_class_children_female_count = get_count(second_class_children_female_passengers)
second_class_adult_count = get_count(second_class_adult_passengers)
second_class_adult_male_count = get_count(second_class_adult_male_passengers)
second_class_adult_female_count = get_count(second_class_adult_female_passengers)
second_class_elderly_count = get_count(second_class_elderly_passengers)
second_class_elderly_male_count = get_count(second_class_elderly_male_passengers)
second_class_elderly_female_count = get_count(second_class_elderly_female_passengers)

third_class_count = get_count(third_class_passengers)
third_class_male_count = get_count(third_class_male_passengers)
third_class_female_count = get_count(third_class_female_passengers)
third_class_children_count = get_count(third_class_children_passengers)
third_class_children_male_count = get_count(third_class_children_male_passengers)
third_class_children_female_count = get_count(third_class_children_female_passengers)
third_class_adult_count = get_count(third_class_adult_passengers)
third_class_adult_male_count = get_count(third_class_adult_male_passengers)
third_class_adult_female_count = get_count(third_class_adult_female_passengers)
third_class_elderly_count = get_count(third_class_elderly_passengers)
third_class_elderly_male_count = get_count(third_class_elderly_male_passengers)
third_class_elderly_female_count = get_count(third_class_elderly_female_passengers)
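The long list of per-class counts above is easy to get wrong through copy and paste. As a hedged alternative, the same numbers could be read from one grouped size table, sketched here on made-up data. Series.get with a default covers combinations that do not occur.

```python
import pandas as pd

# Made-up passengers
toy = pd.DataFrame({
    'Pclass':     [1, 1, 2, 3, 3, 3],
    'Sex':        ['male', 'female', 'male', 'male', 'female', 'male'],
    'Generation': ['Adult', 'Child', 'Adult', 'Child', 'Adult', 'Elderly'],
})

# One Series indexed by (Pclass, Generation, Sex) holding every count
counts = toy.groupby(['Pclass', 'Generation', 'Sex']).size()

third_class_adult_female_count = int(counts.loc[(3, 'Adult', 'female')])

# A combination absent from the data falls back to the default of 0
first_class_elderly_female_count = counts.get((1, 'Elderly', 'female'), 0)
print(third_class_adult_female_count, first_class_elderly_female_count)
```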

Finding Out Missing Fields¶

#The PassengerId is present for all the passengers, but some columns have missing values. Which columns 
#are they?
for column_id in titanic_all_data.columns.values:
    column_non_NaN_count = titanic_all_data[column_id].count()
    if column_non_NaN_count != sample_size:
        print str(column_id) + ": missing " + str(sample_size - column_non_NaN_count) + " values"

Age: missing 177 values
Cabin: missing 687 values
Embarked: missing 2 values
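The same report can be produced without an explicit loop by summing the boolean mask returned by isnull(). The sketch below uses a made-up frame, so its numbers are not those of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age':         [22.0, np.nan, 35.0],
    'Cabin':       [np.nan, np.nan, 'C85'],
})

# isnull() marks missing cells as True; summing counts them per column
missing_per_column = toy.isnull().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```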

I am going to leave the blank cells empty until a need to fill them arises. For the time being, I am not sure, for example, whether I should fill the missing ages with the mean age or just treat them as zeros. Let us wait and see what need arises.

A word of caution about the ages in this data set¶

Not all passengers have known ages. An assumption: the passengers whose age is not known are evenly distributed across all age ranges, so the effect of missing this piece of information is minimal. Out of curiosity, the percentages of survivors and victims whose ages are unknown are provided below:

#Get the count of both survivors and victims whose age is null
survivors_without_age_count = surviving_passengers['Age'].isnull().sum()
victims_without_age_count = victim_passengers['Age'].isnull().sum()

#Count the total of survivors and victims, so we can calculate the percentage of passengers with missing ages
total_survivors_count = get_count(surviving_passengers)
total_victims_count = get_count(victim_passengers)

print "Percentage of survivors with missing age: ", float(survivors_without_age_count)*100/total_survivors_count,"%"
print "Percentage of victims with missing age:   ", float(victims_without_age_count)*100/total_victims_count,"%"

Percentage of survivors with missing age:  15.2046783626 %
Percentage of victims with missing age:    22.7686703097 %
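A more compact way to obtain the same kind of percentages is to take the mean of the missing-age mask within each survival group, since the mean of a boolean column is the share of True values. This is a sketch on made-up data.

```python
import numpy as np
import pandas as pd

# Made-up passengers: 2 survivors (1 without age), 4 victims (1 without age)
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 0, 0],
    'Age':      [20.0, np.nan, np.nan, 30.0, 40.0, 50.0],
})

# Share of missing ages per survival outcome, expressed as a percentage
missing_pct = toy['Age'].isnull().groupby(toy['Survived']).mean() * 100
print(missing_pct)
```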

Demographics¶

Some default column names will be used as dataframe columns. These dataframes will be used to display the data in an HTML table format. Although there are alternatives, using a dataframe to display tables was the easiest one.

Quick Statistics, All The Sample¶

Sample's Gender Make Up¶

#Rows to be displayed
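#A list (not a set) is used for the row, so the three values keep their order and match the column names
rows = [[male_count, female_count, sample_size]]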

#create the data frame, in preparation for HTML table display
df = pd.DataFrame(rows, columns=['Male', 'Female',"Total"], index=["count"])

#Display the dataframe as an HTML table. I had to use this function since just writing "df" on a single line
#will not display the table when there is a plot coming afterwards in the same cell.
display_html(df)

#Plot the pie chart, all passengers by gender
passengers_by_gender.size().plot.pie(label="Gender Ratio of All Passengers", autopct='%1.1f%%', colors=gender_colors)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()

Sample's Generation Make Up:¶

#I had a problem creating the dataframe in the same manner as the previous one, since the order was not preserved 
#(ie Children count was below column elderly for example)
unspecified_age_count = get_count(passengers_by_generation.get_group("Unspecified age"))

#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'Children':children_count, 'Adult':adult_count,'Elderly':elderly_count,'Unspecified Age':unspecified_age_count},\
                   index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by their generation
passengers_by_generation.size().plot.pie(label="Generation Ratio of All Passengers", autopct='%1.1f%%', startangle=90)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()

Population Pyramid¶

The population pyramid of the passengers. Naturally, this excludes the passengers whose age was missing, so the total count here will not be equal to that of the whole sample.

#Age bins: lower edges from 0 to 70 in 5-year increments, one per interval produced by np.arange(0, 80, 5) below
age_bins = range(0,75,5)

#Group both genders according to the age bin
grouped_female_ages = get_count(female_passengers.groupby( pd.cut( female_passengers["Age"], np.arange(0, 80, 5) ) ))
grouped_male_ages =   get_count(male_passengers.groupby( pd.cut( male_passengers["Age"], np.arange(0, 80, 5) ) ))

#Get the highest count. Required so that both plots would keep the same scale when displayed
largest_x_value = max(grouped_female_ages.max(), grouped_male_ages.max())

#A helper function defined in the beginning of the file. Plots the age pyramid
plot_population_pyramid(age_bins, "Male Population", grouped_male_ages, "Female Population", grouped_female_ages, largest_x_value)

A noticeable mode is seen in the 20-year-old bucket, in both genders.
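When reading the pyramid it helps to know how pd.cut assigns ages to buckets. By default the intervals are open on the left and closed on the right, so an age of exactly 20 falls in (15, 20] while 21 falls in (20, 25]. The ages below are hand-picked for illustration.

```python
import numpy as np
import pandas as pd

edges = np.arange(0, 80, 5)    # 0, 5, ..., 75, giving 15 intervals
ages = pd.Series([20.0, 21.0, 25.0, 74.0])   # hand-picked ages

binned = pd.cut(ages, edges)

# Each element is an Interval with left (exclusive) and right (inclusive) edges
bounds = [(interval.left, interval.right) for interval in binned]
print(bounds)

# The 15 intervals line up with the 15 bar positions in range(0, 75, 5)
interval_count = len(edges) - 1
bar_count = len(range(0, 75, 5))
```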

Sample's Class Make Up¶

#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'First Class':first_class_count,\
                   'Second Class':second_class_count,\
                   'Third Class':third_class_count}, index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by class
passengers_by_class.size().plot.pie(label="Passengers, by Class", autopct='%1.1f%%', startangle=90)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()

Sample's Survival¶

#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'Survivors': get_count(surviving_passengers) ,\
                   'Victims':   get_count(victim_passengers) }, index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by survival
#I am not going to use the custom survival pie chart because I want to make the start angle = 90 degrees
passengers_by_survival.size().plot.pie(label="Passengers, by Survival", \
                                       autopct='%1.1f%%', \
                                       labels=['Victim', 'Survived'], \
                                       startangle=90,\
                                       colors=survival_colors)

plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()

Sample's Companionship (Passengers Traveling Alone vs Passengers Traveling With First Degree Family Member(s) )¶

#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'In Group':get_count(passengers_by_companionship.get_group(False)),\
                   'Solo   ': get_count(passengers_by_companionship.get_group(True)) }, index=["count"])

#Display the table
display_html(df)

#Plot the pie chart of all the passengers, grouped by companionship
get_count(titanic_all_data.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
                                                                   label = "Companionship", \
                                                                   autopct='%1.1f%%', \
                                                                   startangle=90)
plt.axis('equal')

#This command removes some floating numbers that appears
plt.tight_layout()

Exploring The Data From the Generation Point of View¶

Sample's Generation, Broken Down by Gender¶

#Compute the unspecified age parameters.
unspecified_age_by_gender = passengers_by_generation.get_group("Unspecified age").groupby('Sex')
unspecified_age_male_count = get_count(unspecified_age_by_gender.get_group('male'))
unspecified_age_female_count = get_count(unspecified_age_by_gender.get_group('female'))

#Create the dataframe of Children, Adult, Elderly and Unspecified age's gender make up.
df = pd.DataFrame({'Children':[children_male_count, children_female_count], \
                   'Adult':[adult_male_count, adult_female_count],\
                   'Elderly':[elderly_male_count, elderly_female_count],\
                   'Unspecified Age':[unspecified_age_male_count, unspecified_age_female_count]}, \
                  index=["Male Count", "Female Count"])

#Display the dataframe as an HTML table
display_html(df)

#Start a multiplot plot
fig = plt.figure(figsize=(8,12))

#Children gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Child").groupby('Sex'), 411, 'Children', 'GENDER_GRAPH')

#Adults gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Adult").groupby('Sex'), 412, 'Adults', 'GENDER_GRAPH')

#Elderly  gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Elderly").groupby('Sex'), 413, 'Elderly', 'GENDER_GRAPH')


#Unspecified age gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Unspecified age").groupby('Sex'), 414, 'Unspecified Age', 'GENDER_GRAPH')

plt.tight_layout()

Only the children seem to have a balanced proportion of both genders; all the other groups have a higher proportion of men.

Survival, Broken Down by Generation¶

#Group each generation by survival.
children_by_survival = children_passengers.groupby('Survived')
adult_by_survival = adult_passengers.groupby('Survived')
elderly_by_survival = elderly_passengers.groupby('Survived')
unspecified_age_by_survival = passengers_by_generation.get_group("Unspecified age").groupby('Survived')

#Get the count of each group's survival
children_survived = get_count(children_by_survival.get_group(1))
children_victim = get_count(children_by_survival.get_group(0))

adult_survived = get_count(adult_by_survival.get_group(1))
adult_victim = get_count(adult_by_survival.get_group(0))

elderly_survived = get_count(elderly_by_survival.get_group(1))
elderly_victim = get_count(elderly_by_survival.get_group(0))

unspecified_age_survived = get_count(unspecified_age_by_survival.get_group(1))
unspecified_age_victim = get_count(unspecified_age_by_survival.get_group(0))



#Create the dataframe of Children, Adult, Elderly and Unspecified age's survival make up.
df = pd.DataFrame({'Children':[children_survived, children_victim], \
                   'Adult':[adult_survived, adult_victim],\
                   'Elderly':[elderly_survived, elderly_victim],\
                   'Unspecified Age':[unspecified_age_survived, unspecified_age_victim]}, \
                  index=["Survivors Count", "Victims Count"])

#Display the dataframe as an HTML table
display_html(df)

#Start a multiplot plot
fig = plt.figure(figsize=(8,12))

#Children survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Child").groupby('Survived'), 411, 'Children', 'SURVIVAL_GRAPH')

#Adults survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Adult").groupby('Survived'), 412, 'Adults', 'SURVIVAL_GRAPH')

#Elderly survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Elderly").groupby('Survived'), 413, 'Elderly', 'SURVIVAL_GRAPH')

#Unspecified age survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Unspecified age").groupby('Survived'), 414, 'Unspecified', 'SURVIVAL_GRAPH')

plt.tight_layout()

Generation Survival, Broken Down by Gender¶

#####################
#Children
children_male_passengers = children_passengers[ children_passengers['Sex'] == 'male' ]
children_female_passengers = children_passengers[ children_passengers['Sex'] == 'female' ]

children_male_by_survival = children_male_passengers.groupby('Survived')
children_male_survived = get_count(children_male_by_survival.get_group(1))
children_male_victim = get_count(children_male_by_survival.get_group(0))

children_female_by_survival = children_female_passengers.groupby('Survived')
children_female_survived = get_count(children_female_by_survival.get_group(1))
children_female_victim = get_count(children_female_by_survival.get_group(0))


#####################
#Adults
adults_male_passengers = adult_passengers[ adult_passengers['Sex'] == 'male' ]
adults_female_passengers = adult_passengers[ adult_passengers['Sex'] == 'female' ]

adults_male_by_survival = adults_male_passengers.groupby('Survived')
adults_male_survived = get_count(adults_male_by_survival.get_group(1))
adults_male_victim = get_count(adults_male_by_survival.get_group(0))

adults_female_by_survival = adults_female_passengers.groupby('Survived')
adults_female_survived = get_count(adults_female_by_survival.get_group(1))
adults_female_victim = get_count(adults_female_by_survival.get_group(0))

#####################
#Elderly
elderly_male_passengers = elderly_passengers[ elderly_passengers['Sex'] == 'male' ]
elderly_female_passengers = elderly_passengers[ elderly_passengers['Sex'] == 'female' ]

elderly_male_by_survival = elderly_male_passengers.groupby('Survived')
elderly_male_survived = get_count(elderly_male_by_survival.get_group(1))
elderly_male_victim = get_count(elderly_male_by_survival.get_group(0))

elderly_female_by_survival = elderly_female_passengers.groupby('Survived')
elderly_female_survived = get_count(elderly_female_by_survival.get_group(1))
elderly_female_victim = get_count(elderly_female_by_survival.get_group(0))



#Create the dataframe of survivors and victims for Children, Adults and Elderly, split by gender.
df = pd.DataFrame({'Male Children':   [children_male_survived, children_male_victim], \
                   'Female Children': [children_female_survived, children_female_victim],\
                   'Male Adults':     [adults_male_survived, adults_male_victim],\
                   'Female Adults':   [adults_female_survived, adults_female_victim],\
                   'Male Elderly':    [elderly_male_survived, elderly_male_victim],\
                   'Female Elderly':  [elderly_female_survived, elderly_female_victim] }, index = ["Survived", "Victim"])


#Display the dataframe as an HTML table
display_html(df)


#Start a multiplot plot
fig = plt.figure(figsize=(15,12))

#Children pie plot
## Male children
draw_pie_subplot( children_male_by_survival, 321, 'Children Male', 'SURVIVAL_GRAPH')
##Female Children
draw_pie_subplot( children_female_by_survival, 322, 'Children Female', 'SURVIVAL_GRAPH')


#Adults pie plot
##Male Adults
draw_pie_subplot( adults_male_by_survival, 323, 'Adults Male', 'SURVIVAL_GRAPH')
##Female Adults
draw_pie_subplot( adults_female_by_survival, 324, 'Adults Female', 'SURVIVAL_GRAPH')


#Elderly  pie plot
##Male Elderly
draw_pie_subplot( elderly_male_by_survival, 325, 'Elderly Males', 'SURVIVAL_GRAPH')
##Female Elderly
draw_pie_subplot( elderly_female_by_survival, 326, 'Elderly Females', 'SURVIVAL_GRAPH')

plt.tight_layout()

I find it interesting that even male children had a noticeably lower survival rate than female children.
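To put a number on that observation, the survival rate can be computed directly per generation and gender instead of being read off the pie charts. The snippet below is only a sketch on a few made-up rows that use the same 'Sex', 'Generation' and 'Survived' columns as titanic_all_data; the values are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for titanic_all_data (made-up values, for illustration only)
sample = pd.DataFrame({
    "Sex":        ["male", "male", "female", "female", "male", "female"],
    "Generation": ["Child", "Child", "Child", "Child", "Adult", "Adult"],
    "Survived":   [1, 0, 1, 1, 0, 1],
})

# 'Survived' is 0/1, so its mean within a group is that group's survival rate
survival_rate = (sample.groupby(["Generation", "Sex"])["Survived"]
                       .mean()
                       .unstack("Sex"))
print(survival_rate)
```

Running the same groupby on titanic_all_data would give one row per generation and one column per gender.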

Exploring The Sample From The Class Point of View¶

Population Pyramid of The Classes¶

#The code here follows the same logic for drawing the population pyramid for all the population above.
#But here, we are going to draw three graphs, one for each class
age_bins = range(0,75,5)

first_class_grouped_female_ages = get_count(first_class_female_passengers.groupby( pd.cut( first_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
first_class_grouped_male_ages =   get_count(first_class_male_passengers.groupby( pd.cut( first_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))

second_class_grouped_female_ages = get_count(second_class_female_passengers.groupby( pd.cut( second_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
second_class_grouped_male_ages =   get_count(second_class_male_passengers.groupby( pd.cut( second_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))

third_class_grouped_female_ages = get_count(third_class_female_passengers.groupby( pd.cut( third_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
third_class_grouped_male_ages =   get_count(third_class_male_passengers.groupby( pd.cut( third_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))



fig = plt.figure(figsize=(7,7))

first_max_x = max(first_class_grouped_female_ages.max(), first_class_grouped_male_ages.max())
second_max_x = max(second_class_grouped_female_ages.max(), second_class_grouped_male_ages.max())
third_max_x = max(third_class_grouped_female_ages.max(), third_class_grouped_male_ages.max())

largest_x_value = max( [first_max_x, second_max_x, third_max_x] )

#Draw the population pyramid for the first class
fig.add_subplot(311)
plot_population_pyramid(age_bins, \
                        "1st class Male Population", \
                        first_class_grouped_male_ages, \
                        "1st class Female Population", \
                        first_class_grouped_female_ages, \
                        largest_x_value)

#Draw the population pyramid for the second class
fig.add_subplot(312)
plot_population_pyramid(age_bins, \
                        "2nd class Male Population", \
                        second_class_grouped_male_ages, \
                        "2nd class Female Population", \
                        second_class_grouped_female_ages, \
                        largest_x_value)

#Draw the population pyramid for the third class
fig.add_subplot(313)
plot_population_pyramid(age_bins, \
                        "3rd class Male Population", \
                        third_class_grouped_male_ages, \
                        "3rd class Female Population", \
                        third_class_grouped_female_ages, \
                        largest_x_value)

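I was not able to remove the first 3 empty plots. This is probably not a bug in PyPlot: plot_population_pyramid opens a new figure through plt.subplots on every call, so the three axes created with fig.add_subplot are never drawn on. More investigation is needed.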
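A possible way to get rid of the empty plots is to create all six axes up front and let the plotting helper draw on the axes it is handed, rather than opening a new figure in each call. The sketch below assumes this is acceptable for the notebook; plot_pyramid_on_axes is a hypothetical variant of plot_population_pyramid, and the counts are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pyramid_on_axes(left_ax, right_ax, bins, left_title, left_data,
                         right_title, right_data, largest_x_value):
    # Hypothetical variant of plot_population_pyramid: it draws on the two
    # axes it receives instead of calling plt.subplots itself
    bar_height = bins[1] - bins[0]
    left_ax.barh(bins, left_data, height=bar_height, align="edge")
    right_ax.barh(bins, right_data, height=bar_height, align="edge")
    left_ax.set_title(left_title)
    right_ax.set_title(right_title)
    # Reverse the left x-axis so its bars grow away from the centre
    left_ax.set_xlim(largest_x_value, 0)
    right_ax.set_xlim(0, largest_x_value)

age_bins = list(range(0, 75, 5))

# Made-up counts per age bin, for illustration only
male_counts = np.arange(1, len(age_bins) + 1)
female_counts = male_counts[::-1]

# One figure holding a 3x2 grid: one row per class, male on the left, female on the right
fig, axes = plt.subplots(nrows=3, ncols=2, sharey=True, figsize=(7, 7))
for row, label in enumerate(["1st class", "2nd class", "3rd class"]):
    plot_pyramid_on_axes(axes[row, 0], axes[row, 1], age_bins,
                         label + " Male Population", male_counts,
                         label + " Female Population", female_counts,
                         largest_x_value=len(age_bins))
plt.tight_layout()
```

Because the figure is created once, there are no leftover axes from fig.add_subplot.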

The third class' population pyramid is heavily skewed towards males, with the largest spikes occurring between 15 and 30 years old.

Classes, By Gender Ratio¶

#Create the dataframe of each class' gender make up.
df = pd.DataFrame({'First Class':  [first_class_male_count, first_class_female_count], \
                   'Second Class': [second_class_male_count, second_class_female_count],\
                   'Third Class':  [third_class_male_count, third_class_female_count]},\
                   index = ["Male", "Female"])


#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(10,7))
#Plot the first class
draw_pie_subplot( first_class_passengers.groupby('Sex'), 131, 'First Class', 'GENDER_GRAPH')

#Plot the second class
draw_pie_subplot( second_class_passengers.groupby('Sex'), 132, 'Second Class', 'GENDER_GRAPH')

#Plot the third class
draw_pie_subplot( third_class_passengers.groupby('Sex'), 133, 'Third Class', 'GENDER_GRAPH')

plt.tight_layout()

Classes, By Generation Make Up¶

first_class_unspecified_age = first_class_count - (first_class_children_count + first_class_adult_count + first_class_elderly_count)
second_class_unspecified_age = second_class_count - (second_class_children_count + second_class_adult_count + second_class_elderly_count)
third_class_unspecified_age = third_class_count - (third_class_children_count + third_class_adult_count + third_class_elderly_count)

#Create the dataframe of each class' generation make up (Children, Adults, Elderly and Unspecified Age).
df = pd.DataFrame({'Children':       [first_class_children_count, second_class_children_count, third_class_children_count], \
                   'Adults':         [first_class_adult_count, second_class_adult_count, third_class_adult_count],\
                   'Elderly':        [first_class_elderly_count, second_class_elderly_count, third_class_elderly_count],\
                   'Unspecified Age':[first_class_unspecified_age, second_class_unspecified_age, third_class_unspecified_age]},\
                   index = ["First Class", "Second Class", "Third Class"])


#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(9,12))

#First Class
draw_pie_subplot(first_class_passengers.groupby('Generation'), 311, 'First Class by Generation', 'DEFAULT')

#Second Class
draw_pie_subplot(second_class_passengers.groupby('Generation'), 312, 'Second Class by Generation', 'DEFAULT')

#Third Class
draw_pie_subplot(third_class_passengers.groupby('Generation'), 313, 'Third Class by Generation', 'DEFAULT')


plt.tight_layout()

It might be worth noting that the majority of the passengers of unspecified age come from the third class.
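The number of passengers with a missing age can also be counted per class directly from the 'Age' column, without subtracting the generation counts from the class totals. This is a small sketch on made-up rows; only the column names 'Pclass' and 'Age' are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for titanic_all_data (made-up values, for illustration only)
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age":    [38.0, np.nan, 27.0, np.nan, np.nan, 22.0],
})

# isnull() marks blank ages as True; summing the flags per class counts them
missing_age_by_class = sample["Age"].isnull().groupby(sample["Pclass"]).sum()

# Share of all missing ages that falls in each class
share_of_missing = missing_age_by_class / missing_age_by_class.sum()
print(share_of_missing)
```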

Classes, By Companionship¶

Class Survival¶

#Separate each class passengers by survival
first_class_by_survival = first_class_passengers.groupby("Survived")
second_class_by_survival = second_class_passengers.groupby("Survived")
third_class_by_survival = third_class_passengers.groupby("Survived")

#Get the count of both survivors and victims, by each class
first_class_surviving_count = get_count(first_class_by_survival.get_group(1))
first_class_victims_count = get_count(first_class_by_survival.get_group(0))

second_class_surviving_count = get_count(second_class_by_survival.get_group(1))
second_class_victims_count = get_count(second_class_by_survival.get_group(0))

third_class_surviving_count = get_count(third_class_by_survival.get_group(1))
third_class_victims_count = get_count(third_class_by_survival.get_group(0))


#Create the dataframe of survivors and victims, by class.
df = pd.DataFrame({'Survived': [first_class_surviving_count, second_class_surviving_count, third_class_surviving_count], \
                   'Died':     [first_class_victims_count, second_class_victims_count, third_class_victims_count]},\
                   index = ["First Class", "Second Class", "Third Class"])

#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(9,12))

#First Class
draw_pie_subplot( first_class_passengers.groupby('Survived'), 311, 'First Class by Survival', 'SURVIVAL_GRAPH')

#Second Class
draw_pie_subplot( second_class_passengers.groupby('Survived') , 312, 'Second Class by Survival', 'SURVIVAL_GRAPH')

#Third Class
draw_pie_subplot( third_class_passengers.groupby('Survived') , 313, 'Third Class by Survival', 'SURVIVAL_GRAPH')

plt.tight_layout()

Visually, the better the class, the higher the chance of survival. This is only a superficial reading of the pie charts, though. Other factors may be involved as well, such as the total number of passengers and the gender makeup of each class (the percentage of males in the third class was much higher than in the other two classes).
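One way to look past the pie charts is to put the class size next to the survival rate, and then split the rate by gender so that the different gender makeup of the classes is taken into account. The following is a sketch on made-up rows with the dataset's 'Pclass', 'Sex' and 'Survived' columns, not a result from the data.

```python
import pandas as pd

# Toy rows standing in for titanic_all_data (made-up values, for illustration only)
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "male",
                 "male", "male", "male", "female"],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 1, 0],
})

# Number of passengers and survival rate for each class
by_class = sample.groupby("Pclass")["Survived"].agg(["count", "mean"])

# Survival rate for each class, split by gender
by_class_and_sex = sample.pivot_table(index="Pclass", columns="Sex",
                                      values="Survived", aggfunc="mean")
print(by_class)
print(by_class_and_sex)
```

If the ordering of the classes stays the same within each gender column, the gender makeup alone does not explain the difference between classes.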

Survival: From The Point of View of The Generation And Gender¶

Class' Children Survival, Broken down By Both The Total And The Gender¶

#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_children_by_survival = first_class_children_passengers.groupby("Survived")
first_class_children_male_by_survival = first_class_children_male_passengers.groupby("Survived")
first_class_children_female_by_survival = first_class_children_female_passengers.groupby("Survived")

#####Second Class
second_class_children_by_survival = second_class_children_passengers.groupby("Survived")
second_class_children_male_by_survival = second_class_children_male_passengers.groupby("Survived")
second_class_children_female_by_survival = second_class_children_female_passengers.groupby("Survived")

#####Third Class
third_class_children_by_survival = third_class_children_passengers.groupby("Survived")
third_class_children_male_by_survival = third_class_children_male_passengers.groupby("Survived")
third_class_children_female_by_survival = third_class_children_female_passengers.groupby("Survived")

#Get the count of each category created in the cell above
first_class_children_surviving_count = get_surviving_count( first_class_children_by_survival)
first_class_children_victims_count = get_victim_count(first_class_children_by_survival)
first_class_children_male_surviving_count = get_surviving_count( first_class_children_male_by_survival)
first_class_children_male_victims_count = get_victim_count(first_class_children_male_by_survival)
first_class_children_female_surviving_count = get_surviving_count(first_class_children_female_by_survival)
first_class_children_female_victims_count = get_victim_count(first_class_children_female_by_survival)

second_class_children_surviving_count = get_surviving_count( second_class_children_by_survival)
second_class_children_victims_count = get_victim_count(second_class_children_by_survival)
second_class_children_male_surviving_count = get_surviving_count( second_class_children_male_by_survival)
second_class_children_male_victims_count = get_victim_count(second_class_children_male_by_survival)
second_class_children_female_surviving_count = get_surviving_count(second_class_children_female_by_survival)
second_class_children_female_victims_count = get_victim_count(second_class_children_female_by_survival)

third_class_children_surviving_count = get_surviving_count( third_class_children_by_survival)
third_class_children_victims_count = get_victim_count(third_class_children_by_survival)
third_class_children_male_surviving_count = get_surviving_count( third_class_children_male_by_survival)
third_class_children_male_victims_count = get_victim_count(third_class_children_male_by_survival)
third_class_children_female_surviving_count = get_surviving_count(third_class_children_female_by_survival)
third_class_children_female_victims_count = get_victim_count(third_class_children_female_by_survival)

#Create the dataframe of the classes' children by survival
df = pd.DataFrame({'Survived': [first_class_children_surviving_count, second_class_children_surviving_count, third_class_children_surviving_count], \
                   'Died':     [first_class_children_victims_count, second_class_children_victims_count, third_class_children_victims_count]},\
                   index = ["First Class", "Second Class", "Third Class"])

#Display the dataframe as an HTML table
display_html(df)

#Create the dataframe of the classes' children by gender and survival
df = pd.DataFrame({'First Class':  [first_class_children_male_surviving_count, \
                                    first_class_children_male_victims_count, \
                                    first_class_children_female_surviving_count, \
                                    first_class_children_female_victims_count], \
                   
                   'Second Class': [second_class_children_male_surviving_count, \
                                    second_class_children_male_victims_count, \
                                    second_class_children_female_surviving_count, \
                                    second_class_children_female_victims_count],\
                   
                   'Third Class':  [third_class_children_male_surviving_count, \
                                    third_class_children_male_victims_count, \
                                    third_class_children_female_surviving_count, \
                                    third_class_children_female_victims_count]},\
                  
                   index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )

#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(15,12))

#First Class
draw_pie_subplot( first_class_children_by_survival , 331, 'First Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_children_male_by_survival , 332, 'First Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_children_female_by_survival , 333, 'First Class Female Children', 'SURVIVAL_GRAPH')

#Second Class
draw_pie_subplot( second_class_children_by_survival , 334, 'Second Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_children_male_by_survival , 335, 'Second Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_children_female_by_survival , 336, 'Second Class Female Children', 'SURVIVAL_GRAPH')

#Third Class
draw_pie_subplot( third_class_children_by_survival , 337, 'Third Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_children_male_by_survival , 338, 'Third Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_children_female_by_survival , 339, 'Third Class Female Children', 'SURVIVAL_GRAPH')
plt.tight_layout()

Total Children Survival, By Class¶

children_by_class = children_passengers.groupby('Pclass')

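#Filter by survival first, then group by class; this avoids 'Pclass' ending up as both an index level and a column
children_survived_by_class = children_passengers[ children_passengers['Survived'] == 1 ].groupby('Pclass')
children_drowned_by_class = children_passengers[ children_passengers['Survived'] == 0 ].groupby('Pclass')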

fig = plt.figure(figsize=(9,5))

draw_pie_subplot( children_survived_by_class, 121, 'Surviving Children By Class', 'DEFAULT')

draw_pie_subplot( children_drowned_by_class, 122, 'Victim Children By Class', 'DEFAULT')

plt.tight_layout()

Class' Adult Survival, Broken Down By Both The Total And The Gender¶

#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_adult_by_survival = first_class_adult_passengers.groupby("Survived")
first_class_adult_male_by_survival = first_class_adult_male_passengers.groupby("Survived")
first_class_adult_female_by_survival = first_class_adult_female_passengers.groupby("Survived")

#####Second Class
second_class_adult_by_survival = second_class_adult_passengers.groupby("Survived")
second_class_adult_male_by_survival = second_class_adult_male_passengers.groupby("Survived")
second_class_adult_female_by_survival = second_class_adult_female_passengers.groupby("Survived")

#####Third Class
third_class_adult_by_survival = third_class_adult_passengers.groupby("Survived")
third_class_adult_male_by_survival = third_class_adult_male_passengers.groupby("Survived")
third_class_adult_female_by_survival = third_class_adult_female_passengers.groupby("Survived")

#Get the count of each category created in the cell above
first_class_adult_surviving_count = get_surviving_count( first_class_adult_by_survival)
first_class_adult_victims_count = get_victim_count(first_class_adult_by_survival)
first_class_adult_male_surviving_count = get_surviving_count( first_class_adult_male_by_survival)
first_class_adult_male_victims_count = get_victim_count(first_class_adult_male_by_survival)
first_class_adult_female_surviving_count = get_surviving_count(first_class_adult_female_by_survival)
first_class_adult_female_victims_count = get_victim_count(first_class_adult_female_by_survival)

second_class_adult_surviving_count = get_surviving_count( second_class_adult_by_survival)
second_class_adult_victims_count = get_victim_count(second_class_adult_by_survival)
second_class_adult_male_surviving_count = get_surviving_count( second_class_adult_male_by_survival)
second_class_adult_male_victims_count = get_victim_count(second_class_adult_male_by_survival)
second_class_adult_female_surviving_count = get_surviving_count(second_class_adult_female_by_survival)
second_class_adult_female_victims_count = get_victim_count(second_class_adult_female_by_survival)

third_class_adult_surviving_count = get_surviving_count( third_class_adult_by_survival)
third_class_adult_victims_count = get_victim_count(third_class_adult_by_survival)
third_class_adult_male_surviving_count = get_surviving_count( third_class_adult_male_by_survival)
third_class_adult_male_victims_count = get_victim_count(third_class_adult_male_by_survival)
third_class_adult_female_surviving_count = get_surviving_count(third_class_adult_female_by_survival)
third_class_adult_female_victims_count = get_victim_count(third_class_adult_female_by_survival)

#Create the dataframe of the classes' adult by survival
df = pd.DataFrame({'Survived': [first_class_adult_surviving_count, second_class_adult_surviving_count, third_class_adult_surviving_count], \
                   'Died':     [first_class_adult_victims_count, second_class_adult_victims_count, third_class_adult_victims_count]},\
                   index = ["First Class", "Second Class", "Third Class"])

#Display the dataframe as an HTML table
display_html(df)

#Create the dataframe of the classes' adults by gender and survival
df = pd.DataFrame({'First Class':  [first_class_adult_male_surviving_count, \
                                    first_class_adult_male_victims_count, \
                                    first_class_adult_female_surviving_count, \
                                    first_class_adult_female_victims_count], \
                   
                   'Second Class': [second_class_adult_male_surviving_count, \
                                    second_class_adult_male_victims_count, \
                                    second_class_adult_female_surviving_count, \
                                    second_class_adult_female_victims_count],\
                   
                   'Third Class':  [third_class_adult_male_surviving_count, \
                                    third_class_adult_male_victims_count, \
                                    third_class_adult_female_surviving_count, \
                                    third_class_adult_female_victims_count]},\
                   index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )

#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(15,12))

#First Class
draw_pie_subplot( first_class_adult_by_survival , 331, 'First Class Adult, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_adult_male_by_survival , 332, 'First Class Male Adult', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_adult_female_by_survival , 333, 'First Class Female Adult', 'SURVIVAL_GRAPH')

#Second Class
draw_pie_subplot( second_class_adult_by_survival , 334, 'Second Class Adult, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_adult_male_by_survival , 335, 'Second Class Male Adult', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_adult_female_by_survival , 336, 'Second Class Female Adult', 'SURVIVAL_GRAPH')

#Third Class
draw_pie_subplot( third_class_adult_by_survival , 337, 'Third Class Adult, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_adult_male_by_survival , 338, 'Third Class Male Adult', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_adult_female_by_survival , 339, 'Third Class Female Adult', 'SURVIVAL_GRAPH')

plt.tight_layout()

Second Class males, to my surprise, had the worst luck. I was expecting the third class males to fare the worst.
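A comparison like this one is easier to check with row-normalised counts than with nine pie charts. The sketch below uses pd.crosstab on made-up adult rows; the values are invented purely so the example runs and say nothing about the real passengers.

```python
import pandas as pd

# Toy rows standing in for adult_passengers (made-up values, for illustration only)
adults = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 1],
    "Sex":      ["male", "male", "male", "male", "male", "male", "male", "female"],
    "Survived": [1, 0, 0, 0, 1, 0, 0, 1],
})

adult_males = adults[adults["Sex"] == "male"]

# normalize="index" turns each row into fractions that sum to 1,
# so column 1 holds the survival rate of adult males in that class
rates = pd.crosstab(adult_males["Pclass"], adult_males["Survived"],
                    normalize="index")

# Class whose adult males have the lowest survival rate (in the toy data)
lowest_class = rates[1].idxmin()
print(rates)
print(lowest_class)
```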

Total Adults Survival, By Class¶

adult_by_class = adult_passengers.groupby('Pclass')

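#Filter by survival first, then group by class; this avoids 'Pclass' ending up as both an index level and a column
adult_survived_by_class = adult_passengers[ adult_passengers['Survived'] == 1 ].groupby('Pclass')
adult_drowned_by_class = adult_passengers[ adult_passengers['Survived'] == 0 ].groupby('Pclass')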

fig = plt.figure(figsize=(9,5))

draw_pie_subplot( adult_survived_by_class, 121, 'Surviving Adult By Class', 'DEFAULT')

draw_pie_subplot( adult_drowned_by_class, 122, 'Victim Adult By Class', 'DEFAULT')

plt.tight_layout()

Class' Elderly Survival, Broken Down By Both The Total And The Gender¶

#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_elderly_by_survival = first_class_elderly_passengers.groupby("Survived")
first_class_elderly_male_by_survival = first_class_elderly_male_passengers.groupby("Survived")
first_class_elderly_female_by_survival = first_class_elderly_female_passengers.groupby("Survived")

#####Second Class
second_class_elderly_by_survival = second_class_elderly_passengers.groupby("Survived")
second_class_elderly_male_by_survival = second_class_elderly_male_passengers.groupby("Survived")
second_class_elderly_female_by_survival = second_class_elderly_female_passengers.groupby("Survived")

#####Third Class
third_class_elderly_by_survival = third_class_elderly_passengers.groupby("Survived")
third_class_elderly_male_by_survival = third_class_elderly_male_passengers.groupby("Survived")
third_class_elderly_female_by_survival = third_class_elderly_female_passengers.groupby("Survived")

#Get the count of each category created in the cell above
first_class_elderly_surviving_count = get_surviving_count( first_class_elderly_by_survival)
first_class_elderly_victims_count = get_victim_count(first_class_elderly_by_survival)
first_class_elderly_male_surviving_count = get_surviving_count( first_class_elderly_male_by_survival)
first_class_elderly_male_victims_count = get_victim_count(first_class_elderly_male_by_survival)
first_class_elderly_female_surviving_count = get_surviving_count(first_class_elderly_female_by_survival)
first_class_elderly_female_victims_count = get_victim_count(first_class_elderly_female_by_survival)

second_class_elderly_surviving_count = get_surviving_count( second_class_elderly_by_survival)
second_class_elderly_victims_count = get_victim_count(second_class_elderly_by_survival)
second_class_elderly_male_surviving_count = get_surviving_count( second_class_elderly_male_by_survival)
second_class_elderly_male_victims_count = get_victim_count(second_class_elderly_male_by_survival)
second_class_elderly_female_surviving_count = get_surviving_count(second_class_elderly_female_by_survival)
second_class_elderly_female_victims_count = get_victim_count(second_class_elderly_female_by_survival)

third_class_elderly_surviving_count = get_surviving_count( third_class_elderly_by_survival)
third_class_elderly_victims_count = get_victim_count(third_class_elderly_by_survival)
third_class_elderly_male_surviving_count = get_surviving_count( third_class_elderly_male_by_survival)
third_class_elderly_male_victims_count = get_victim_count(third_class_elderly_male_by_survival)
third_class_elderly_female_surviving_count = get_surviving_count(third_class_elderly_female_by_survival)
third_class_elderly_female_victims_count = get_victim_count(third_class_elderly_female_by_survival)

#Create the dataframe of the classes' elderly by survival
df = pd.DataFrame({'Survived': [first_class_elderly_surviving_count, second_class_elderly_surviving_count, third_class_elderly_surviving_count], \
                   'Died':     [first_class_elderly_victims_count, second_class_elderly_victims_count, third_class_elderly_victims_count]},\
                   index = ["First Class", "Second Class", "Third Class"])

#Display the dataframe as an HTML table
display_html(df)

#Create the dataframe of the classes' elderly by gender and survival
df = pd.DataFrame({'First Class':  [first_class_elderly_male_surviving_count, \
                                    first_class_elderly_male_victims_count, \
                                    first_class_elderly_female_surviving_count, \
                                    first_class_elderly_female_victims_count], \
                   
                   'Second Class': [second_class_elderly_male_surviving_count, \
                                    second_class_elderly_male_victims_count, \
                                    second_class_elderly_female_surviving_count, \
                                    second_class_elderly_female_victims_count],\
                   
                   'Third Class':  [third_class_elderly_male_surviving_count, \
                                    third_class_elderly_male_victims_count, \
                                    third_class_elderly_female_surviving_count, \
                                    third_class_elderly_female_victims_count]},\
                   index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )

#Display the dataframe as an HTML table
display_html(df)


fig = plt.figure(figsize=(15,12))
#First Class
draw_pie_subplot( first_class_elderly_by_survival , 331, 'First Class Elderly, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_elderly_male_by_survival , 332, 'First Class Male Elderly', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_elderly_female_by_survival , 333, 'First Class Female Elderly', 'SURVIVAL_GRAPH')

#Second Class
draw_pie_subplot( second_class_elderly_by_survival , 334, 'Second Class Elderly, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_elderly_male_by_survival , 335, 'Second Class Male Elderly', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_elderly_female_by_survival , 336, 'Second Class Female Elderly', 'SURVIVAL_GRAPH')

#Third Class
draw_pie_subplot( third_class_elderly_by_survival , 337, 'Third Class Elderly, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_elderly_male_by_survival , 338, 'Third Class Male Elderly', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_elderly_female_by_survival , 339, 'Third Class Female Elderly', 'SURVIVAL_GRAPH')

plt.tight_layout()

Total Elderly Survival, By Class¶

elderly_by_class = elderly_passengers.groupby('Pclass')

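#Filter by survival first, then group by class; this avoids 'Pclass' ending up as both an index level and a column
elderly_survived_by_class = elderly_passengers[ elderly_passengers['Survived'] == 1 ].groupby('Pclass')
elderly_drowned_by_class = elderly_passengers[ elderly_passengers['Survived'] == 0 ].groupby('Pclass')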

fig = plt.figure(figsize=(9,5))

draw_pie_subplot( elderly_survived_by_class, 121, 'Surviving Elderly By Class', 'DEFAULT')

draw_pie_subplot( elderly_drowned_by_class, 122, 'Victim Elderly By Class', 'DEFAULT')

plt.tight_layout()

Survival¶

Because of the tragedy, I feel that we should look again at the survival of the individual. Some of the data presented here may be redundant with what has already been presented above, but I think we should still re-examine survival from different angles.

Survivors vs Victims age Pyramid¶

age_bins = range(0,75,5)

survivors_ages = get_count(surviving_passengers.groupby( pd.cut( surviving_passengers["Age"], np.arange(0, 80, 5) ) ))
victims_ages = get_count(victim_passengers.groupby( pd.cut( victim_passengers["Age"], np.arange(0, 80, 5) ) ))

largest_x_value = max(survivors_ages.max(), victims_ages.max() )

plot_population_pyramid(age_bins, "Survivors", survivors_ages, "Victims", victims_ages, largest_x_value)

Male Survivors vs Male Victims Age Pyramid¶

surviving_males = surviving_passengers[ surviving_passengers['Sex'] == 'male']
victim_males = victim_passengers[ victim_passengers['Sex'] == 'male']

age_bins = range(0,75,5)

surviving_males_ages = get_count(surviving_males.groupby( pd.cut( surviving_males["Age"], np.arange(0, 80, 5) ) ))
victim_males_ages = get_count(victim_males.groupby( pd.cut( victim_males["Age"], np.arange(0, 80, 5) ) ))

largest_x_value = max(surviving_males_ages.max(), victim_males_ages.max() )

plot_population_pyramid(age_bins, "Surviving Males", surviving_males_ages, "Victim Males", victim_males_ages, largest_x_value)

Female Survivors vs Female Victims Age Pyramid¶

surviving_females = surviving_passengers[ surviving_passengers['Sex'] == 'female']
victim_females = victim_passengers[ victim_passengers['Sex'] == 'female']

age_bins = range(0,75,5)

surviving_females_ages = get_count(surviving_females.groupby( pd.cut( surviving_females["Age"], np.arange(0, 80, 5) ) ))
victim_females_ages = get_count(victim_females.groupby( pd.cut( victim_females["Age"], np.arange(0, 80, 5) ) ))

largest_x_value = max(surviving_females_ages.max(), victim_females_ages.max() )

plot_population_pyramid(age_bins, \
                        "Surviving Females", \
                        surviving_females_ages, \
                        "Victim Females", \
                        victim_females_ages, \
                        largest_x_value)

How Were The Boats Divided Among Different Classes ?¶

fig = plt.figure()


fig.add_subplot(121)
titanic_all_data['Survived'].groupby(titanic_all_data['Pclass']).sum().plot.pie(label = 'Total Passengers Saved, by Class', \
                                                                                autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(122)
titanic_all_data['Survived'].groupby(titanic_all_data['Pclass']).sum().plot.bar(label = 'Total Passengers Saved, by Class')

plt.tight_layout()

It looks much fairer than I had expected. Of course, one can argue that the majority of women and children of the upper classes had already been saved, and that this is why the pie chart looks evenly distributed; but I would still say that it is not as bad as I thought before I plotted these graphs.
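The pie chart shows how the survivors are split among the classes, but it does not show how many passengers each class had to begin with. A rough check is to place each class' share of the survivors next to its share of all passengers. The snippet is a sketch on made-up rows using the 'Pclass' and 'Survived' columns; the numbers are not from the dataset.

```python
import pandas as pd

# Toy rows standing in for titanic_all_data (made-up values, for illustration only)
sample = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0],
})

# Same quantity as plotted in the pie chart: survivors per class
survivors_by_class = sample.groupby("Pclass")["Survived"].sum()

comparison = pd.DataFrame({
    # Fraction of all survivors that belongs to each class
    "share_of_survivors": survivors_by_class / survivors_by_class.sum(),
    # Fraction of all passengers that belongs to each class
    "share_of_passengers": sample["Pclass"].value_counts(normalize=True).sort_index(),
})
print(comparison)
```

A class whose share of survivors is well below its share of passengers was saved less often than an even split would suggest, even if its slice of the pie looks large.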

Survivors, By Generation, General¶

fig = plt.figure(figsize=(9,5))


fig.add_subplot(121)
titanic_all_data['Survived'].groupby(titanic_all_data['Generation']).sum().plot.pie(label = 'Total Passengers Saved, by Generation',\
                                                                                    autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(122)
titanic_all_data['Survived'].groupby(titanic_all_data['Generation']).sum().plot.bar(label = 'Total Passengers Saved, by Generation')

plt.tight_layout()

Victims, By Generation, General¶

fig = plt.figure(figsize=(9,5))

fig.add_subplot(121)
victim_passengers['Survived'].groupby(victim_passengers['Generation']).count().plot.pie(label = 'Victims, by Generation', \
                                                                                        autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(122)
victim_passengers['Survived'].groupby(victim_passengers['Generation']).count().plot.bar(label = 'Victims, by Generation')

plt.tight_layout()

Survivors, By Generation and Class¶

fig = plt.figure(figsize=(15,6))

fig.add_subplot(131)
#First class plot
first_class_survivors['Survived'].groupby(first_class_survivors['Generation']).count().plot.pie(label = 'First Class', \
                                                                                                autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(132)
#Second class plot
second_class_survivors['Survived'].groupby(second_class_survivors['Generation']).count().plot.pie(label = 'Second Class', \
                                                                                                  autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(133)
#Third class plot
third_class_survivors['Survived'].groupby(third_class_survivors['Generation']).count().plot.pie(label = 'Third Class', \
                                                                                                autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

Victims, By Generation and Class¶

fig = plt.figure(figsize=(15,6))

fig.add_subplot(131)
#First class plot
first_class_victims['Survived'].groupby(first_class_victims['Generation']).count().plot.pie(label = 'First Class', \
                                                                                            autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(132)
#Second class plot
second_class_victims['Survived'].groupby(second_class_victims['Generation']).count().plot.pie(label = 'Second Class', \
                                                                                              autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(133)
#Third class plot
third_class_victims['Survived'].groupby(third_class_victims['Generation']).count().plot.pie(label = 'Third Class', \
                                                                                            autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

Travel Companionship¶

Now, after gaining some intuition about the data, one final parameter is left to explore so we can start answering our first question: companionship.

Solo Travelers vs Group Travelers, Total¶

solo_passengers_count =  get_count(passengers_by_companionship.get_group(True))
group_passengers_count = get_count(passengers_by_companionship.get_group(False))


df = pd.DataFrame({'Count': [solo_passengers_count, group_passengers_count]}, index = ['Solo', 'Group'])

display_html(df)

get_count(titanic_all_data.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
                                                                   label = "Companionship", \
                                                                   autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

Travel Companionship, by Class Make Up¶

group_travelers_by_class = passengers_by_class.apply(lambda t : t[t['isSolo'] == False]).groupby('Pclass')
solo_travelers_by_class = passengers_by_class.apply(lambda t : t[t['isSolo'] == True]).groupby('Pclass')


fig = plt.figure(figsize=(9,5))

draw_pie_subplot( group_travelers_by_class, 121, 'Group Travelers By Class', 'DEFAULT')
draw_pie_subplot( solo_travelers_by_class, 122, 'Solo Travelers By Class', 'DEFAULT')

plt.tight_layout()

Solo Travelers vs Group Travelers, By Class¶

solo_passengers_count =  get_count(passengers_by_companionship.get_group(True))
group_passengers_count = get_count(passengers_by_companionship.get_group(False))

first_class_solo_passengers_count =  get_count(first_class_passengers[ first_class_passengers['isSolo'] == True])
first_class_group_passengers_count =  get_count(first_class_passengers[ first_class_passengers['isSolo'] == False])

second_class_solo_passengers_count =  get_count(second_class_passengers[ second_class_passengers['isSolo'] == True])
second_class_group_passengers_count =  get_count(second_class_passengers[ second_class_passengers['isSolo'] == False])

third_class_solo_passengers_count =  get_count(third_class_passengers[ third_class_passengers['isSolo'] == True])
third_class_group_passengers_count =  get_count(third_class_passengers[ third_class_passengers['isSolo'] == False])

df = pd.DataFrame({'First Class':  [first_class_solo_passengers_count, first_class_group_passengers_count], \
                   'Second Class': [second_class_solo_passengers_count, second_class_group_passengers_count], \
                   'Third Class':  [third_class_solo_passengers_count, third_class_group_passengers_count]}, \
                  index = ['Solo', 'Group'])

display_html(df)

fig = plt.figure(figsize=(12,5))

fig.add_subplot(131)
get_count(first_class_passengers.groupby('isSolo')).plot.pie(label = 'First Class', \
                                                                         labels=['Group', 'Solo'],\
                                                                         autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(132)
get_count(second_class_passengers.groupby('isSolo')).plot.pie(label = 'Second Class', \
                                                                         labels=['Group', 'Solo'],\
                                                                          autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(133)
get_count(third_class_passengers.groupby('isSolo')).plot.pie(label = 'Third Class', \
                                                                         labels=['Group', 'Solo'],\
                                                                         autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

Solo Travelers Age Pyramid¶

age_bins = range(16,76,4)

solo_female_passengers = female_passengers[ female_passengers['isSolo'] == True ]
solo_male_passengers = male_passengers[ male_passengers['isSolo'] == True ]

grouped_female_ages = get_count(solo_female_passengers.groupby( pd.cut( solo_female_passengers["Age"], np.arange(16, 80, 4) ) ))
grouped_male_ages =   get_count(solo_male_passengers.groupby( pd.cut( solo_male_passengers["Age"], np.arange(16, 80, 4) ) ))

largest_x_value = max(grouped_female_ages.max(), grouped_male_ages.max())

plot_population_pyramid(age_bins, \
                        "Solo Male Population", \
                        grouped_male_ages, \
                        "Solo Female Population", \
                        grouped_female_ages, \
                        largest_x_value)
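The binning step above relies on pd.cut, which assigns each age to a right-closed interval between consecutive edges, so n edges yield n - 1 bins. A minimal sketch on made-up ages (toy_ages and bin_edges are illustrative names, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration only (not the Titanic dataset)
toy_ages = pd.Series([17, 18, 21, 25, 30, 33])

# Bin edges every 4 years; pd.cut builds right-closed intervals such as (16, 20]
bin_edges = np.arange(16, 40, 4)
age_groups = pd.cut(toy_ages, bin_edges)

# Count how many ages fall into each interval, in bin order
counts_per_bin = age_groups.value_counts().sort_index()

print(counts_per_bin)
```

The same relation explains the two ranges used above: np.arange(16, 80, 4) holds 16 edges and therefore 15 bins, which matches the 15 entries of range(16, 76, 4) passed as age_bins.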

Women Traveling Companionship Pattern¶

I got curious about solo women traveling on the ship, since the accident happened during a time when the culture was more conservative than nowadays.

get_count(passengers_by_gender.get_group('female').groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
                                                                                           label = " Women Companionship", \
                                                                                           autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

group_women_count = get_count(passengers_by_gender.get_group('female').groupby('isSolo'))[0]
solo_women_count = get_count(passengers_by_gender.get_group('female').groupby('isSolo'))[1]

print "Women traveling alone:      " + str(solo_women_count) + "  Percentage: " + str ( format((float(solo_women_count) / female_count )*100.0, '.2f')) + "%"
print "Women traveling in a group: "+ str(group_women_count) + "  Percentage: " + str( format((float(group_women_count) / female_count )*100.0, '.2f')) + "%"

Women traveling alone:      118  Percentage: 37.58%
Women traveling in a group: 196  Percentage: 62.42%

Women Traveling Companionship, by Class¶

fig = plt.figure(figsize=(12,5))

fig.add_subplot(131)
get_count(first_class_female_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
                                                                            label = 'First Class', \
                                                                            autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(132)
get_count(second_class_female_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
                                                                             label = 'Second Class', \
                                                                             autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(133)
get_count(third_class_female_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
                                                                            label = 'Third Class', \
                                                                            autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

Solo Women Traveling, by Class¶

first_class_solo_women_passengers =  solo_female_passengers[ solo_female_passengers['Pclass'] == 1 ]
second_class_solo_women_passengers =  solo_female_passengers[ solo_female_passengers['Pclass'] == 2 ]
third_class_solo_women_passengers =  solo_female_passengers[ solo_female_passengers['Pclass'] == 3 ]

fig = plt.figure(figsize=(7,5))
fig.add_subplot(121)
get_count(solo_female_passengers.groupby('Pclass')).plot.pie(label = 'Solo Women by Class',\
                                                                         labels = ['1st Class', '2nd Class', '3rd Class'],\
                                                                                autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(122)
get_count(solo_female_passengers.groupby('Pclass')).plot.bar()

plt.tight_layout()

Visually, the class does not appear to have an effect on the women's traveling companionship pattern.

Men Traveling Companionship Pattern¶

Now, let us have a closer look at the men's traveling companionship, just to make the picture complete:

get_count(passengers_by_gender.get_group('male').groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
                                                                                         label = " Men Companionship", \
                                                                                         autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

group_men_count = get_count(passengers_by_gender.get_group('male').groupby('isSolo'))[0]
solo_men_count = get_count(passengers_by_gender.get_group('male').groupby('isSolo'))[1]

print "Men traveling alone:      " + str(solo_men_count) + "  Percentage: " + str ( format((float(solo_men_count) / male_count )*100.0, '.2f')) + "%"
print "Men traveling in a group: "+ str(group_men_count) + "  Percentage: " + str( format((float(group_men_count) / male_count )*100.0, '.2f')) + "%"

Men traveling alone:      403  Percentage: 69.84%
Men traveling in a group: 174  Percentage: 30.16%

Men Traveling Companionship, by Class¶

fig = plt.figure(figsize=(12,5))

fig.add_subplot(131)
get_count(first_class_male_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
                                                                              label = 'First Class', \
                                                                              autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(132)
get_count(second_class_male_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
                                                                               label = 'Second Class', \
                                                                               autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(133)
get_count(third_class_male_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
                                                                              label = 'Third Class', \
                                                                              autopct='%1.1f%%')
plt.axis('equal')

plt.tight_layout()

Solo Men Traveling, by Class¶

first_class_solo_men_passengers =  solo_male_passengers[ solo_male_passengers['Pclass'] == 1 ]
second_class_solo_men_passengers =  solo_male_passengers[ solo_male_passengers['Pclass'] == 2 ]
third_class_solo_men_passengers =  solo_male_passengers[ solo_male_passengers['Pclass'] == 3 ]

fig = plt.figure()
fig.add_subplot(121)
get_count(solo_male_passengers.groupby('Pclass')).plot.pie(label = 'Solo Men by Class', \
                                                                       autopct='%1.1f%%')
plt.axis('equal')

fig.add_subplot(122)
get_count(solo_male_passengers.groupby('Pclass')).plot.bar()

plt.tight_layout()

Survival By Companionship¶

Now, down to the visualization of our first question: what do the chances of survival look like when seen through the lens of companionship?

fig = plt.figure(figsize=(9,6))

draw_pie_subplot( surviving_passengers.groupby(surviving_passengers['isSolo']) , 121, 'Solo Survival', 'SURVIVAL_GRAPH')

draw_pie_subplot( victim_passengers.groupby(victim_passengers['isSolo']) , 122, 'Group Survival', 'SURVIVAL_GRAPH')

plt.tight_layout()

Analysis¶

Q1: What Is the Effect of Traveling Companionship on the Survival of Grown-Up Passengers?¶

Descriptive Statistics¶

Total Companions, For All Passengers¶

titanic_all_data['TotalCompanions'].describe()

count    891.000000
mean       0.904602
std        1.613459
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max       10.000000
Name: TotalCompanions, dtype: float64

Total Companions, For Group Travelers¶

group_passengers = passengers_by_companionship.get_group(False)
solo_passengers = passengers_by_companionship.get_group(True)

group_passengers['TotalCompanions'].describe()

count    370.000000
mean       2.178378
std        1.869906
min        0.000000
25%        1.000000
50%        2.000000
75%        2.000000
max       10.000000
Name: TotalCompanions, dtype: float64

It is worth noting here that the minimum total number of companions is 0, even though we are examining the group passengers, because of the choice to include children with the group travelers.

Statistics for the solo travelers will not be computed, since every value except the count is equal to zero.

group_passengers['Survived'].describe()

count    370.000000
mean       0.505405
std        0.500648
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64

solo_passengers['Survived'].describe()

count    521.000000
mean       0.297505
std        0.457600
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64
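Since Survived only takes the values 0 and 1, its mean is the fraction of survivors, so multiplying it by the count recovers the number of survivors in each category. A small sketch of that arithmetic, using only the values printed above (the variable names are illustrative):

```python
# Summary values copied from the two describe() outputs above
group_count, group_mean = 370, 0.505405
solo_count, solo_mean = 521, 0.297505

# The mean of a 0/1 column is the fraction of ones, so count * mean gives the number of survivors
group_survivors = int(round(group_count * group_mean))
solo_survivors = int(round(solo_count * solo_mean))

# Whoever did not survive is a victim
group_victims = group_count - group_survivors
solo_victims = solo_count - solo_survivors

print("Group: %d survivors, %d victims" % (group_survivors, group_victims))
print("Solo:  %d survivors, %d victims" % (solo_survivors, solo_victims))
```

These four counts are the observed frequencies that a contingency table of companionship against survival is built from.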

Hypothesis Testing:¶

The contingency table will be built so we can examine the relationship and perform the hypothesis testing

#Observed frequencies: survivors and victims within each companionship category
#Group
group_passengers_survived_count = get_count(group_passengers.groupby('Survived').get_group(1))
group_passengers_victims_count = get_count(group_passengers.groupby('Survived').get_group(0))

#Solo
solo_passengers_survived_count = get_count(solo_passengers.groupby('Survived').get_group(1))
solo_passengers_victims_count = get_count(solo_passengers.groupby('Survived').get_group(0))

# passengers_by_survival
survivors_count_list = [group_passengers_survived_count, solo_passengers_survived_count,  total_survivors_count]
victims_count_list = [group_passengers_victims_count, solo_passengers_victims_count, total_victims_count]
total_count_list = [get_count(group_passengers), get_count(solo_passengers), sample_size]

#Had to pass the - redundant - column argument to skip Pandas default ordering
contingency_table = pd.DataFrame({'Survived': survivors_count_list , \
                                  'Victims': victims_count_list , \
                                  'Total, Group': total_count_list},\
                                   index = ['Group', 'Solo', 'Total, State'], columns = [ 'Survived', 'Victims', 'Total, Group'])

display_html(contingency_table)

Now, all is set to calculate the Χ²

chi_squared, p, degrees_of_freedom, expected_frequency = scipy.stats.chi2_contingency( contingency_table )

print "Chi Squared: ", chi_squared
print "p value: ", p
print "Degrees of Freedom", degrees_of_freedom
print "Expected Frequency for The Group Passengers:", expected_frequency[0]
print "Expected Frequency for The Solo Passengers:", expected_frequency[1]

Chi Squared:  39.539412799
p value:  5.38959286299e-08
Degrees of Freedom 4
Expected Frequency for The Group Passengers: [ 142.02020202  227.97979798  370.        ]
Expected Frequency for The Solo Passengers: [ 199.97979798  321.02020202  521.        ]
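One caveat on the call above: scipy.stats.chi2_contingency expects only the observed cell counts, and it derives the degrees of freedom as (rows - 1) × (columns - 1). Because the table passed in also contains the total row and the total column, it is treated as a 3×3 table, which is where the 4 degrees of freedom come from. The following is a minimal sketch of the same test on the 2×2 core. It assumes the four counts implied by the describe() outputs above (187 of the 370 group travelers and 155 of the 521 solo travelers survived), and it passes correction=False so that the statistic is the plain Pearson Χ² without Yates' continuity correction.

```python
import numpy as np
import scipy.stats

# Observed counts only, without the marginal totals.
# Assumption: counts reconstructed from the describe() outputs above (count * mean).
#                     Survived  Victims
observed = np.array([[187,      183],    # Group
                     [155,      366]])   # Solo

chi_squared, p, degrees_of_freedom, expected_frequency = scipy.stats.chi2_contingency(
    observed, correction=False)

print("Chi Squared: %.4f" % chi_squared)
print("p value: %.3g" % p)
print("Degrees of Freedom: %d" % degrees_of_freedom)
```

The statistic should match the value printed above, since the marginal cells add nothing to the sum; only the degrees of freedom, and hence the p value, change.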

The statistical significance here is very high, but I think this can be affected by children: there is probably a good portion of children traveling with their families, which can lead to a higher than usual presence of women and children among the group passengers and a lower than usual presence of women among the solo passengers (by our definition of a solo traveler, a child cannot be solo even if Parch and SibSp are equal to zero). Let us examine how many women and children are present in the groups.

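#Note: this filter runs over titanic_all_data, so it selects every woman and child in the dataset,
#not only the ones that belong to group_passengers
women_and_children_group_passengers = titanic_all_data[ (titanic_all_data['Sex'] == 'female') | \
                                                       (titanic_all_data['Generation'] == 'Child')]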
women_and_children_group_passengers_count = get_count(women_and_children_group_passengers)

print "Number of women and children in the group travelers: ", women_and_children_group_passengers_count
print "Percentage of women and children in the group traveler: ",\
(float(women_and_children_group_passengers_count)/ get_count(group_passengers))*100.0 , "%"

Number of women and children in the group travelers:  365
Percentage of women and children in the group traveler:  98.6486486486 %

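I was a bit shocked at first by such a high percentage: it would mean that grown-up men (males aged 17 and above) constitute less than 2% of the passengers traveling with their immediate family. On second thought, part of this can make sense, because there may be only one grown-up man within a family made of both parents and their children. The figure should be read with caution, though: the filter above is applied to titanic_all_data rather than to group_passengers, so the 365 counts the women and children in the whole dataset, and the earlier breakdown found 174 males among the 370 group travelers.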

There could be other instances, of course, besides a family made of both parents and children, that bring more grown-up men into the count, such as adult brothers traveling together or a grown-up son traveling with his elderly father.

I will try to refine the hypothesis, by excluding children from the group data, and just compare grown up passengers (adults and elderly) from both groups.

Note: The grown-up variables below exclude passengers whose age is unknown; they contain only passengers that we know for sure are adults or elderly.

passengers_by_generation = titanic_all_data.groupby('Generation')

#The output will be a dictionary, with keys Adult and Elderly
passengers_by_grown_up = {key: value for key, value in passengers_by_generation if key in ['Adult', 'Elderly']}

#Now, expand the values of the dictionary into a dataframe
grown_up_passengers = pd.concat([grown_up_passengers_values for keys, grown_up_passengers_values in passengers_by_grown_up.items()])

#Get the count of survival. This will be used as our expected count for the Chi square computation
grown_up_survival_count = get_count(grown_up_passengers.groupby('Survived'))

#The append function did not behave as I expected, ie it did not perform its action in place nor did it have an inplace parameter. 
#I had to take the whole thing and make it equal to the Series in question
#Also, the index = [2] is done to prevent index duplicates, otherwise the total will be added with index 0. It would not really 
#affect the calculation, but it affected when I wanted to rename the indices for readability
grown_up_survival_count = grown_up_survival_count.append( pd.Series(len(grown_up_passengers), index = [2] )) 

#Separate passengers by companionship
group_grown_up_passengers = grown_up_passengers.groupby('isSolo').get_group(0)
solo_grown_up_passengers = grown_up_passengers.groupby('isSolo').get_group(1)

#Get the survival for each group and solo grown up passengers
group_grown_up_passengers_survival_count = get_count(group_grown_up_passengers.groupby('Survived'))
solo_grown_up_passengers_survival_count = get_count(solo_grown_up_passengers.groupby('Survived'))

#Append the total to each series
group_grown_up_passengers_survival_count = group_grown_up_passengers_survival_count.append( pd.Series(len(group_grown_up_passengers), index = [2] )) 
solo_grown_up_passengers_survival_count = solo_grown_up_passengers_survival_count.append( pd.Series(len(solo_grown_up_passengers), index = [2] ))


#Build the contingency table
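#The keys are passed as a list (not a set) so that the column labels keep the same order as the series
contingency_table = pd.concat([ group_grown_up_passengers_survival_count, solo_grown_up_passengers_survival_count, grown_up_survival_count],\
                              axis=1, \
                              keys =['Group','Solo','Total'])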

contingency_table.rename(index= { 0 : 'Victims', 1 : 'Survivors', 2 : 'Total'},\
                         inplace = True)

contingency_table
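A more compact way to build this kind of table is pd.crosstab, which counts the combinations of two columns directly. The following is a minimal sketch on made-up rows: toy_passengers and its values are invented for illustration, and only the isSolo and Survived column names come from the notebook. The totals are left out, since a chi-square test only needs the observed cell counts.

```python
import pandas as pd

# Toy data, invented for illustration only (not the Titanic dataset)
toy_passengers = pd.DataFrame({
    'isSolo':   [True, True, True, False, False, False, False],
    'Survived': [0, 0, 1, 1, 1, 0, 1],
})

# Readable labels for the two categorical variables
outcome = toy_passengers['Survived'].map({0: 'Victims', 1: 'Survivors'})
companionship = toy_passengers['isSolo'].map({True: 'Solo', False: 'Group'})

# Rows: survival outcome, columns: companionship, cells: number of passengers
toy_table = pd.crosstab(outcome, companionship)

print(toy_table)
```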

Next, it is the Chi square for the adult passengers, by companionship:

chi_squared, p, degrees_of_freedom, expected_frequency = scipy.stats.chi2_contingency( contingency_table )

print "Chi Squared: ", chi_squared
print "p value: ", p
print "Degrees of Freedom", degrees_of_freedom
print "Expected Frequency for The Victims:", expected_frequency[0]
print "Expected Frequency for The Survivors:", expected_frequency[1]

Chi Squared:  20.8162744573
p value:  0.000344364339067
Degrees of Freedom 4
Expected Frequency for The Victims: [ 139.50162866  239.49837134  379.        ]
Expected Frequency for The Survivors: [  86.49837134  148.50162866  235.        ]

Well, after removing the children the p value rose by several orders of magnitude, but the result is still statistically significant even under the most conservative standards. If companionship had no effect, the chance of drawing a sample at least this extreme would be about 1 in 3000, which is very low indeed.
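The "1 in 3000" figure can be reproduced from the printed statistic. This is a small sketch using scipy.stats.chi2.sf, the survival function of the chi-square distribution; the two input values are copied from the output above, and the variable names are illustrative.

```python
from scipy.stats import chi2

# Values copied from the printed output above
chi_squared = 20.8162744573
degrees_of_freedom = 4

# Survival function = 1 - CDF: probability of a statistic at least this large under the null
p_value = chi2.sf(chi_squared, degrees_of_freedom)

# Express the p value as "about 1 in N"
one_in_n = 1.0 / p_value

print("p value: %.6f, about 1 in %d" % (p_value, round(one_in_n)))
```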

Calculate The Effect Size (Cohen d)¶

#Simple average of the two standard deviations (not the pooled standard deviation)
average_std = (group_passengers['Survived'].std() + solo_passengers['Survived'].std()) / 2
cohens_d = abs(solo_passengers['Survived'].mean() - group_passengers['Survived'].mean() )/average_std
print "Cohen d: ", cohens_d

Cohen d:  0.43391833554
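For comparison, Cohen's d is often computed with a pooled standard deviation, which weights each group by its degrees of freedom instead of averaging the two standard deviations. The sketch below reproduces the value above from the describe() summaries and then computes the pooled variant; the variable names are illustrative, and the pooled version is shown as an alternative, not as the method used in this analysis.

```python
import math

# Summary values copied from the describe() outputs of the 'Survived' column above
group_n, group_mean, group_std = 370, 0.505405, 0.500648
solo_n, solo_mean, solo_std = 521, 0.297505, 0.457600

mean_difference = abs(solo_mean - group_mean)

# Variant used above: simple average of the two standard deviations
d_average_std = mean_difference / ((group_std + solo_std) / 2)

# Common alternative: pooled standard deviation, weighted by each group's degrees of freedom
pooled_std = math.sqrt(((group_n - 1) * group_std ** 2 + (solo_n - 1) * solo_std ** 2)
                       / (group_n + solo_n - 2))
d_pooled_std = mean_difference / pooled_std

print("Cohen d (average std): %.3f" % d_average_std)
print("Cohen d (pooled std):  %.3f" % d_pooled_std)
```

With group sizes this large the two variants give very similar values, so the choice does not change the reading of the effect size.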

Conclusion For Q1:¶

p < α and Χ² > Χ²_Critical

Χ² = 20.82, p < 0.001, two tailed

Effect Size Measures:

d = 0.43
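The comparison of Χ² with its critical value can be checked with scipy.stats.chi2.ppf, the inverse of the cumulative distribution function. This is a minimal sketch that assumes a significance level of α = 0.05, which the text does not state explicitly, together with the degrees of freedom reported by chi2_contingency.

```python
from scipy.stats import chi2

alpha = 0.05              # assumed significance level (not stated in the text)
degrees_of_freedom = 4    # as reported by chi2_contingency above
chi_squared = 20.82       # statistic for the grown-up passengers

# Critical value: the point that cuts off the upper alpha tail of the chi-square distribution
chi_squared_critical = chi2.ppf(1 - alpha, degrees_of_freedom)

print("Critical value: %.3f" % chi_squared_critical)
print("Reject the null hypothesis: %s" % (chi_squared > chi_squared_critical))
```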

NB: I have based this way of writing up the conclusion on the book "Statistics in Plain English".

A chi-square analysis was performed to determine whether traveling companionship affected the chances of survival for a passenger. The analysis produced a significant χ² value (39.54, df = 4, p < .001), indicating that traveling with first-degree family affected the chances of survival. The question was then refined by removing the child passengers, since they had the best survival chances and were absent from the solo passengers group because of how we defined a solo traveler. The new χ² value remained significant (20.82, df = 4, p < .001), although its p value was several orders of magnitude larger than for the original question. Therefore, we must reject the null hypothesis.

One limitation I see in this analysis is the gender bias between the two groups. The majority of solo travelers are men (mostly from the third class), and that can be an alternative explanation for the significance. It might have been appropriate to explore the question further by separating each group by gender and then comparing within each gender (i.e. solo women vs group women, and solo men vs group men), but unfortunately the count of adult men among the group travelers was too low to perform such a test. If the dataset were complete, such a test might have been feasible.

Another limitation is that, since the question under investigation involved categorizing passengers by age, missing ages meant that the test was run on only a fraction of the sample rather than all of it. I opted to omit passengers with missing ages rather than make further assumptions (such as assuming their age is the mean\median, or that their age distribution is the same as that of the bigger sample), since I already had a good amount of data to run the test; there was no need to take any risks by making extra assumptions.

The data field that I would have liked to have, although it can be hard to get (especially for adult passengers), is who each passenger was traveling with other than the immediate family. Cousins, friends, etc. could have proved useful for such a question. The reason I picked this question in the first place is that I imagined what the situation on deck might have been during such a hard time, and I thought that a group of people who care for each other can be useful in such an extreme situation: they can push, fight, beg or even do something illegal like bribe to save the rest of the group, a privileged support that a solo traveler would not have. Of course the immediate family might be the most aggressive\protective, but cousins and close friends can still provide comparable support.

Q2: Is It Possible From The Provided Data To Identify Family Members?¶

The problem with this question is that this is only partial data, so the numbers will not add up: there could be other family members in the other set. But I will proceed anyway. Here are my assumptions:

The same family travels within the same class
The same family bears the same last name
Members of the same family share the same number of TotalCompanions
Children with TotalCompanions == 0 will be excluded. Common sense says that they are not traveling alone, but according to the provided data they are not traveling with their immediate family.
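The assumptions above amount to using (TotalCompanions, Pclass, LastName) as a family key, and they also give a way to tell whether a whole family is present in the data: the number of rows sharing a key should equal TotalCompanions + 1. A minimal sketch on made-up rows follows; the family names and values are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

# Toy data, invented for illustration only; the family names are hypothetical
toy_passengers = pd.DataFrame({
    'LastName':        ['Alpha', 'Alpha', 'Beta', 'Beta', 'Gamma'],
    'Pclass':          [1, 1, 3, 3, 2],
    'TotalCompanions': [1, 1, 1, 1, 2],
})

# One row per family key, with the number of members found in the data
families = toy_passengers.groupby(['TotalCompanions', 'Pclass', 'LastName']) \
                         .size().reset_index(name='MembersInDataset')

# Each passenger reports the number of OTHER family members, so the family size is one more
families['MembersOnBoard'] = families['TotalCompanions'] + 1

# A family is complete when every member on board also appears in the data
families['Complete'] = families['MembersInDataset'] == families['MembersOnBoard']

print(families)
```

In the toy frame the Gamma family reports three members on board but only one row is present, which is the "numbers will not add up" situation described above.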

group_family = group_passengers[ group_passengers['TotalCompanions'] > 0 ]
group_by_family_size = group_family.groupby(['TotalCompanions','Pclass', 'LastName'])#, sort=False)



#We are going to create three lists, one for all the families whose members have NOT survived
#Another for the families which had some members who survived but not all
#The third is for the families whose all members were able to make it
families_totally_perished = []
families_partially_saved = []
families_totally_saved = []

#Now, let's loop over the groupby of family by size created above, and check which family belongs 
#to which of the three lists created above

#Loop over the group
#tc = TotalCompanions, pcls = pclass, lname = LastName
for (tc, pcls, lname), data in group_by_family_size:
    #Count how many members survived within the same family
    saved_count = (data['Survived'] == 1).sum()

    #The family size is always greater than the total companions by one. This is because the count always
    #tells the number of other family members within the family of the passenger, but does not count the
    #passenger himself\herself
    family_size = tc + 1

    #Now let us check the count to decide which of the 3 lists we are going to add the family to
    if saved_count == 0:
        families_totally_perished.append(data)
    elif saved_count != family_size:
        families_partially_saved.append(data)
    else:
        families_totally_saved.append(data)

        
#A function that will display each list as a table. The three lists to display are:
#families_totally_perished
#families_partially_saved
#families_totally_saved
def display_family_list(family_list):
    #Loop over each family within the passed list
    for family in family_list:
        #Create the container dataframe. This data frame will hold temporarily all members that belong to the same family
        #Later on, this dataframe will be used to display the family in a table
        df = pd.DataFrame( columns=("Generation", 'Sex', 'Age', 'Name', 'Class','PassengerId'))

        #Print the name of the family, and then print on a new line how many members the family has
        print "The " + str(family.iloc[0]['LastName']) + " Family: "
        print "The Family had " + str(family.iloc[0]['TotalCompanions'] + 1) + " members on board"


        #Write how many family members from the current family were present in the data set
        #If the number of family members within the set is equal to the total family size (Companions + 1),
        #then print that all family members were present in the data set
        if( (len(family)) == (family.iloc[0]['TotalCompanions'] + 1) ):
            print "All of the family members were available in the dataset"
        #Else, write how many family members were present in the data set
        else:
            #Just some nice formatting, divide the singular and plural
            if(len(family) == 1):
                print ' --> Only one member was available in the dataset'
            else:
                print " --> Only "+ str(len(family)) + " members were available in the dataset"

        #Loop over each family member within the current family
        for i in range(len(family)):
            #Append the current family member to the dataframe, so that member would be displayed inside the same table.
            df = df.append( {"Generation":family.iloc[i]['Generation'],\
                             'Sex':family.iloc[i]['Sex'], \
                             'Age':family.iloc[i]['Age'], \
                             'Name': family.iloc[i]['Name'],\
                             'Class': family.iloc[i]['Pclass'],\
                             'PassengerId': family.iloc[i]['PassengerId'] } , ignore_index=True )
            #df = df.append([1,2,3,4],ignore_index=True)
        display_html(df)
        print '_________________________________________________________________'

Families Whose Members Within The Dataset Never Made It¶

display_family_list(families_totally_perished)

The Cavendish Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Chaffee Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Davidson Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Douglas Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Marvin Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Natsch Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Ostby Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The White Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Williams Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Bryhl Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

Families Whose Members Within The Dataset Partially Survived¶

display_family_list(families_partially_saved)

The Andrews Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Astor Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Baxter Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Bowerman Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Cardeza Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Chibnall Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Cumings Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Eustis Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Frauenthal Family: 
The Family had 2 members on board
 --> Only one member was available in the dataset

_________________________________________________________________
The Futrelle Family: 
The Family had 2 members on board
All of the family members were available in the dataset

Families Whose Members Within The Dataset All Survived¶

display_family_list(families_totally_saved)

The Bishop Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Chambers Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Dick Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Duff Gordon Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Goldenberg Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Harper Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Hippach Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Hoyt Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Newell Family: 
The Family had 2 members on board
All of the family members were available in the dataset

_________________________________________________________________
The Taylor Family: 
The Family had 2 members on board
All of the family members were available in the dataset

The dataset

	Adults	Children	Elderly	Unspecified Age
First Class	133	9	44	30
Second Class	133	21	19	11
Third Class	274	70	44	103

	First Class	Second Class	Third Class
Male Survivor	3	9	10
Male Victim	0	2	27
Female Survivor	5	10	18
Female Victim	1	0	15

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	male	21	White, Mr. Richard Frasar	1	103
1	Elderly	male	54	White, Mr. Percival Wayland	1	125

	Generation	Sex	Age	Name	Class	PassengerId
0	Elderly	male	54	Carter, Rev. Ernest Courtenay	2	250
1	Adult	female	44	Carter, Mrs. Ernest Courtenay (Lilian Hughes)	2	855

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	27	Turpin, Mrs. William John Robert (Dorothy Ann ...	2	42
1	Adult	male	29	Turpin, Mr. William John Robert	2	118

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	18	Arnold-Franchi, Mrs. Josef (Josefine Franchi)	3	50
1	Adult	male	25	Arnold-Franchi, Mr. Josef	3	354

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	45	Barbara, Mrs. (Catherine David)	3	363
1	Adult	female	18	Barbara, Miss. Saiide	3	703

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	male	22	Braund, Mr. Owen Harris	3	1
1	Adult	male	29	Braund, Mr. Lewis Richard	3	478

	Generation	Sex	Age	Name	Class	PassengerId
0	Unspecified age	male	NaN	Hagland, Mr. Ingvald Olai Olsen	3	452
1	Unspecified age	male	NaN	Hagland, Mr. Konrad Mathias Reiersen	3	491

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	20	Jussila, Miss. Katriina	3	114
1	Adult	female	21	Jussila, Miss. Mari Aina	3	403

	Survived	Victims	Total, Group
Group	187	183	370
Solo	155	366	521
Total, State	342	549	891
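
The counts in the table above are easier to compare as survival rates. The sketch below re-enters the four cell counts by hand and divides survivors by each row total. The variable names (`counts`, `totals`, `survival_rate`, `overall_rate`) are illustrative and not taken from the notebook.

```python
import pandas as pd

# Cell counts copied from the table above
# (rows: travelling in a group or solo, columns: outcome).
counts = pd.DataFrame(
    {'Survived': [187, 155], 'Victims': [183, 366]},
    index=['Group', 'Solo'],
)

# Row totals; these should reproduce the 'Total, Group' column (370 and 521).
totals = counts.sum(axis=1)

# Survival rate per row = survivors / everyone in that row.
survival_rate = counts['Survived'].astype(float) / totals

# Overall survival rate across all 891 passengers.
overall_rate = counts['Survived'].sum() / float(counts.values.sum())

print(survival_rate.round(3))
print(round(overall_rate, 3))
```

Comparing the two row rates against the overall rate shows how far each group departs from the ship-wide average. Whether the gap is statistically significant would need a separate test, which this sketch does not attempt.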

	Solo	Total	Group
Survived
Victims	113	266	379
Survivors	113	122	235
Total	226	388	614

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	male	30	Lobb, Mr. William Arthur	3	254
1	Adult	female	26	Lobb, Mrs. William Arthur (Cordelia K Stanlick)	3	618

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	female	14.5	Zabour, Miss. Hileni	3	112
1	Unspecified age	female	NaN	Zabour, Miss. Thamine	3	241

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	male	21	Hickman, Mr. Stanley George	2	121
1	Adult	male	24	Hickman, Mr. Leonard Mark	2	656
2	Adult	male	32	Hickman, Mr. Lewis	2	666

	Generation	Sex	Age	Name	Class	PassengerId
0	Unspecified age	female	NaN	Boulos, Mrs. Joseph (Sultana)	3	141
1	Child	female	9	Boulos, Miss. Nourelain	3	853

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	male	40	Bourke, Mr. John	3	189
1	Unspecified age	female	NaN	Bourke, Miss. Mary	3	594
2	Adult	female	32	Bourke, Mrs. John (Catherine)	3	658

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	28	Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria ...	3	424
1	Adult	male	34	Danbom, Mr. Ernst Gilbert	3	617

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	15	Elias, Mr. Tannous	3	353
1	Adult	male	17	Elias, Mr. Joseph Jr	3	533

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	male	37	Gustafsson, Mr. Anders Vilhelm	3	105
1	Adult	male	28	Gustafsson, Mr. Johan Birger	3	393

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	41	Rosblom, Mrs. Viktor (Helena Wilhelmina)	3	255
1	Adult	male	18	Rosblom, Mr. Viktor Richard	3	425

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	female	10	Van Impe, Miss. Catharina	3	420
1	Adult	male	36	Van Impe, Mr. Jean Baptiste	3	596
2	Adult	female	30	Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...	3	800

	Generation	Sex	Age	Name	Class	PassengerId
0	Adult	female	18	Vander Planke, Miss. Augusta Maria	3	39
1	Child	male	16	Vander Planke, Mr. Leo Edmondus	3	334

	Generation	Sex	Age	Name	Class	PassengerId
0	Unspecified age	male	NaN	Johnston, Mr. Andrew G	3	784
1	Unspecified age	female	NaN	Johnston, Miss. Catherine Helen "Carrie"	3	889

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	16	Ford, Mr. William Neal	3	87
1	Child	female	9	Ford, Miss. Robina Maggie "Ruby"	3	148
2	Adult	female	21	Ford, Miss. Doolina Margaret "Daisy"	3	437
3	Adult	female	48	Ford, Mrs. Edward (Margaret Ann Watson)	3	737

	Generation	Sex	Age	Name	Class	PassengerId
0	Unspecified age	male	NaN	Lefebre, Master. Henry Forbes	3	177
1	Unspecified age	female	NaN	Lefebre, Miss. Mathilde	3	230
2	Unspecified age	female	NaN	Lefebre, Miss. Ida	3	410
3	Unspecified age	female	NaN	Lefebre, Miss. Jeannie	3	486

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	2	Palsson, Master. Gosta Leonard	3	8
1	Child	female	8	Palsson, Miss. Torborg Danira	3	25
2	Child	female	3	Palsson, Miss. Stina Viola	3	375
3	Adult	female	29	Palsson, Mrs. Nils (Alma Cornelia Berglund)	3	568

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	7	Panula, Master. Juha Niilo	3	51
1	Child	male	1	Panula, Master. Eino Viljami	3	165
2	Child	male	16	Panula, Mr. Ernesti Arvid	3	267
3	Adult	female	41	Panula, Mrs. Juha (Maria Emilia Ojala)	3	639
4	Child	male	14	Panula, Mr. Jaako Arnold	3	687
5	Child	male	2	Panula, Master. Urho Abraham	3	825

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	2	Rice, Master. Eugene	3	17
1	Child	male	4	Rice, Master. Arthur	3	172
2	Child	male	7	Rice, Master. Eric	3	279
3	Child	male	8	Rice, Master. George Hugh	3	788
4	Adult	female	39	Rice, Mrs. William (Margaret Norton)	3	886

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	4	Skoog, Master. Harald	3	64
1	Adult	female	45	Skoog, Mrs. William (Anna Bernhardina Karlsson)	3	168
2	Adult	male	40	Skoog, Mr. Wilhelm	3	361
3	Child	female	9	Skoog, Miss. Mabel	3	635
4	Child	female	2	Skoog, Miss. Margit Elizabeth	3	643
5	Child	male	10	Skoog, Master. Karl Thorsten	3	820

	Generation	Sex	Age	Name	Class	PassengerId
0	Child	male	11	Goodwin, Master. William Frederick	3	60
1	Child	female	16	Goodwin, Miss. Lillian Amy	3	72
2	Child	male	1	Goodwin, Master. Sidney Leonard	3	387
3	Child	male	9	Goodwin, Master. Harold Victor	3	481
4	Adult	female	43	Goodwin, Mrs. Frederick (Augusta Tyler)	3	679
5	Child	male	14	Goodwin, Mr. Charles Edward	3	684

	Generation	Sex	Age	Name	Class	PassengerId
0	Unspecified age	male	NaN	Sage, Master. Thomas Henry	3	160
1	Unspecified age	female	NaN	Sage, Miss. Constance Gladys	3	181
2	Unspecified age	male	NaN	Sage, Mr. Frederick	3	202
3	Unspecified age	male	NaN	Sage, Mr. George John Jr	3	325
4	Unspecified age	female	NaN	Sage, Miss. Stella Anna	3	793
5	Unspecified age	male	NaN	Sage, Mr. Douglas Bullen	3	847
6	Unspecified age	female	NaN	Sage, Miss. Dorothy Edith "Dolly"	3	864