Name: Emad Takla
DAND P2
Dataset chosen: Titanic
import unicodecsv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from IPython.display import HTML
from IPython.display import display_html
#Plot all figures within the html page
%pylab inline
#Make display better, like printing dataframes as tables
pd.set_option('display.notebook_repr_html', True)
titanic_all_data = pd.read_csv('./titanic_data.csv')
A function to map an age to a generation bucket: Child (0 to 16 years old), Adult (17 to 49 years old) and Elderly (50 years old and over).
def age_to_generation(age):
    #Check if the input is negative, return an error label
    if age < 0:
        return 'Invalid age'
    #If younger than 17, then it is classified as a child
    if age < 17:
        return 'Child'
    #Else if it is between 17 and 49 (both inclusive), the passed age is for an adult
    if age < 50:
        return 'Adult'
    #Else if it is 50 or over, the person is elderly
    if age >= 50:
        return 'Elderly'
    #Else, the input age was unspecified (blank cell, NaN): every comparison above is False for NaN
    return 'Unspecified age'
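The fall-through to 'Unspecified age' depends on the fact that every ordering comparison against NaN evaluates to False. A minimal sketch of that behaviour follows; it restates the function so it runs on its own, and the sample ages are made up for illustration.

```python
def age_to_generation(age):
    # Negative ages are treated as invalid input
    if age < 0:
        return 'Invalid age'
    if age < 17:
        return 'Child'
    if age < 50:
        return 'Adult'
    if age >= 50:
        return 'Elderly'
    # Reached only when every comparison above was False, which is the case for NaN
    return 'Unspecified age'

nan = float('nan')

# All four comparisons are False for NaN, so no branch above is taken
comparisons = [nan < 0, nan < 17, nan < 50, nan >= 50]
print(comparisons)

# Illustrative ages around the bucket boundaries
for age in [0.42, 16, 17, 49, 50, nan]:
    print('{} -> {}'.format(age, age_to_generation(age)))
```

Note that this only covers numeric input; a non-numeric value such as a string is not handled by the comparisons.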
A function to extract the last name from the 'Name' column. The assumption is that the format is as follows: 'last_name, title. first_name (Alias/Maiden Name)'
def get_last_name(full_name):
    return full_name.split(',')[0]
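To make the format assumption concrete, here is a small sketch of what the split returns for names in the assumed 'last_name, title. first_name' layout, and what happens when the comma is missing. The sample strings are only illustrations of the format.

```python
def get_last_name(full_name):
    # Everything before the first comma is taken as the family name
    return full_name.split(',')[0]

# Sample names in the assumed format
examples = [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
]
last_names = [get_last_name(name) for name in examples]
print(last_names)

# A name without a comma comes back unchanged, so malformed rows pass through silently
print(get_last_name('Unknown Passenger'))
```

Because a malformed name is returned whole rather than rejected, it may be worth checking that every 'Name' value contains a comma before relying on 'LastName'.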
A function to plot a population pyramid, a horizontal histogram bar chart plotted back to back
# Plotting code coming from http://stackoverflow.com/questions/27694221/using-python-libraries-to-plot-two-horizontal-bar-charts-sharing-same-y-axis
#Binning Code coming from http://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values
def plot_population_pyramid(bins, left_title, left_data, right_title, right_data, largest_x_value):
    fig, axes = plt.subplots(ncols=2, sharey=True)
    #largest_x_value:
    #this variable is used to preserve scale. Without it, each graph would extend to the maximum value of its own data,
    #which would be visually misleading
    #largest_x_value = max(left_data.max(), right_data.max())
    axes[0].barh(bins, left_data, align='center')
    axes[0].set(title=left_title)
    axes[0].set_xlim(0, largest_x_value)
    axes[1].barh(bins, right_data, align='center')
    axes[1].set(title=right_title)
    axes[1].set_xlim(0, largest_x_value)
    axes[0].invert_xaxis()
    axes[0].set(yticks=bins)
    axes[0].yaxis.tick_right()
Two functions to count the number of survivors/victims within a passed GroupBy object. The functions accept a GroupBy object, not a DataFrame!
The need for these functions arose because the input to groupby can happen to contain only victims or only survivors, in which case no group exists for the opposite value. This raised errors when I looped over the groups programmatically, so I had to check that the group was present in the data structure before using it.
def get_surviving_count(group_obj):
    #The values in Survived are either 1 or 0, with 1 indicating survivors
    if 1 in group_obj.groups.keys():
        return group_obj.get_group(1)['PassengerId'].count()
    else:
        return 0
def get_victim_count(group_obj):
    #The values in Survived are either 1 or 0, with 0 indicating victims
    if 0 in group_obj.groups.keys():
        return group_obj.get_group(0)['PassengerId'].count()
    else:
        return 0
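The missing-group situation can be reproduced on a tiny, made-up dataframe in which every passenger survived. This is only a sketch: the data and the helper name count_in_group are hypothetical, and the helper simply generalises the two functions above to any key.

```python
import pandas as pd

# Hypothetical mini data set in which every passenger survived
toy = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [1, 1, 1]})
by_survival = toy.groupby('Survived')

def count_in_group(group_obj, key):
    # Return 0 instead of raising when the key has no group
    if key in group_obj.groups:
        return group_obj.get_group(key)['PassengerId'].count()
    return 0

try:
    by_survival.get_group(0)
except KeyError:
    print('get_group(0) raised KeyError: there is no victim group')

survivors = count_in_group(by_survival, 1)
victims = count_in_group(by_survival, 0)
print('survivors: {}, victims: {}'.format(survivors, victims))
```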
Graphing Helper Functions and Variables.
gender_colors = ['hotpink','dodgerblue']
survival_colors = ['red', 'limegreen']
The following function accepts a groupby object that must group its original input only by Survived (i.e. at most two groups: group 0 (victim) and group 1 (survived)).
def plot_survival_pie_chart(group_obj, label):
    graph_label = []
    survival_colors = []
    if 0 in group_obj.groups.keys():
        graph_label.append('Victim')
        survival_colors.append('red')
    if 1 in group_obj.groups.keys():
        graph_label.append('Survived')
        survival_colors.append('limegreen')
    group_obj['PassengerId'].count().plot.pie(label = label, autopct='%1.1f%%', colors=survival_colors,labels=graph_label)
A function to get the count of passengers in a passed dataframe
def get_count(df):
    return df['PassengerId'].count()
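A function to draw pie-chart subplots. It expects a figure named fig to exist already (created with plt.figure) and picks the color list from graph_type. A variant that also accepts a labels parameter is planned.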
def draw_pie_subplot(groupby_data, subplot_position, graph_label, graph_type):
    graph_colors = []
    if(graph_type == 'SURVIVAL_GRAPH'):
        graph_colors = survival_colors
    elif(graph_type == 'GENDER_GRAPH'):
        graph_colors = gender_colors
    fig.add_subplot(subplot_position)
    #If it's a survival graph, use the already created function to plot it
    if (graph_type == 'SURVIVAL_GRAPH'):
        plot_survival_pie_chart(groupby_data ,graph_label )
    #Default colors. Do not pass the colors parameter to the plotting function
    elif graph_type == 'DEFAULT':
        get_count(groupby_data).plot.pie(label = graph_label, autopct='%1.1f%%')
    else:
        get_count(groupby_data).plot.pie(label = graph_label, \
                                         autopct='%1.1f%%',\
                                         colors = graph_colors)
    plt.axis('equal')
http://www.durhamcollege.ca/wp-content/uploads/STAT_nullalternate_hypothesis.pdf
Null Hypothesis: There is NO difference in the survival rate of passengers traveling with their immediate family and that of passengers traveling alone.
Alternate Hypothesis: There IS a difference in the survival rate of passengers traveling with their immediate family and that of passengers traveling alone.
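One common way to test a pair of hypotheses like these is a chi-square test of independence on a 2x2 table of counts. The sketch below is an assumption about how such a test could be run, not necessarily the test used in this notebook, and the counts are made up purely for illustration. The helper name chi_square_2x2 is hypothetical; it computes the Pearson statistic by hand, and for one degree of freedom the p-value equals erfc(sqrt(x / 2)).

```python
import math

def chi_square_2x2(table):
    # table = [[a, b], [c, d]]: rows are groups, columns are outcomes
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = float(sum(row_totals))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if group and outcome were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts, NOT taken from the Titanic data:
# rows = (with family, alone), columns = (survived, died)
counts = [[50, 50], [30, 70]]
statistic, p_value = chi_square_2x2(counts)
print('chi-square = {:.3f}, p = {:.4f}'.format(statistic, p_value))
```

A small p-value would argue for rejecting the null hypothesis at the chosen significance level; the real counts would come from grouping the passengers by 'isSolo' and 'Survived'.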
This is more of a data investigation question than a statistical inference one. The data provides the last names, ages, traveling class and companionship (spouses, siblings, parents, children). Are these enough to make (partial) educated guesses about which passengers belong to the same family? And if so, can we infer whether having a bigger family improved the chances of survival? (Possibly, since the rest of the family would pressure the crew to allow a left-behind family member onto the lifeboats.)
An interesting feature to show is the generation to which the passenger belongs. There are three categories for that parameter (Child, Adult and Elderly), based on age range. The breakdown is as follows: Child (0 to 16 years old), Adult (17 to 49 years old) and Elderly (50 years old and over).
Missing ages (NaN, blank fields) are labeled "Unspecified age", and negative numbers are labeled "Invalid age".
The 'isSolo' field is used to tell whether a passenger is traveling with his/her family or not. The criteria used: a passenger is solo only if both SibSp and Parch are zero and the passenger is not classified as a Child; children are never counted as solo, since they may be traveling with a nanny or a close family friend.
The total number of companions is the sum of the fields Parch and SibSp. It will be used to create some descriptive statistics about companionship.
The 'LastName' field simply extracts the last name of the passenger from the full name provided. This will be useful in answering the second question.
#Adding the generation data field to the table. The generation can have three values: Child, Adult or Elderly
#The buckets ranges were arbitrarily chosen
titanic_all_data['Generation'] = titanic_all_data.loc[:,'Age'].apply(age_to_generation)
"""
If the traveler is a child (under 17 years old), then they are automatically not a solo traveler (they can be traveling with a
nanny or a close family friend, etc.). If the traveler has a non-zero value in either SibSp or Parch, then they are not
solo either. Otherwise, they are a solo passenger, traveling alone.
"""
#This one was tough: using the 'and' operator raised a ValueError, and it took me a while to find out that
#I should substitute it with the element-wise & operator
titanic_all_data['isSolo'] = (titanic_all_data['SibSp'] + titanic_all_data['Parch'] == 0) & \
(titanic_all_data['Generation'] != 'Child')
titanic_all_data['TotalCompanions'] = titanic_all_data['Parch'] + titanic_all_data['SibSp']
#Stripping the last name of the passengers. This will be used to determine which passengers probably belong to the same family
titanic_all_data['LastName'] = titanic_all_data.loc[:,'Name'].apply(get_last_name)
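The ValueError mentioned in the comments above comes from Python's `and`, which asks each Series for a single truth value, whereas `&` combines two boolean Series element by element. Here is a minimal sketch on made-up rows; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the ones used above
toy = pd.DataFrame({
    'SibSp': [0, 1, 0],
    'Parch': [0, 0, 0],
    'Generation': ['Adult', 'Adult', 'Child'],
})

no_companions = (toy['SibSp'] + toy['Parch']) == 0
not_child = toy['Generation'] != 'Child'

try:
    # `and` needs one True/False per operand, which is ambiguous for a Series
    no_companions and not_child
except ValueError as err:
    print('and failed: {}'.format(err))

# `&` works element by element; each comparison needs its own parentheses
toy['isSolo'] = no_companions & not_child
print(toy)
```

The parentheses matter because `&` binds more tightly than `==` and `!=`.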
#Passengers by Gender
male_passengers = titanic_all_data[titanic_all_data['Sex'] == 'male']
female_passengers = titanic_all_data[titanic_all_data['Sex'] == 'female']
#Passengers by Generation
children_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Child' ]
adult_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Adult' ]
elderly_passengers = titanic_all_data[ titanic_all_data['Generation'] == 'Elderly' ]
#Passengers by Survival
surviving_passengers = titanic_all_data[ titanic_all_data['Survived'] == 1 ]
victim_passengers = titanic_all_data[ titanic_all_data['Survived'] == 0 ]
In my opinion, groups are very straightforward when it comes to plotting, but slicing dataframes is better for calculations. It is just easier for me.
passengers_by_gender = titanic_all_data.groupby('Sex')
passengers_by_generation = titanic_all_data.groupby('Generation')
passengers_by_class = titanic_all_data.groupby('Pclass')
passengers_by_survival = titanic_all_data.groupby('Survived')
passengers_by_companionship = titanic_all_data.groupby('isSolo')
passengers_by_class_and_gender = titanic_all_data.groupby(['Pclass', 'Sex'])
passengers_by_class_and_generation = titanic_all_data.groupby(['Pclass', 'Generation'])
passengers_by_class_and_survival = titanic_all_data.groupby(['Pclass', 'Survived'])
passengers_by_generation_and_gender = titanic_all_data.groupby(['Generation', 'Sex'])
I have followed the instructions in the previous review by using groupby instead of slicing. However, I still cannot see the advantage of that. In fact, I think that the first way was more readable. For example: first_class_passengers[ first_class_passengers['Survived'] == 0 ] vs passengers_by_class_and_survival.get_group( (1,0) )
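For reference, the two styles select the same rows. The sketch below checks this on a small made-up dataframe; the values are hypothetical and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Pclass': [1, 1, 3, 1],
    'Survived': [0, 1, 0, 0],
})

# Style 1: boolean-mask slicing
sliced = toy[(toy['Pclass'] == 1) & (toy['Survived'] == 0)]

# Style 2: a two-key groupby, then picking the (class, survived) group
grouped = toy.groupby(['Pclass', 'Survived']).get_group((1, 0))

sliced_ids = sliced['PassengerId'].tolist()
grouped_ids = grouped['PassengerId'].tolist()
print(sliced_ids, grouped_ids)
```

One practical difference is that the groupby object is built once and can then be queried for every (class, survived) pair, whereas each slice repeats its comparisons.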
#Passengers by class
########################################################################################################################
#First Class Data Splitting:
first_class_passengers = passengers_by_class.get_group(1)
#Gender
first_class_male_passengers = passengers_by_class_and_gender.get_group( (1,'male') )
first_class_female_passengers = passengers_by_class_and_gender.get_group( (1,'female') )
#Children
first_class_children_passengers = passengers_by_class_and_generation.get_group( (1, 'Child') )
first_class_children_male_passengers = first_class_children_passengers.groupby('Sex').get_group('male')
first_class_children_female_passengers = first_class_children_passengers.groupby('Sex').get_group('female')
#Adults
first_class_adult_passengers = passengers_by_class_and_generation.get_group( (1, 'Adult') )
first_class_adult_male_passengers = first_class_adult_passengers.groupby('Sex').get_group('male')
first_class_adult_female_passengers = first_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
first_class_elderly_passengers = passengers_by_class_and_generation.get_group( (1, 'Elderly') )
first_class_elderly_male_passengers = first_class_elderly_passengers.groupby('Sex').get_group('male')
first_class_elderly_female_passengers = first_class_elderly_passengers.groupby('Sex').get_group('female')
#First Class Survival
first_class_survivors = passengers_by_class_and_survival.get_group( (1,1) )
first_class_victims = passengers_by_class_and_survival.get_group( (1,0) )
########################################################################################################################
#Second Class Data Splitting:
second_class_passengers = passengers_by_class.get_group(2)
#Gender
second_class_male_passengers = passengers_by_class_and_gender.get_group( (2,'male') )
second_class_female_passengers = passengers_by_class_and_gender.get_group( (2,'female') )
#Children
second_class_children_passengers = passengers_by_class_and_generation.get_group( (2, 'Child') )
second_class_children_male_passengers = second_class_children_passengers.groupby('Sex').get_group('male')
second_class_children_female_passengers = second_class_children_passengers.groupby('Sex').get_group('female')
#Adults
second_class_adult_passengers = passengers_by_class_and_generation.get_group( (2, 'Adult') )
second_class_adult_male_passengers = second_class_adult_passengers.groupby('Sex').get_group('male')
second_class_adult_female_passengers = second_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
second_class_elderly_passengers = passengers_by_class_and_generation.get_group( (2, 'Elderly') )
second_class_elderly_male_passengers = second_class_elderly_passengers.groupby('Sex').get_group('male')
second_class_elderly_female_passengers = second_class_elderly_passengers.groupby('Sex').get_group('female')
#second Class Survival
second_class_survivors = passengers_by_class_and_survival.get_group( (2,1) )
second_class_victims = passengers_by_class_and_survival.get_group( (2,0) )
########################################################################################################################
#Third Class Data Splitting:
third_class_passengers = passengers_by_class.get_group(3)
#Gender
third_class_male_passengers = passengers_by_class_and_gender.get_group( (3,'male') )
third_class_female_passengers = passengers_by_class_and_gender.get_group( (3,'female') )
#Children
third_class_children_passengers = passengers_by_class_and_generation.get_group( (3, 'Child') )
third_class_children_male_passengers = third_class_children_passengers.groupby('Sex').get_group('male')
third_class_children_female_passengers = third_class_children_passengers.groupby('Sex').get_group('female')
#Adults
third_class_adult_passengers = passengers_by_class_and_generation.get_group( (3, 'Adult') )
third_class_adult_male_passengers = third_class_adult_passengers.groupby('Sex').get_group('male')
third_class_adult_female_passengers = third_class_adult_passengers.groupby('Sex').get_group('female')
#Elderly
third_class_elderly_passengers = passengers_by_class_and_generation.get_group( (3, 'Elderly') )
third_class_elderly_male_passengers = third_class_elderly_passengers.groupby('Sex').get_group('male')
third_class_elderly_female_passengers = third_class_elderly_passengers.groupby('Sex').get_group('female')
#third Class Survival
third_class_survivors = passengers_by_class_and_survival.get_group( (3,1) )
third_class_victims = passengers_by_class_and_survival.get_group( (3,0) )
sample_size = get_count(titanic_all_data)
male_count = get_count(passengers_by_gender.get_group('male'))
female_count = get_count(passengers_by_gender.get_group('female'))
children_count = get_count(children_passengers)
children_male_count = get_count(passengers_by_generation_and_gender.get_group(('Child', 'male')) )
children_female_count = get_count(passengers_by_generation_and_gender.get_group(('Child', 'female')) )
adult_count = get_count(adult_passengers)
adult_male_count = get_count( passengers_by_generation_and_gender.get_group(('Adult', 'male')) )
adult_female_count = get_count( passengers_by_generation_and_gender.get_group(('Adult', 'female')) )
elderly_count = get_count(elderly_passengers)
elderly_male_count = get_count( passengers_by_generation_and_gender.get_group(('Elderly', 'male')) )
elderly_female_count = get_count( passengers_by_generation_and_gender.get_group(('Elderly', 'female')) )
first_class_count = get_count(first_class_passengers)
first_class_male_count = get_count(first_class_male_passengers)
first_class_female_count = get_count(first_class_female_passengers)
first_class_children_count = get_count(first_class_children_passengers)
first_class_children_male_count = get_count(first_class_children_male_passengers)
first_class_children_female_count = get_count(first_class_children_female_passengers)
first_class_adult_count = get_count(first_class_adult_passengers)
first_class_adult_male_count = get_count(first_class_adult_male_passengers)
first_class_adult_female_count = get_count(first_class_adult_female_passengers)
first_class_elderly_count = get_count(first_class_elderly_passengers)
first_class_elderly_male_count = get_count(first_class_elderly_male_passengers)
first_class_elderly_female_count = get_count(first_class_elderly_female_passengers)
second_class_count = get_count(second_class_passengers)
second_class_male_count = get_count(second_class_male_passengers)
second_class_female_count = get_count(second_class_female_passengers)
second_class_children_count = get_count(second_class_children_passengers)
second_class_children_male_count = get_count(second_class_children_male_passengers)
second_class_children_female_count = get_count(second_class_children_female_passengers)
second_class_adult_count = get_count(second_class_adult_passengers)
second_class_adult_male_count = get_count(second_class_adult_male_passengers)
second_class_adult_female_count = get_count(second_class_adult_female_passengers)
second_class_elderly_count = get_count(second_class_elderly_passengers)
second_class_elderly_male_count = get_count(second_class_elderly_male_passengers)
second_class_elderly_female_count = get_count(second_class_elderly_female_passengers)
third_class_count = get_count(third_class_passengers)
third_class_male_count = get_count(third_class_male_passengers)
third_class_female_count = get_count(third_class_female_passengers)
third_class_children_count = get_count(third_class_children_passengers)
third_class_children_male_count = get_count(third_class_children_male_passengers)
third_class_children_female_count = get_count(third_class_children_female_passengers)
third_class_adult_count = get_count(third_class_adult_passengers)
third_class_adult_male_count = get_count(third_class_adult_male_passengers)
third_class_adult_female_count = get_count(third_class_adult_female_passengers)
third_class_elderly_count = get_count(third_class_elderly_passengers)
third_class_elderly_male_count = get_count(third_class_elderly_male_passengers)
third_class_elderly_female_count = get_count(third_class_elderly_female_passengers)
#The PassengerId is present for all the passengers, but some columns have missing values. Which columns
#are they?
for column_id in titanic_all_data.columns.values:
    column_non_NaN_count = titanic_all_data[column_id].count()
    if column_non_NaN_count != sample_size:
        print(str(column_id) + ": missing " + str(sample_size - column_non_NaN_count) + " values")
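The same per-column tally can also be obtained without a loop by counting nulls directly. This is a sketch on a made-up dataframe; the column names follow the dataset but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some blank cells
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age': [22.0, np.nan, 38.0],
    'Cabin': [np.nan, np.nan, 'C85'],
})

# isnull() marks each missing cell, sum() counts them per column
missing = toy.isnull().sum()

# Keep only the columns that actually have missing values
missing = missing[missing > 0]
print(missing)
```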
I am going to leave the blank cells empty until I need to fill them with a value, if that is ever needed. For the time being, I am not sure, for example, whether I should fill the missing ages with the mean age or just treat them as zeros. Let us wait and see what need arises.
Not all passengers have known ages. An assumption: the passengers whose age is not known are evenly distributed across the whole age range, so the effect of missing this piece of information is minimal. Out of curiosity, the percentages of survivors and victims whose ages are unknown are given below:
#Get the count of both survivors and victims whose age is null
survivors_without_age_count = surviving_passengers['Age'].isnull().sum()
victims_without_age_count = victim_passengers['Age'].isnull().sum()
#Count the total of survivors and victims, so we can calculate the percentage of passengers with missing ages
total_survivors_count = get_count(surviving_passengers)
total_victims_count = get_count(victim_passengers)
print("Percentage of survivors with missing age: {} %".format(float(survivors_without_age_count)*100/total_survivors_count))
print("Percentage of victims with missing age: {} %".format(float(victims_without_age_count)*100/total_victims_count))
Some summary counts will be placed in dataframes, which are then used to display the data in an HTML table format. Although there are alternatives to this, using a dataframe to display tables was the easiest one.
#Rows to be displayed (a list rather than a set, so the values stay in the same order as the column names)
rows = [[male_count, female_count, sample_size]]
#create the data frame, in preparation for HTML table display
df = pd.DataFrame(rows, columns=['Male', 'Female',"Total"], index=["count"])
#Display the dataframe as an HTML table. I had to use this function since just writing "df" on a single line
#will not display the table when there is a plot coming afterwards in the same cell.
display_html(df)
#Plot the pie chart, all passengers by gender
passengers_by_gender.size().plot.pie(label="Gender Ratio of All Passengers", autopct='%1.1f%%', colors=gender_colors)
plt.axis('equal')
#This command removes some floating numbers that appear
plt.tight_layout()
#I had a problem creating the dataframe in the same manner as the previous one, since the order was not preserved
#(ie Children count was below column elderly for example)
unspecified_age_count = get_count(passengers_by_generation.get_group("Unspecified age"))
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'Children':children_count, 'Adult':adult_count,'Elderly':elderly_count,'Unspecified Age':unspecified_age_count},\
index=["count"])
#Display the table
display_html(df)
#Plot the pie chart of all the passengers, grouped by their generation
passengers_by_generation.size().plot.pie(label="Generation Ratio of All Passengers", autopct='%1.1f%%', startangle=90)
plt.axis('equal')
#This command removes some floating numbers that appear
plt.tight_layout()
The population pyramid of the passengers. Naturally, this excludes the passengers whose age was missing, so the total count here will not be equal to that of the whole sample.
#Lower edges of the age bins: 0, 5, ..., 70 (the last bin runs from 70 to 75 years old)
age_bins = range(0,75,5)
#Group both genders according to the age bin
grouped_female_ages = get_count(female_passengers.groupby( pd.cut( female_passengers["Age"], np.arange(0, 80, 5) ) ))
grouped_male_ages = get_count(male_passengers.groupby( pd.cut( male_passengers["Age"], np.arange(0, 80, 5) ) ))
#Get the highest count. Required so that both plots would keep the same scale when displayed
largest_x_value = max(grouped_female_ages.max(), grouped_male_ages.max())
#A helper function defined in the beginning of the file. Plots the age pyramid
plot_population_pyramid(age_bins, "Male Population", grouped_male_ages, "Female Population", grouped_female_ages, largest_x_value)
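The binning step above relies on pd.cut, which assigns each age to a half-open interval such as (20, 25]. Here is a small sketch with made-up ages showing how the 16 edges produce 15 bins and how a missing age is left out of the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([4.0, 5.0, 22.0, 24.0, 71.0, np.nan])

# 16 edges (0, 5, ..., 75) give 15 intervals: (0, 5], (5, 10], ..., (70, 75]
edges = np.arange(0, 80, 5)
binned = pd.cut(ages, edges)

# Count the ages per interval; observed=False keeps the empty intervals too
counts = ages.groupby(binned, observed=False).count()
print(counts[counts > 0])
```

Since the intervals are open on the left, an age of exactly 0 would fall outside the first bin unless include_lowest=True is passed to pd.cut.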
A noticeable mode is seen in the 20-year-old bucket for both genders.
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'First Class':first_class_count,\
'Second Class':second_class_count,\
'Third Class':third_class_count}, index=["count"])
#Display the table
display_html(df)
#Plot the pie chart of all the passengers, grouped by class
passengers_by_class.size().plot.pie(label="Passengers, by Class", autopct='%1.1f%%', startangle=90)
plt.axis('equal')
#This command removes some floating numbers that appear
plt.tight_layout()
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'Survivors': get_count(surviving_passengers) ,\
'Victims': get_count(victim_passengers) }, index=["count"])
#Display the table
display_html(df)
#Plot the pie chart of all the passengers, grouped by survival
#I am not going to use the custom survival pie chart because I want to make the start angle = 90 degrees
passengers_by_survival.size().plot.pie(label="Passengers, by Survival", \
autopct='%1.1f%%', \
labels=['Victim', 'Survived'], \
startangle=90,\
colors=survival_colors)
plt.axis('equal')
#This command removes some floating numbers that appear
plt.tight_layout()
#Create the dataframe in preparation for HTML table display
df = pd.DataFrame({'In Group':get_count(passengers_by_companionship.get_group(False)),\
'Solo ': get_count(passengers_by_companionship.get_group(True)) }, index=["count"])
#Display the table
display_html(df)
#Plot the pie chart of all the passengers, grouped by companionship
get_count(titanic_all_data.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
label = "Companionship", \
autopct='%1.1f%%', \
startangle=90)
plt.axis('equal')
#This command removes some floating numbers that appear
plt.tight_layout()
#Compute the unspecified age parameters.
unspecified_age_by_gender = passengers_by_generation.get_group("Unspecified age").groupby('Sex')
unspecified_age_male_count = get_count(unspecified_age_by_gender.get_group('male'))
unspecified_age_female_count = get_count(unspecified_age_by_gender.get_group('female'))
#Create the dataframe of Children, Adult, Elderly and Unspecified age's gender make up.
df = pd.DataFrame({'Children':[children_male_count, children_female_count], \
'Adult':[adult_male_count, adult_female_count],\
'Elderly':[elderly_male_count, elderly_female_count],\
'Unspecified Age':[unspecified_age_male_count, unspecified_age_female_count]}, \
index=["Male Count", "Female Count"])
#Display the dataframe as an HTML table
display_html(df)
#Start a multiplot plot
fig = plt.figure(figsize=(8,12))
#Children gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Child").groupby('Sex'), 411, 'Children', 'GENDER_GRAPH')
#Adults gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Adult").groupby('Sex'), 412, 'Adults', 'GENDER_GRAPH')
#Elderly gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Elderly").groupby('Sex'), 413, 'Elderly', 'GENDER_GRAPH')
#Unspecified age gender ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Unspecified age").groupby('Sex'), 414, 'Unspecified Age', 'GENDER_GRAPH')
plt.tight_layout()
Only the children have a balanced proportion of both genders; every other group has a higher proportion of men.
#Group each generation by survival.
children_by_survival = children_passengers.groupby('Survived')
adult_by_survival = adult_passengers.groupby('Survived')
elderly_by_survival = elderly_passengers.groupby('Survived')
unspecified_age_by_survival = passengers_by_generation.get_group("Unspecified age").groupby('Survived')
#Get the count of each group's survival
children_survived = get_count(children_by_survival.get_group(1))
children_victim = get_count(children_by_survival.get_group(0))
adult_survived = get_count(adult_by_survival.get_group(1))
adult_victim = get_count(adult_by_survival.get_group(0))
elderly_survived = get_count(elderly_by_survival.get_group(1))
elderly_victim = get_count(elderly_by_survival.get_group(0))
unspecified_age_survived = get_count(unspecified_age_by_survival.get_group(1))
unspecified_age_victim = get_count(unspecified_age_by_survival.get_group(0))
#Create the dataframe of Children, Adult, Elderly and Unspecified age's survival make up.
df = pd.DataFrame({'Children':[children_survived, children_victim], \
'Adult':[adult_survived, adult_victim],\
'Elderly':[elderly_survived, elderly_victim],\
'Unspecified Age':[unspecified_age_survived, unspecified_age_victim]}, \
index=["Survivors Count", "Victims Count"])
#Display the dataframe as an HTML table
display_html(df)
#Start a multiplot plot
fig = plt.figure(figsize=(8,12))
#Children survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Child").groupby('Survived'), 411, 'Children', 'SURVIVAL_GRAPH')
#Adults survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Adult").groupby('Survived'), 412, 'Adults', 'SURVIVAL_GRAPH')
#Elderly survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Elderly").groupby('Survived'), 413, 'Elderly', 'SURVIVAL_GRAPH')
#Unspecified age survival ratio pie plot
draw_pie_subplot( passengers_by_generation.get_group("Unspecified age").groupby('Survived'), 414, 'Unspecified', 'SURVIVAL_GRAPH')
plt.tight_layout()
#####################
#Children
children_male_passengers = children_passengers[ children_passengers['Sex'] == 'male' ]
children_female_passengers = children_passengers[ children_passengers['Sex'] == 'female' ]
children_male_by_survival = children_male_passengers.groupby('Survived')
children_male_survived = get_count(children_male_by_survival.get_group(1))
children_male_victim = get_count(children_male_by_survival.get_group(0))
children_female_by_survival = children_female_passengers.groupby('Survived')
children_female_survived = get_count(children_female_by_survival.get_group(1))
children_female_victim = get_count(children_female_by_survival.get_group(0))
#####################
#Adults
adults_male_passengers = adult_passengers[ adult_passengers['Sex'] == 'male' ]
adults_female_passengers = adult_passengers[ adult_passengers['Sex'] == 'female' ]
adults_male_by_survival = adults_male_passengers.groupby('Survived')
adults_male_survived = get_count(adults_male_by_survival.get_group(1))
adults_male_victim = get_count(adults_male_by_survival.get_group(0))
adults_female_by_survival = adults_female_passengers.groupby('Survived')
adults_female_survived = get_count(adults_female_by_survival.get_group(1))
adults_female_victim = get_count(adults_female_by_survival.get_group(0))
#####################
#Elderly
elderly_male_passengers = elderly_passengers[ elderly_passengers['Sex'] == 'male' ]
elderly_female_passengers = elderly_passengers[ elderly_passengers['Sex'] == 'female' ]
elderly_male_by_survival = elderly_male_passengers.groupby('Survived')
elderly_male_survived = get_count(elderly_male_by_survival.get_group(1))
elderly_male_victim = get_count(elderly_male_by_survival.get_group(0))
elderly_female_by_survival = elderly_female_passengers.groupby('Survived')
elderly_female_survived = get_count(elderly_female_by_survival.get_group(1))
elderly_female_victim = get_count(elderly_female_by_survival.get_group(0))
#Create the dataframe of survival counts for each generation, split by gender.
df = pd.DataFrame({'Male Children': [children_male_survived, children_male_victim], \
'Female Children': [children_female_survived, children_female_victim],\
'Male Adults': [adults_male_survived, adults_male_victim],\
'Female Adults': [adults_female_survived, adults_female_victim],\
'Male Elderly': [elderly_male_survived, elderly_male_victim],\
'Female Elderly': [elderly_female_survived, elderly_female_victim] }, index = ["Survived", "Victim"])
#Display the dataframe as an HTML table
display_html(df)
#Start a multiplot plot
fig = plt.figure(figsize=(15,12))
#Children pie plot
## Male children
draw_pie_subplot( children_male_by_survival, 321, 'Children Male', 'SURVIVAL_GRAPH')
##Female Children
draw_pie_subplot( children_female_by_survival, 322, 'Children Female', 'SURVIVAL_GRAPH')
#Adults pie plot
##Male Adults
draw_pie_subplot( adults_male_by_survival, 323, 'Adults Male', 'SURVIVAL_GRAPH')
##Female Adults
draw_pie_subplot( adults_female_by_survival, 324, 'Adults Female', 'SURVIVAL_GRAPH')
#Elderly pie plot
##Male Elderly
draw_pie_subplot( elderly_male_by_survival, 325, 'Elderly Males', 'SURVIVAL_GRAPH')
##Female Elderly
draw_pie_subplot( elderly_female_by_survival, 326, 'Elderly Females', 'SURVIVAL_GRAPH')
plt.tight_layout()
I find it interesting that even male children had a noticeably lower survival rate than female children.
#The code here follows the same logic for drawing the population pyramid for all the population above.
#But here, we are going to draw three graphs, one for each class
age_bins = range(0,75,5)
first_class_grouped_female_ages = get_count(first_class_female_passengers.groupby( pd.cut( first_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
first_class_grouped_male_ages = get_count(first_class_male_passengers.groupby( pd.cut( first_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))
second_class_grouped_female_ages = get_count(second_class_female_passengers.groupby( pd.cut( second_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
second_class_grouped_male_ages = get_count(second_class_male_passengers.groupby( pd.cut( second_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))
third_class_grouped_female_ages = get_count(third_class_female_passengers.groupby( pd.cut( third_class_female_passengers["Age"], np.arange(0, 80, 5) ) ))
third_class_grouped_male_ages = get_count(third_class_male_passengers.groupby( pd.cut( third_class_male_passengers["Age"], np.arange(0, 80, 5) ) ))
fig = plt.figure(figsize=(7,7))
first_max_x = max(first_class_grouped_female_ages.max(), first_class_grouped_male_ages.max())
second_max_x = max(second_class_grouped_female_ages.max(), second_class_grouped_male_ages.max())
third_max_x = max(third_class_grouped_female_ages.max(), third_class_grouped_male_ages.max())
largest_x_value = max( [first_max_x, second_max_x, third_max_x] )
#Draw the population pyramid for the first class
fig.add_subplot(311)
plot_population_pyramid(age_bins, \
"1st class Male Population", \
first_class_grouped_male_ages, \
"1st class Female Population", \
first_class_grouped_female_ages, \
largest_x_value)
#Draw the population pyramid for the second class
fig.add_subplot(312)
plot_population_pyramid(age_bins, \
"2nd class Male Population", \
second_class_grouped_male_ages, \
"2nd class Female Population", \
second_class_grouped_female_ages, \
largest_x_value)
#Draw the population pyramid for the third class
fig.add_subplot(313)
plot_population_pyramid(age_bins, \
"3rd class Male Population", \
third_class_grouped_male_ages, \
"3rd class Female Population", \
third_class_grouped_female_ages, \
largest_x_value)
The third class' population pyramid is heavily skewed towards males, with the largest spikes occurring between 15 and 30 years old.
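As a rough sketch of how that skew could be put into numbers rather than read off the chart: the snippet below cross-tabulates age band against gender and computes the share of males among the 15 to 30 year olds. The `toy_third_class` frame and its values are made up for illustration; only the column names `Sex` and `Age` and the 5-year bins come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the third class passengers (hypothetical values, not the real data)
toy_third_class = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'male', 'female', 'male', 'female'],
    'Age': [22.0, 19.0, 28.0, 24.0, 35.0, 4.0, np.nan, 41.0],
})

# Same 5-year, right-closed bins as the pyramids above: (0, 5], (5, 10], ...
age_band = pd.cut(toy_third_class['Age'], np.arange(0, 80, 5))

# Rows = age band, columns = gender; passengers with a missing age are left out
counts_by_band = pd.crosstab(age_band, toy_third_class['Sex'])

# Share of males among the passengers whose known age falls in (15, 30]
young = toy_third_class[(toy_third_class['Age'] > 15) & (toy_third_class['Age'] <= 30)]
young_male_share = (young['Sex'] == 'male').mean()
```

On the real data, `toy_third_class` would be replaced by `third_class_passengers`; a share well above 0.5 would back up what the pyramid suggests.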
#Create the dataframe of each class' gender make up.
df = pd.DataFrame({'First Class': [first_class_male_count, first_class_female_count], \
'Second Class': [second_class_male_count, second_class_female_count],\
'Third Class': [third_class_male_count, third_class_female_count]},\
index = ["Male", "Female"])
#Display the dataframe as an HTML table
display_html(df)
fig = plt.figure(figsize=(10,7))
#Plot the first class
draw_pie_subplot( first_class_passengers.groupby('Sex'), 131, 'First Class', 'GENDER_GRAPH')
#Plot the second class
draw_pie_subplot( second_class_passengers.groupby('Sex'), 132, 'Second Class', 'GENDER_GRAPH')
#Plot the third class
draw_pie_subplot( third_class_passengers.groupby('Sex'), 133, 'Third Class', 'GENDER_GRAPH')
plt.tight_layout()
first_class_unspecified_age = first_class_count - (first_class_children_count + first_class_adult_count + first_class_elderly_count)
second_class_unspecified_age = second_class_count - (second_class_children_count + second_class_adult_count + second_class_elderly_count)
third_class_unspecified_age = third_class_count - (third_class_children_count + third_class_adult_count + third_class_elderly_count)
#Create the dataframe of each class' make up by generation (Children, Adults, Elderly and Unspecified Age).
df = pd.DataFrame({'Children': [first_class_children_count, second_class_children_count, third_class_children_count], \
'Adults': [first_class_adult_count, second_class_adult_count, third_class_adult_count],\
'Elderly': [first_class_elderly_count, second_class_elderly_count, third_class_elderly_count],\
'Unspecified Age':[first_class_unspecified_age, second_class_unspecified_age, third_class_unspecified_age]},\
index = ["First Class", "Second Class", "Third Class"])
#Display the dataframe as an HTML table
display_html(df)
fig = plt.figure(figsize=(9,12))
#First Class
draw_pie_subplot(first_class_passengers.groupby('Generation'), 311, 'First Class by Generation', 'DEFAULT')
#Second Class
draw_pie_subplot(second_class_passengers.groupby('Generation'), 312, 'Second Class by Generation', 'DEFAULT')
#Third Class
draw_pie_subplot(third_class_passengers.groupby('Generation'), 313, 'Third Class by Generation', 'DEFAULT')
plt.tight_layout()
It might be worth noting that the majority of the passengers of unspecified age come from the third class.
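A minimal sketch of how this could be checked directly, without going through the generation counts: count the blank `Age` cells per class. The `toy` frame is invented for illustration; `Pclass` and `Age` are the dataset's own column names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, only meant to show the mechanics
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3, 3],
    'Age': [38.0, np.nan, 27.0, np.nan, np.nan, 22.0, np.nan],
})

# True where the age is missing (blank cell / NaN)
age_is_missing = toy['Age'].isnull()

# Number of passengers of unspecified age in each class
missing_age_by_class = age_is_missing.groupby(toy['Pclass']).sum()

# Fraction of each class whose age is unspecified
share_missing_by_class = age_is_missing.groupby(toy['Pclass']).mean()
```

The second series matters because the third class is also the largest one: a high count alone does not say whether ages are missing more often there.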
#Separate each class passengers by survival
first_class_by_survival = first_class_passengers.groupby("Survived")
second_class_by_survival = second_class_passengers.groupby("Survived")
third_class_by_survival = third_class_passengers.groupby("Survived")
#Get the count of both survivors and victims, by each class
first_class_surviving_count = get_count(first_class_by_survival.get_group(1))
first_class_victims_count = get_count(first_class_by_survival.get_group(0))
second_class_surviving_count = get_count(second_class_by_survival.get_group(1))
second_class_victims_count = get_count(second_class_by_survival.get_group(0))
third_class_surviving_count = get_count(third_class_by_survival.get_group(1))
third_class_victims_count = get_count(third_class_by_survival.get_group(0))
#Create the dataframe of survivors and victims, by class.
df = pd.DataFrame({'Survived': [first_class_surviving_count, second_class_surviving_count, third_class_surviving_count], \
'Died': [first_class_victims_count, second_class_victims_count, third_class_victims_count]},\
index = ["First Class", "Second Class", "Third Class"])
#Display the dataframe as an HTML table
display_html(df)
fig = plt.figure(figsize=(9,12))
#First Class
draw_pie_subplot( first_class_passengers.groupby('Survived'), 311, 'First Class by Survival', 'SURVIVAL_GRAPH')
#Second Class
draw_pie_subplot( second_class_passengers.groupby('Survived') , 312, 'Second Class by Survival', 'SURVIVAL_GRAPH')
#Third Class
draw_pie_subplot( third_class_passengers.groupby('Survived') , 313, 'Third Class by Survival', 'SURVIVAL_GRAPH')
plt.tight_layout()
Just visually, one can see that the better the class, the higher the chances of survival. But this is only a superficial reading of the pie charts. Other factors may be involved as well, like the total number of passengers and the gender makeup of each class (the percentage of males in the third class was much higher than in the other two classes).
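One way to separate the effect of class from the gender makeup is to compute the survival rate per class and gender at the same time. The sketch below does that on a made-up `toy` frame (the values are hypothetical; `Pclass`, `Sex` and `Survived` are the dataset's column names). Since `Survived` is 0 or 1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# Hypothetical passengers: first class is half female, third class mostly male
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate per class, ignoring gender
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

# Survival rate per class AND gender: rows = class, columns = gender
rate_by_class_and_sex = toy.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack('Sex')
```

In this toy example the females of both classes survive at the same rate, so part of the gap between the classes comes from the third class having more males. Whether that holds for the real data is exactly what the per-gender breakdowns that follow are meant to show.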
#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_children_by_survival = first_class_children_passengers.groupby("Survived")
first_class_children_male_by_survival = first_class_children_male_passengers.groupby("Survived")
first_class_children_female_by_survival = first_class_children_female_passengers.groupby("Survived")
#####Second Class
second_class_children_by_survival = second_class_children_passengers.groupby("Survived")
second_class_children_male_by_survival = second_class_children_male_passengers.groupby("Survived")
second_class_children_female_by_survival = second_class_children_female_passengers.groupby("Survived")
#####Third Class
third_class_children_by_survival = third_class_children_passengers.groupby("Survived")
third_class_children_male_by_survival = third_class_children_male_passengers.groupby("Survived")
third_class_children_female_by_survival = third_class_children_female_passengers.groupby("Survived")
#Get the count of each category created in the cell above
first_class_children_surviving_count = get_surviving_count( first_class_children_by_survival)
first_class_children_victims_count = get_victim_count(first_class_children_by_survival)
first_class_children_male_surviving_count = get_surviving_count( first_class_children_male_by_survival)
first_class_children_male_victims_count = get_victim_count(first_class_children_male_by_survival)
first_class_children_female_surviving_count = get_surviving_count(first_class_children_female_by_survival)
first_class_children_female_victims_count = get_victim_count(first_class_children_female_by_survival)
second_class_children_surviving_count = get_surviving_count( second_class_children_by_survival)
second_class_children_victims_count = get_victim_count(second_class_children_by_survival)
second_class_children_male_surviving_count = get_surviving_count( second_class_children_male_by_survival)
second_class_children_male_victims_count = get_victim_count(second_class_children_male_by_survival)
second_class_children_female_surviving_count = get_surviving_count(second_class_children_female_by_survival)
second_class_children_female_victims_count = get_victim_count(second_class_children_female_by_survival)
third_class_children_surviving_count = get_surviving_count( third_class_children_by_survival)
third_class_children_victims_count = get_victim_count(third_class_children_by_survival)
third_class_children_male_surviving_count = get_surviving_count( third_class_children_male_by_survival)
third_class_children_male_victims_count = get_victim_count(third_class_children_male_by_survival)
third_class_children_female_surviving_count = get_surviving_count(third_class_children_female_by_survival)
third_class_children_female_victims_count = get_victim_count(third_class_children_female_by_survival)
#Create the dataframe of the classes' children by survival
df = pd.DataFrame({'Survived': [first_class_children_surviving_count, second_class_children_surviving_count, third_class_children_surviving_count], \
'Died': [first_class_children_victims_count, second_class_children_victims_count, third_class_children_victims_count]},\
index = ["First Class", "Second Class", "Third Class"])
#Display the dataframe as an HTML table
display_html(df)
#Create the dataframe of the classes' children by gender and survival
df = pd.DataFrame({'First Class': [first_class_children_male_surviving_count, \
first_class_children_male_victims_count, \
first_class_children_female_surviving_count, \
first_class_children_female_victims_count], \
'Second Class': [second_class_children_male_surviving_count, \
second_class_children_male_victims_count, \
second_class_children_female_surviving_count, \
second_class_children_female_victims_count],\
'Third Class': [third_class_children_male_surviving_count, \
third_class_children_male_victims_count, \
third_class_children_female_surviving_count, \
third_class_children_female_victims_count]},\
index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )
#Display the dataframe as an HTML table
display_html(df)
fig = plt.figure(figsize=(15,12))
#First Class
draw_pie_subplot( first_class_children_by_survival , 331, 'First Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_children_male_by_survival , 332, 'First Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_children_female_by_survival , 333, 'First Class Female Children', 'SURVIVAL_GRAPH')
#Second Class
draw_pie_subplot( second_class_children_by_survival , 334, 'Second Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_children_male_by_survival , 335, 'Second Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_children_female_by_survival , 336, 'Second Class Female Children', 'SURVIVAL_GRAPH')
#Third Class
draw_pie_subplot( third_class_children_by_survival , 337, 'Third Class Children, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_children_male_by_survival , 338, 'Third Class Male Children', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_children_female_by_survival , 339, 'Third Class Female Children', 'SURVIVAL_GRAPH')
plt.tight_layout()
children_by_class = children_passengers.groupby('Pclass')
children_survived_by_class = children_by_class.apply(lambda t : t[t['Survived'] == 1]).groupby('Pclass')
children_drowned_by_class = children_by_class.apply(lambda t : t[t['Survived'] == 0]).groupby('Pclass')
fig = plt.figure(figsize=(9,5))
draw_pie_subplot( children_survived_by_class, 121, 'Surviving Children By Class', 'DEFAULT')
draw_pie_subplot( children_drowned_by_class, 122, 'Victim Children By Class', 'DEFAULT')
plt.tight_layout()
#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_adult_by_survival = first_class_adult_passengers.groupby("Survived")
first_class_adult_male_by_survival = first_class_adult_male_passengers.groupby("Survived")
first_class_adult_female_by_survival = first_class_adult_female_passengers.groupby("Survived")
#####Second Class
second_class_adult_by_survival = second_class_adult_passengers.groupby("Survived")
second_class_adult_male_by_survival = second_class_adult_male_passengers.groupby("Survived")
second_class_adult_female_by_survival = second_class_adult_female_passengers.groupby("Survived")
#####Third Class
third_class_adult_by_survival = third_class_adult_passengers.groupby("Survived")
third_class_adult_male_by_survival = third_class_adult_male_passengers.groupby("Survived")
third_class_adult_female_by_survival = third_class_adult_female_passengers.groupby("Survived")
#Get the count of each category created in the cell above
first_class_adult_surviving_count = get_surviving_count( first_class_adult_by_survival)
first_class_adult_victims_count = get_victim_count(first_class_adult_by_survival)
first_class_adult_male_surviving_count = get_surviving_count( first_class_adult_male_by_survival)
first_class_adult_male_victims_count = get_victim_count(first_class_adult_male_by_survival)
first_class_adult_female_surviving_count = get_surviving_count(first_class_adult_female_by_survival)
first_class_adult_female_victims_count = get_victim_count(first_class_adult_female_by_survival)
second_class_adult_surviving_count = get_surviving_count( second_class_adult_by_survival)
second_class_adult_victims_count = get_victim_count(second_class_adult_by_survival)
second_class_adult_male_surviving_count = get_surviving_count( second_class_adult_male_by_survival)
second_class_adult_male_victims_count = get_victim_count(second_class_adult_male_by_survival)
second_class_adult_female_surviving_count = get_surviving_count(second_class_adult_female_by_survival)
second_class_adult_female_victims_count = get_victim_count(second_class_adult_female_by_survival)
third_class_adult_surviving_count = get_surviving_count( third_class_adult_by_survival)
third_class_adult_victims_count = get_victim_count(third_class_adult_by_survival)
third_class_adult_male_surviving_count = get_surviving_count( third_class_adult_male_by_survival)
third_class_adult_male_victims_count = get_victim_count(third_class_adult_male_by_survival)
third_class_adult_female_surviving_count = get_surviving_count(third_class_adult_female_by_survival)
third_class_adult_female_victims_count = get_victim_count(third_class_adult_female_by_survival)
#Create the dataframe of the classes' adult by survival
df = pd.DataFrame({'Survived': [first_class_adult_surviving_count, second_class_adult_surviving_count, third_class_adult_surviving_count], \
'Died': [first_class_adult_victims_count, second_class_adult_victims_count, third_class_adult_victims_count]},\
index = ["First Class", "Second Class", "Third Class"])
#Display the dataframe as an HTML table
display_html(df)
#Create the dataframe of the classes' adult by gender and survival
df = pd.DataFrame({'First Class': [first_class_adult_male_surviving_count, \
first_class_adult_male_victims_count, \
first_class_adult_female_surviving_count, \
first_class_adult_female_victims_count], \
'Second Class': [second_class_adult_male_surviving_count, \
second_class_adult_male_victims_count, \
second_class_adult_female_surviving_count, \
second_class_adult_female_victims_count],\
'Third Class': [third_class_adult_male_surviving_count, \
third_class_adult_male_victims_count, \
third_class_adult_female_surviving_count, \
third_class_adult_female_victims_count]},\
index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )
#Display the dataframe as an HTML table
display_html(df)
fig = plt.figure(figsize=(15,12))
#First Class
draw_pie_subplot( first_class_adult_by_survival , 331, 'First Class Adult, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_adult_male_by_survival , 332, 'First Class Male Adult', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_adult_female_by_survival , 333, 'First Class Female Adult', 'SURVIVAL_GRAPH')
#Second Class
draw_pie_subplot( second_class_adult_by_survival , 334, 'Second Class Adult, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_adult_male_by_survival , 335, 'Second Class Male Adult', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_adult_female_by_survival , 336, 'Second Class Female Adult', 'SURVIVAL_GRAPH')
#Third Class
draw_pie_subplot( third_class_adult_by_survival , 337, 'Third Class Adult, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_adult_male_by_survival , 338, 'Third Class Male Adult', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_adult_female_by_survival , 339, 'Third Class Female Adult', 'SURVIVAL_GRAPH')
plt.tight_layout()
Second class adult males, to my surprise, had the toughest luck. I was expecting the third class males to fare the worst.
adult_by_class = adult_passengers.groupby('Pclass')
adult_survived_by_class = adult_by_class.apply(lambda t : t[t['Survived'] == 1]).groupby('Pclass')
adult_drowned_by_class = adult_by_class.apply(lambda t : t[t['Survived'] == 0]).groupby('Pclass')
fig = plt.figure(figsize=(9,5))
draw_pie_subplot( adult_survived_by_class, 121, 'Surviving Adult By Class', 'DEFAULT')
draw_pie_subplot( adult_drowned_by_class, 122, 'Victim Adult By Class', 'DEFAULT')
plt.tight_layout()
#First, group each class by generation survival, AND by (generation and gender)'s survival
#####First Class
first_class_elderly_by_survival = first_class_elderly_passengers.groupby("Survived")
first_class_elderly_male_by_survival = first_class_elderly_male_passengers.groupby("Survived")
first_class_elderly_female_by_survival = first_class_elderly_female_passengers.groupby("Survived")
#####Second Class
second_class_elderly_by_survival = second_class_elderly_passengers.groupby("Survived")
second_class_elderly_male_by_survival = second_class_elderly_male_passengers.groupby("Survived")
second_class_elderly_female_by_survival = second_class_elderly_female_passengers.groupby("Survived")
#####Third Class
third_class_elderly_by_survival = third_class_elderly_passengers.groupby("Survived")
third_class_elderly_male_by_survival = third_class_elderly_male_passengers.groupby("Survived")
third_class_elderly_female_by_survival = third_class_elderly_female_passengers.groupby("Survived")
#Get the count of each category created in the cell above
first_class_elderly_surviving_count = get_surviving_count( first_class_elderly_by_survival)
first_class_elderly_victims_count = get_victim_count(first_class_elderly_by_survival)
first_class_elderly_male_surviving_count = get_surviving_count( first_class_elderly_male_by_survival)
first_class_elderly_male_victims_count = get_victim_count(first_class_elderly_male_by_survival)
first_class_elderly_female_surviving_count = get_surviving_count(first_class_elderly_female_by_survival)
first_class_elderly_female_victims_count = get_victim_count(first_class_elderly_female_by_survival)
second_class_elderly_surviving_count = get_surviving_count( second_class_elderly_by_survival)
second_class_elderly_victims_count = get_victim_count(second_class_elderly_by_survival)
second_class_elderly_male_surviving_count = get_surviving_count( second_class_elderly_male_by_survival)
second_class_elderly_male_victims_count = get_victim_count(second_class_elderly_male_by_survival)
second_class_elderly_female_surviving_count = get_surviving_count(second_class_elderly_female_by_survival)
second_class_elderly_female_victims_count = get_victim_count(second_class_elderly_female_by_survival)
third_class_elderly_surviving_count = get_surviving_count( third_class_elderly_by_survival)
third_class_elderly_victims_count = get_victim_count(third_class_elderly_by_survival)
third_class_elderly_male_surviving_count = get_surviving_count( third_class_elderly_male_by_survival)
third_class_elderly_male_victims_count = get_victim_count(third_class_elderly_male_by_survival)
third_class_elderly_female_surviving_count = get_surviving_count(third_class_elderly_female_by_survival)
third_class_elderly_female_victims_count = get_victim_count(third_class_elderly_female_by_survival)
#Create the dataframe of the classes' elderly by survival
df = pd.DataFrame({'Survived': [first_class_elderly_surviving_count, second_class_elderly_surviving_count, third_class_elderly_surviving_count], \
'Died': [first_class_elderly_victims_count, second_class_elderly_victims_count, third_class_elderly_victims_count]},\
index = ["First Class", "Second Class", "Third Class"])
#Display the dataframe as an HTML table
display_html(df)
#Create the dataframe of the classes' elderly by gender and survival
df = pd.DataFrame({'First Class': [first_class_elderly_male_surviving_count, \
first_class_elderly_male_victims_count, \
first_class_elderly_female_surviving_count, \
first_class_elderly_female_victims_count], \
'Second Class': [second_class_elderly_male_surviving_count, \
second_class_elderly_male_victims_count, \
second_class_elderly_female_surviving_count, \
second_class_elderly_female_victims_count],\
'Third Class': [third_class_elderly_male_surviving_count, \
third_class_elderly_male_victims_count, \
third_class_elderly_female_surviving_count, \
third_class_elderly_female_victims_count]},\
index = ["Male Survivor", "Male Victim", "Female Survivor", "Female Victim"] )
#Display the dataframe as an HTML table
display_html(df)
fig = plt.figure(figsize=(15,12))
#First Class
draw_pie_subplot( first_class_elderly_by_survival , 331, 'First Class Elderly, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_elderly_male_by_survival , 332, 'First Class Male Elderly', 'SURVIVAL_GRAPH')
draw_pie_subplot( first_class_elderly_female_by_survival , 333, 'First Class Female Elderly', 'SURVIVAL_GRAPH')
#Second Class
draw_pie_subplot( second_class_elderly_by_survival , 334, 'Second Class Elderly, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_elderly_male_by_survival , 335, 'Second Class Male Elderly', 'SURVIVAL_GRAPH')
draw_pie_subplot( second_class_elderly_female_by_survival , 336, 'Second Class Female Elderly', 'SURVIVAL_GRAPH')
#Third Class
draw_pie_subplot( third_class_elderly_by_survival , 337, 'Third Class Elderly, Total', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_elderly_male_by_survival , 338, 'Third Class Male Elderly', 'SURVIVAL_GRAPH')
draw_pie_subplot( third_class_elderly_female_by_survival , 339, 'Third Class Female Elderly', 'SURVIVAL_GRAPH')
plt.tight_layout()
elderly_by_class = elderly_passengers.groupby('Pclass')
elderly_survived_by_class = elderly_by_class.apply(lambda t : t[t['Survived'] == 1]).groupby('Pclass')
elderly_drowned_by_class = elderly_by_class.apply(lambda t : t[t['Survived'] == 0]).groupby('Pclass')
fig = plt.figure(figsize=(9,5))
draw_pie_subplot( elderly_survived_by_class, 121, 'Surviving Elderly By Class', 'DEFAULT')
draw_pie_subplot( elderly_drowned_by_class, 122, 'Victim Elderly By Class', 'DEFAULT')
plt.tight_layout()
Because of the tragedy, I feel that we should look again at the survival of the individual. Some of the data presented here may be redundant with what has already been presented above. But I think we should still re-examine survival from different angles.
age_bins = range(0,75,5)
survivors_ages = get_count(surviving_passengers.groupby( pd.cut( surviving_passengers["Age"], np.arange(0, 80, 5) ) ))
victims_ages = get_count(victim_passengers.groupby( pd.cut( victim_passengers["Age"], np.arange(0, 80, 5) ) ))
largest_x_value = max(survivors_ages.max(), victims_ages.max() )
plot_population_pyramid(age_bins, "Survivors", survivors_ages, "Victims", victims_ages, largest_x_value)
surviving_males = surviving_passengers[ surviving_passengers['Sex'] == 'male']
victim_males = victim_passengers[ victim_passengers['Sex'] == 'male']
age_bins = range(0,75,5)
surviving_males_ages = get_count(surviving_males.groupby( pd.cut( surviving_males["Age"], np.arange(0, 80, 5) ) ))
victim_males_ages = get_count(victim_males.groupby( pd.cut( victim_males["Age"], np.arange(0, 80, 5) ) ))
largest_x_value = max(surviving_males_ages.max(), victim_males_ages.max() )
plot_population_pyramid(age_bins, "Surviving Males", surviving_males_ages, "Victim Males", victim_males_ages, largest_x_value)
surviving_females = surviving_passengers[ surviving_passengers['Sex'] == 'female']
victim_females = victim_passengers[ victim_passengers['Sex'] == 'female']
age_bins = range(0,75,5)
surviving_females_ages = get_count(surviving_females.groupby( pd.cut( surviving_females["Age"], np.arange(0, 80, 5) ) ))
victim_females_ages = get_count(victim_females.groupby( pd.cut( victim_females["Age"], np.arange(0, 80, 5) ) ))
largest_x_value = max(surviving_females_ages.max(), victim_females_ages.max() )
plot_population_pyramid(age_bins, \
"Surviving Females", \
surviving_females_ages, \
"Victim Females", \
victim_females_ages, \
largest_x_value)
fig = plt.figure()
fig.add_subplot(121)
titanic_all_data['Survived'].groupby(titanic_all_data['Pclass']).sum().plot.pie(label = 'Total Passengers Saved, by Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(122)
titanic_all_data['Survived'].groupby(titanic_all_data['Pclass']).sum().plot.bar(label = 'Total Passengers Saved, by Class')
plt.tight_layout()
It looks much fairer than I expected. Of course one can argue here that the majority of women and children of the upper classes were already saved, and this is why the pie chart looks evenly distributed; but I would still say that it is not as bad as I thought before I plotted these graphs.
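A caveat worth keeping in mind when reading this pie chart: it shows each class' share of all survivors, which is not the same thing as each class' survival rate. The sketch below contrasts the two on a made-up `toy` frame (hypothetical values; `Pclass` and `Survived` are the dataset's column names).

```python
import pandas as pd

# Hypothetical passengers: the third class is three times larger than the others
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# What the pie chart shows: each class' slice of the total number of survivors
survivors = toy.groupby('Pclass')['Survived'].sum()
share_of_survivors = survivors / float(survivors.sum())

# What an individual passenger faced: the fraction of each class that survived
survival_rate = toy.groupby('Pclass')['Survived'].mean()
```

Here every class gets an equal slice of the pie, yet only a third of the toy third class survives. The real data may or may not behave this way, so the even-looking pie should be read next to the per-class survival charts above.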
fig = plt.figure(figsize=(9,5))
fig.add_subplot(121)
titanic_all_data['Survived'].groupby(titanic_all_data['Generation']).sum().plot.pie(label = 'Total Passengers Saved, by Generation',\
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(122)
titanic_all_data['Survived'].groupby(titanic_all_data['Generation']).sum().plot.bar(label = 'Total Passengers Saved, by Generation')
plt.tight_layout()
fig = plt.figure(figsize=(9,5))
fig.add_subplot(121)
victim_passengers['Survived'].groupby(victim_passengers['Generation']).count().plot.pie(label = 'Victims, by Generation', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(122)
victim_passengers['Survived'].groupby(victim_passengers['Generation']).count().plot.bar(label = 'Victims, by Generation')
plt.tight_layout()
fig = plt.figure(figsize=(15,6))
fig.add_subplot(131)
#First class plot
first_class_survivors['Survived'].groupby(first_class_survivors['Generation']).count().plot.pie(label = 'First Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(132)
#Second class plot
second_class_survivors['Survived'].groupby(second_class_survivors['Generation']).count().plot.pie(label = 'Second Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(133)
#Third class plot
third_class_survivors['Survived'].groupby(third_class_survivors['Generation']).count().plot.pie(label = 'Third Class', \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
fig = plt.figure(figsize=(15,6))
fig.add_subplot(131)
#First class plot
first_class_victims['Survived'].groupby(first_class_victims['Generation']).count().plot.pie(label = 'First Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(132)
#Second class plot
second_class_victims['Survived'].groupby(second_class_victims['Generation']).count().plot.pie(label = 'Second Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(133)
#Third class plot
third_class_victims['Survived'].groupby(third_class_victims['Generation']).count().plot.pie(label = 'Third Class', \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
Now, after gaining some intuition about the data, one final parameter is left to explore so we can start answering our first question: companionship.
solo_passengers_count = get_count(passengers_by_companionship.get_group(True))
group_passengers_count = get_count(passengers_by_companionship.get_group(False))
df = pd.DataFrame({'Count': [solo_passengers_count, group_passengers_count]}, index = ['Solo', 'Group'])
display_html(df)
get_count(titanic_all_data.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
label = "Companionship", \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
group_travelers_by_class = passengers_by_class.apply(lambda t : t[t['isSolo'] == False]).groupby('Pclass')
solo_travelers_by_class = passengers_by_class.apply(lambda t : t[t['isSolo'] == True]).groupby('Pclass')
fig = plt.figure(figsize=(9,5))
draw_pie_subplot( group_travelers_by_class, 121, 'Group Travelers By Class', 'DEFAULT')
draw_pie_subplot( solo_travelers_by_class, 122, 'Solo Travelers By Class', 'DEFAULT')
plt.tight_layout()
solo_passengers_count = get_count(passengers_by_companionship.get_group(True))
group_passengers_count = get_count(passengers_by_companionship.get_group(False))
first_class_solo_passengers_count = get_count(first_class_passengers[ first_class_passengers['isSolo'] == True])
first_class_group_passengers_count = get_count(first_class_passengers[ first_class_passengers['isSolo'] == False])
second_class_solo_passengers_count = get_count(second_class_passengers[ second_class_passengers['isSolo'] == True])
second_class_group_passengers_count = get_count(second_class_passengers[ second_class_passengers['isSolo'] == False])
third_class_solo_passengers_count = get_count(third_class_passengers[ third_class_passengers['isSolo'] == True])
third_class_group_passengers_count = get_count(third_class_passengers[ third_class_passengers['isSolo'] == False])
df = pd.DataFrame({'First Class': [first_class_solo_passengers_count, first_class_group_passengers_count], \
'Second Class': [second_class_solo_passengers_count, second_class_group_passengers_count], \
'Third Class': [third_class_solo_passengers_count, third_class_group_passengers_count]}, \
index = ['Solo', 'Group'])
display_html(df)
fig = plt.figure(figsize=(12,5))
fig.add_subplot(131)
get_count(first_class_passengers.groupby('isSolo')).plot.pie(label = 'First Class', \
labels=['Group', 'Solo'],\
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(132)
get_count(second_class_passengers.groupby('isSolo')).plot.pie(label = 'Second Class', \
labels=['Group', 'Solo'],\
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(133)
get_count(third_class_passengers.groupby('isSolo')).plot.pie(label = 'Third Class', \
labels=['Group', 'Solo'],\
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
age_bins = range(16,76,4)
solo_female_passengers = female_passengers[ female_passengers['isSolo'] == True ]
solo_male_passengers = male_passengers[ male_passengers['isSolo'] == True ]
grouped_female_ages = get_count(solo_female_passengers.groupby( pd.cut( solo_female_passengers["Age"], np.arange(16, 80, 4) ) ))
grouped_male_ages = get_count(solo_male_passengers.groupby( pd.cut( solo_male_passengers["Age"], np.arange(16, 80, 4) ) ))
largest_x_value = max(grouped_female_ages.max(), grouped_male_ages.max())
plot_population_pyramid(age_bins, \
"Solo Male Population", \
grouped_male_ages, \
"Solo Female Population", \
grouped_female_ages, \
largest_x_value)
I got curious about solo women traveling on the ship, since the accident happened during a time when the culture was more conservative than nowadays.
get_count(passengers_by_gender.get_group('female').groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
label = " Women Companionship", \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
group_women_count = get_count(passengers_by_gender.get_group('female').groupby('isSolo'))[0]
solo_women_count = get_count(passengers_by_gender.get_group('female').groupby('isSolo'))[1]
print("Women traveling alone: " + str(solo_women_count) + " Percentage: " + format((float(solo_women_count) / female_count) * 100.0, '.2f') + "%")
print("Women traveling in a group: " + str(group_women_count) + " Percentage: " + format((float(group_women_count) / female_count) * 100.0, '.2f') + "%")
fig = plt.figure(figsize=(12,5))
fig.add_subplot(131)
get_count(first_class_female_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
label = 'First Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(132)
get_count(second_class_female_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
label = 'Second Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(133)
get_count(third_class_female_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
label = 'Third Class', \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
first_class_solo_women_passengers = solo_female_passengers[ solo_female_passengers['Pclass'] == 1 ]
second_class_solo_women_passengers = solo_female_passengers[ solo_female_passengers['Pclass'] == 2 ]
third_class_solo_women_passengers = solo_female_passengers[ solo_female_passengers['Pclass'] == 3 ]
fig = plt.figure(figsize=(7,5))
fig.add_subplot(121)
get_count(solo_female_passengers.groupby('Pclass')).plot.pie(label = 'Solo Women by Class',\
labels = ['1st Class', '2nd Class', '3rd Class'],\
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(122)
get_count(solo_female_passengers.groupby('Pclass')).plot.bar()
plt.tight_layout()
Visually, the class does not look like it has an effect on the women's traveling companionship pattern.
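A visual impression like this could be backed up with a chi-square test of independence between class and companionship. The sketch below uses `scipy.stats.chi2_contingency` on a table of made-up counts (`toy_counts` is hypothetical, not taken from the dataset); on the real data the table would hold the solo and group counts of women in each class.

```python
import numpy as np
import scipy.stats

# Hypothetical counts of women: rows = (Group, Solo), columns = (1st, 2nd, 3rd class)
toy_counts = np.array([[60, 44, 84],
                       [34, 32, 60]])

# Null hypothesis: companionship (solo or group) is independent of the class
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(toy_counts)

# 'expected' holds the counts one would see if class had no effect at all;
# a large p_value means the observed table is compatible with independence
class_seems_unrelated = p_value > 0.05
```

The 0.05 threshold is only the conventional choice, and a large p-value does not prove that class plays no role; it only means the data give no evidence of an effect.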
Now, let us have a closer look at the male's traveling companionship; just to make the picture complete:
get_count(passengers_by_gender.get_group('male').groupby('isSolo')).plot.pie(labels=['Group', 'Solo'], \
label = " Men Companionship", \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
group_men_count = get_count(passengers_by_gender.get_group('male').groupby('isSolo'))[0]
solo_men_count = get_count(passengers_by_gender.get_group('male').groupby('isSolo'))[1]
print "Men traveling alone: " + str(solo_men_count) + " Percentage: " + str ( format((float(solo_men_count) / male_count )*100.0, '.2f')) + "%"
print "Men traveling in a group: "+ str(group_men_count) + " Percentage: " + str( format((float(group_men_count) / male_count )*100.0, '.2f')) + "%"
fig = plt.figure(figsize=(12,5))
fig.add_subplot(131)
get_count(first_class_male_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
label = 'First Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(132)
get_count(second_class_male_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
label = 'Second Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(133)
get_count(third_class_male_passengers.groupby('isSolo')).plot.pie(labels=['Group', 'Solo'],\
label = 'Third Class', \
autopct='%1.1f%%')
plt.axis('equal')
plt.tight_layout()
first_class_solo_men_passengers = solo_male_passengers[ solo_male_passengers['Pclass'] == 1 ]
second_class_solo_men_passengers = solo_male_passengers[ solo_male_passengers['Pclass'] == 2 ]
third_class_solo_men_passengers = solo_male_passengers[ solo_male_passengers['Pclass'] == 3 ]
fig = plt.figure()
fig.add_subplot(121)
get_count(solo_male_passengers.groupby('Pclass')).plot.pie(label = 'Solo Men by Class', \
autopct='%1.1f%%')
plt.axis('equal')
fig.add_subplot(122)
get_count(solo_male_passengers.groupby('Pclass')).plot.bar()
plt.tight_layout()
Now, down to the visualization of our first question: what do the chances of survival look like when seen through the lens of companionship?
fig = plt.figure(figsize=(9,6))
draw_pie_subplot( surviving_passengers.groupby(surviving_passengers['isSolo']) , 121, 'Solo Survival', 'SURVIVAL_GRAPH')
draw_pie_subplot( victim_passengers.groupby(victim_passengers['isSolo']) , 122, 'Group Survival', 'SURVIVAL_GRAPH')
plt.tight_layout()
titanic_all_data['TotalCompanions'].describe()
group_passengers = passengers_by_companionship.get_group(False)
solo_passengers = passengers_by_companionship.get_group(True)
group_passengers['TotalCompanions'].describe()
It is worth noting here that the minimum total number of companions is 0, even though we are examining the group passengers. This is due to the choice of including children with the group travelers.
The statistics for solo travelers will not be computed, since every value (except the count) is equal to zero.
group_passengers['Survived'].describe()
solo_passengers['Survived'].describe()
The contingency table will be built so we can examine the relationship and perform the hypothesis testing
#Expected frequency: Total survival, total drowning
#Group
group_passengers_survived_count = get_count(group_passengers.groupby('Survived').get_group(1))
group_passengers_victims_count = get_count(group_passengers.groupby('Survived').get_group(0))
#Solo
solo_passengers_survived_count = get_count(solo_passengers.groupby('Survived').get_group(1))
solo_passengers_victims_count = get_count(solo_passengers.groupby('Survived').get_group(0))
# passengers_by_survival
survivors_count_list = [group_passengers_survived_count, solo_passengers_survived_count, total_survivors_count]
victims_count_list = [group_passengers_victims_count, solo_passengers_victims_count, total_victims_count]
total_count_list = [get_count(group_passengers), get_count(solo_passengers), sample_size]
#Had to pass the - redundant - column argument to skip Pandas default ordering
contingency_table = pd.DataFrame({'Survived': survivors_count_list , \
'Victims': victims_count_list , \
'Total, Group': total_count_list},\
index = ['Group', 'Solo', 'Total, State'], columns = [ 'Survived', 'Victims', 'Total, Group'])
display_html(contingency_table)
Now, all is set to calculate the χ2
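As a minimal sketch of what the χ2 statistic measures, the snippet below computes a test of independence by hand for a 2×2 table of observed counts. The function names and the counts are hypothetical and are not taken from the Titanic data. The sketch assumes that only the observed cells are passed in, without the 'Total' row and column, and it applies no continuity correction, whereas `scipy.stats.chi2_contingency` applies Yates' correction by default on 2×2 tables.

```python
import math

def chi_square_independence(observed):
    """Return (statistic, degrees_of_freedom, expected) for a table of observed counts.

    `observed` is a list of rows holding only the observed cells, no marginal totals.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = float(sum(row_totals))
    # Expected count of a cell under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))
    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom, expected

def p_value_one_degree_of_freedom(statistic):
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper tail probability is erfc(sqrt(statistic / 2))
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical counts (NOT the Titanic values): rows = Group, Solo; columns = Survived, Victims
observed = [[30, 20],
            [20, 30]]
statistic, degrees_of_freedom, expected = chi_square_independence(observed)
p = p_value_one_degree_of_freedom(statistic)
```

With observed cells only, a 2×2 table gives (2 - 1) × (2 - 1) = 1 degree of freedom. A table that also contains the marginal totals is treated as a 3×3 table of categories, which is why such an input reports 4 degrees of freedom.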
chi_squared, p, degrees_of_freedom, expected_frequency = scipy.stats.chi2_contingency( contingency_table )
print "Chi Squared: ", chi_squared
print "p value: ", p
print "Degrees of Freedom", degrees_of_freedom
print "Expected Frequency for The Group Passengers:", expected_frequency[0]
print "Expected Frequency for The Solo Passengers:", expected_frequency[1]
The statistical significance here is very high, but I think this can be affected by children: there is probably a good portion of children traveling with their families, which can lead to a higher than usual presence of women and children among the group passengers, and a lower than usual presence of women among the solo passengers (by our definition of a solo passenger, a child cannot be solo even if Parch and SibSp are equal to zero). Let us examine how many women and children are present in the groups.
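#Note: this selection is made over titanic_all_data (every passenger), not over group_passengers only
women_and_children_group_passengers = titanic_all_data[ (titanic_all_data['Sex'] == 'female') | \
(titanic_all_data['Generation'] == 'Child')]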
women_and_children_group_passengers_count = get_count(women_and_children_group_passengers)
print "Number of women and children in the group travelers: ", women_and_children_group_passengers_count
print "Percentage of women and children in the group traveler: ",\
(float(women_and_children_group_passengers_count)/ get_count(group_passengers))*100.0 , "%"
I was a bit shocked at first by such a high percentage. Grown-up men (males aged 17 and over) constitute less than 2% of the passengers traveling with their immediate family. On second thought, that can make sense, because there may be only one grown-up man in a family made of both parents and their children.
There could of course be other cases, besides a family made of both parents and their children, that bring more grown-up men into the count, like adult brothers traveling together or a grown-up son traveling with his elderly father.
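A share like the one discussed above is only meaningful when the numerator and the denominator are drawn from the same set of passengers. The following minimal sketch, in plain Python with made-up records (not the Titanic rows), filters the group subset itself before dividing. The function name `women_and_children_share` is hypothetical; the keys mirror the `Sex` and `Generation` columns used in this notebook.

```python
def women_and_children_share(passengers):
    """Percentage of the given passengers who are female or classified as 'Child'."""
    selected = [p for p in passengers
                if p['Sex'] == 'female' or p['Generation'] == 'Child']
    return 100.0 * len(selected) / len(passengers)

# Hypothetical group travelers, for illustration only
group_travelers = [
    {'Sex': 'female', 'Generation': 'Adult'},
    {'Sex': 'male', 'Generation': 'Child'},
    {'Sex': 'male', 'Generation': 'Adult'},
    {'Sex': 'female', 'Generation': 'Child'},
]
share = women_and_children_share(group_travelers)
```

Because the filter runs over the same list that is used as the denominator, the result can never exceed 100%.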
I will try to refine the hypothesis by excluding children from the group data, and compare only grown-up passengers (adults and elderly) from both groups.
Note: The two grown-up variables do not include passengers whose age is unknown; they contain only passengers that we know for sure are grown-ups.
passengers_by_generation = titanic_all_data.groupby('Generation')
#The output will be a dictionary, with keys Adult and Elderly
passengers_by_grown_up = {key: value for key, value in passengers_by_generation if key in ['Adult', 'Elderly']}
#Now, expand the values of the dictionary into a dataframe
grown_up_passengers = pd.concat([grown_up_passengers_values for keys, grown_up_passengers_values in passengers_by_grown_up.items()])
#Get the count of survival. This will be used as our expected count for the Chi square computation
grown_up_survival_count = get_count(grown_up_passengers.groupby('Survived'))
#The append function did not behave as I expected, i.e. it did not perform its action in place, nor did it have an inplace parameter.
#I had to assign the result back to the Series in question
#Also, index = [2] is used to prevent index duplicates, otherwise the total would be added with index 0. It would not really
#affect the calculation, but it mattered when I wanted to rename the indices for readability
grown_up_survival_count = grown_up_survival_count.append( pd.Series(len(grown_up_passengers), index = [2] ))
#Separate passengers by companionship
group_grown_up_passengers = grown_up_passengers.groupby('isSolo').get_group(0)
solo_grown_up_passengers = grown_up_passengers.groupby('isSolo').get_group(1)
#Get the survival for each group and solo grown up passengers
group_grown_up_passengers_survival_count = get_count(group_grown_up_passengers.groupby('Survived'))
solo_grown_up_passengers_survival_count = get_count(solo_grown_up_passengers.groupby('Survived'))
#Append the total to each series
group_grown_up_passengers_survival_count = group_grown_up_passengers_survival_count.append( pd.Series(len(group_grown_up_passengers), index = [2] ))
solo_grown_up_passengers_survival_count = solo_grown_up_passengers_survival_count.append( pd.Series(len(solo_grown_up_passengers), index = [2] ))
#Build the contingency table
contingency_table = pd.concat([ group_grown_up_passengers_survival_count, solo_grown_up_passengers_survival_count, grown_up_survival_count],\
axis=1, \
keys = ['Group', 'Solo', 'Total'])
contingency_table.rename(index= { 0 : 'Victims', 1 : 'Survivors', 2 : 'Total'},\
inplace = True)
contingency_table
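An alternative way to arrive at the same kind of table, sketched here on a small made-up dataframe rather than on the Titanic data, is to filter with `isin` and let `pd.crosstab` do the counting. The column names mirror the ones used in this notebook; the rows and the 'Group'/'Solo' labels are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical passengers, for illustration only
passengers = pd.DataFrame({
    'Generation': ['Adult', 'Child', 'Elderly', 'Adult', 'Adult', 'Unspecified age'],
    'isSolo': [True, False, True, False, True, True],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# Keep only passengers known to be grown-ups (adults and elderly)
grown_ups = passengers[passengers['Generation'].isin(['Adult', 'Elderly'])]

# Readable row labels instead of the raw booleans
companionship = grown_ups['isSolo'].map({False: 'Group', True: 'Solo'})

# Observed counts only: one row per companionship type, one column per survival state
observed = pd.crosstab(companionship, grown_ups['Survived'])
```

The resulting table holds observed cells only. If totals are wanted for display, `pd.crosstab` also accepts `margins=True`, in which case the totals would need to be dropped again before any test is run on the table.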
Next comes the chi-square test for the grown-up passengers, by companionship:
chi_squared, p, degrees_of_freedom, expected_frequency = scipy.stats.chi2_contingency( contingency_table )
print "Chi Squared: ", chi_squared
print "p value: ", p
print "Degrees of Freedom", degrees_of_freedom
print "Expected Frequency for The Group Passengers:", expected_frequency[0]
print "Expected Frequency for The Solo Passengers:", expected_frequency[1]
Well, after removing the children the result became less extreme by orders of magnitude, but it is still very statistically significant, even under the most conservative standards. The chance of getting such a sample if companionship had no effect is about 1 in 3000, which is very low indeed.
average_std = (group_passengers['Survived'].std() + solo_passengers['Survived'].std()) / 2
cohens_d = abs(solo_passengers['Survived'].mean() - group_passengers['Survived'].mean() )/average_std
print "Cohen d: ", cohens_d
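The cell above divides the difference in survival rates by the plain average of the two standard deviations. A common variant of Cohen's d instead pools the two sample variances, weighting each by its degrees of freedom, which matters when the two groups differ in size. The sketch below, in plain Python with hypothetical 0/1 survival flags (not the Titanic values), shows that variant; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / float(n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cohens_d_pooled(group_a, group_b):
    """Cohen's d using the pooled standard deviation of the two groups."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / float(n_a)
    mean_b = sum(group_b) / float(n_b)
    # Each variance is weighted by its degrees of freedom
    pooled_variance = ((n_a - 1) * sample_variance(group_a) +
                       (n_b - 1) * sample_variance(group_b)) / (n_a + n_b - 2)
    return abs(mean_a - mean_b) / math.sqrt(pooled_variance)

# Hypothetical survival flags (1 = survived), for illustration only
solo_flags = [0, 0, 0, 1]
group_flags = [1, 1, 0, 1]
d = cohens_d_pooled(solo_flags, group_flags)
```

When both groups have the same size and similar spread, the pooled version and the simple average give nearly the same value; they diverge as the group sizes become more unequal.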
χ2 = 20.82, p < 0.001, two tailed
Effect Size Measures:
NB: I have based this way of writing up the conclusion on the book "Statistics in Plain English"
A chi-square analysis was performed to determine whether traveling companionship affected a passenger's chances of survival. The analysis produced a significant χ2 value (39.54, df = 4, p < .001), indicating that traveling with first-degree family affected the chances of survival. The question was then refined by removing the child passengers, since they had the best survival chances and were absent from the solo passengers group because of how we defined a solo traveler. The new χ2 value remained significant (20.82, df = 4, p < .001), although the result was less extreme than for the original question by orders of magnitude. Therefore, we must reject the null hypothesis.
One limitation I see in this analysis is the gender imbalance between the two groups. The majority of solo travelers are men (mostly from the third class), and that can be an alternative explanation for the significance. It might have been appropriate to explore the question further by separating the genders within each group and comparing like with like (i.e. solo women vs. group women, and solo men vs. group men), but unfortunately the count of adult men among the group travelers was too low to perform such a test. If the dataset were complete, such a test might have been feasible.
Another limitation is that, since the question under investigation involved categorizing passengers by their age, missing ages meant that the test was not run over the whole sample, but only over a fraction of it. I opted to omit passengers with missing ages rather than make any other assumption (like assuming their age is the mean or median, or that their age distribution is the same as that of the bigger sample), since I already had a good amount of data to run the test; there was no need to take any risks by making extra assumptions.
The data field that I would have liked to have, although it can be hard to get (especially for adult passengers), is who each passenger was traveling with other than the immediate family. Cousins, friends, etc. could have proved useful for such a question. The reason I picked this question in the first place is that I imagined what the situation on deck might have been during such a hard time, and I thought that a group of people who care for each other could be useful in such an extreme situation: they can push, fight, beg or even do something illegal like bribing to save the rest of the group, a privileged support that a solo traveler would not have. Of course the immediate family might be the most aggressive or protective, but cousins and close friends can still provide comparable support.
The problem with this question is that the data is only partial, so the numbers will not add up: there could be other family members in the other set. But I will proceed anyway. So here are my assumptions:
group_family = group_passengers[ group_passengers['TotalCompanions'] > 0 ]
group_by_family_size = group_family.groupby(['TotalCompanions','Pclass', 'LastName'])
#We are going to create three lists, one for all the families whose members have NOT survived
#Another for the families which had some members who survived but not all
#The third is for the families whose all members were able to make it
families_totally_perished = []
families_partially_saved = []
families_totally_saved = []
#Now, let's loop over the groupby of family by size created above, and check which family belongs
#to which of the three lists created above
#Loop over the group
#tc = TotalCompanions, pcls = pclass, lname = LastName. There was no need to unpack the keys, but I am going to leave it
#that way.
for (tc, pcls, lname), data in group_by_family_size:
    #A variable to count how many members survived within the same family
    saved_count = 0
    #Loop over each member within the same family
    for current_family_member in range(len(data)):
        #If the current member in the loop has survived, increment the saved count
        if data.iloc[current_family_member]['Survived'] == 1:
            saved_count += 1
    #After looping over all family members, now let us check the count to decide in which of the 3 lists
    #are we going to add the family to
    if saved_count == 0:
        families_totally_perished.append(data)
    #If the total count is not equal to the total companions + 1. We add one here because the family size
    # is always greater than the total companions by one. This is because the count always tells the number
    #of other family members within the family of the passenger, but does not count the passenger himself\herself
    elif (saved_count != (data.iloc[0]['TotalCompanions'] + 1) ):
        families_partially_saved.append(data)
    else:
        families_totally_saved.append(data)
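The decision rule used in the loop above can be sketched in isolation, in plain Python. The function name `classify_family` and the extra `fully_observed` flag are hypothetical additions for illustration. The flag makes explicit one consequence of the partial data: a family whose members in the dataset all survived is still reported as partially saved when some of its members are missing from the dataset, because the saved count is compared against the full family size.

```python
def classify_family(survived_flags, total_companions):
    """Classify one family from the 0/1 survival flags of its members found in the dataset.

    Returns (outcome, fully_observed), where fully_observed tells whether every
    member of the family is present in the dataset.
    """
    # The family size is the number of companions plus the passenger himself/herself
    family_size = total_companions + 1
    saved_count = sum(survived_flags)
    fully_observed = len(survived_flags) == family_size
    if saved_count == 0:
        outcome = 'totally perished'
    elif saved_count == family_size:
        outcome = 'totally saved'
    else:
        outcome = 'partially saved'
    return outcome, fully_observed

# Hypothetical families, for illustration only
complete_family = classify_family([1, 1], total_companions=1)
incomplete_family = classify_family([1, 1], total_companions=3)
lost_family = classify_family([0, 0, 0], total_companions=2)
```

Keeping the `fully_observed` flag next to the outcome would allow the families with missing members to be read separately from the ones that are complete in the dataset.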
#A function that will display each list as a table. The three lists to display are:
#families_totally_perished
#families_partially_saved
#families_totally_saved
def display_family_list(family_list):
    #Loop over each family within the passed list
    for family in family_list:
        #Create the container dataframe. This dataframe will temporarily hold all members that belong to the same family
        #Later on, this dataframe will be used to display the family in a table
        df = pd.DataFrame( columns=("Generation", 'Sex', 'Age', 'Name', 'Class','PassengerId'))
        #Print the name of the family, and then print on a new line how many members the family has
        print "The " + str(family.iloc[0]['LastName']) + " Family: "
        print "The Family had " + str(family.iloc[0]['TotalCompanions'] + 1) + " members on board"
        #Write how many family members from the current family were present in the data set
        #If the number of family members within the set is equal to the total family size (Companions + 1),
        #then print that all family members were present in the data set
        if( (len(family)) == (family.iloc[0]['TotalCompanions'] + 1) ):
            print "All of the family members were available in the dataset"
        #Else, write how many family members were present in the data set
        else:
            #Just some nice formatting, divide the singular and plural
            if(len(family) == 1):
                print ' --> Only one member was available in the dataset'
            else:
                print " --> Only "+ str(len(family)) + " members were available in the dataset"
        #Loop over each family member within the current family
        for i in range(len(family)):
            #Append the current family member to the dataframe, so that member would be displayed inside the same table.
            df = df.append( {"Generation":family.iloc[i]['Generation'],\
                             'Sex':family.iloc[i]['Sex'], \
                             'Age':family.iloc[i]['Age'], \
                             'Name': family.iloc[i]['Name'],\
                             'Class': family.iloc[i]['Pclass'],\
                             'PassengerId': family.iloc[i]['PassengerId'] } , ignore_index=True )
        display_html(df)
        print '_________________________________________________________________'
display_family_list(families_totally_perished)
display_family_list(families_partially_saved)
display_family_list(families_totally_saved)
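The per-family table built row by row in `display_family_list` can also be obtained by selecting and renaming columns in one step. The sketch below illustrates this on a made-up family; the helper name `family_table`, the passenger names and the values are hypothetical, while the column names follow the ones used in this notebook.

```python
import pandas as pd

def family_table(family):
    """Return the display columns of one family, with 'Pclass' shown as 'Class'."""
    columns = ['Generation', 'Sex', 'Age', 'Name', 'Pclass', 'PassengerId']
    return family[columns].rename(columns={'Pclass': 'Class'}).reset_index(drop=True)

# A hypothetical family of two, for illustration only
family = pd.DataFrame({
    'PassengerId': [9001, 9002],
    'Name': ['Doe, Mr. John', 'Doe, Mrs. John (Jane Roe)'],
    'LastName': ['Doe', 'Doe'],
    'Sex': ['male', 'female'],
    'Age': [40.0, 38.0],
    'Generation': ['Adult', 'Adult'],
    'Pclass': [2, 2],
    'Survived': [0, 1],
})
table = family_table(family)
```

Selecting the columns directly avoids growing a dataframe inside a loop and keeps the original column types.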
The dataset