
====================================================================
                            TITANIC

====================================================================

Use the passenger data from the Titanic shipwreck to answer the question
"what sorts of people were more likely to survive?"

You will be given passenger attributes (name, age, gender, socio-economic class, etc.)

====================================================================

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)
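
A minimal sketch of loading the two files with pandas. The helper name load_titanic is hypothetical, and lower-casing the column names is an assumption made so that the columns match the spelling used in the description below (pclass, sibsp, parch); the actual header spelling in the CSV files may differ.

```python
import pandas as pd


def load_titanic(source):
    """Read a Titanic CSV (path or file-like object) into a DataFrame.

    Column names are lower-cased (an assumption of this sketch) so that
    they match the names used in the lab description.
    """
    df = pd.read_csv(source)
    df.columns = df.columns.str.lower()
    return df
```

Typical use would be train = load_titanic("train.csv") and test = load_titanic("test.csv"), followed by train.info() to see column types and the number of non-missing values.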

Column description:

pclass: A proxy for socio-economic status (SES)
1st = Upper, 2nd = Middle, 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: Number of siblings and spouses aboard. The dataset defines these family relations as follows:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife

parch: Number of parents and children aboard. The dataset defines these family relations as follows:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

1) Explore the data:
- make plots for each feature
  (e.g. scatter plots, histograms, box plots, heat maps),
  at least two types of plot per feature
- decide which columns should be used for the prediction
  hint: if a column is missing a lot of values, that is a sign to drop the whole column
- calculate the correlation of each feature with survival
- think about engineering your own features, e.g. family size from sibsp & parch
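
The numeric parts of the exploration could be sketched as follows. The function names are hypothetical, the lower-case column names (survived, sibsp, parch) are assumed to match the description above, and counting the passenger in family_size (the "+ 1") is just one possible definition.

```python
import pandas as pd


def missing_fraction(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def add_family_size(df):
    """Return a copy with an engineered family_size column.

    family_size = sibsp + parch + 1 (the passenger is counted too;
    this definition is an assumption of the sketch).
    """
    out = df.copy()
    out["family_size"] = out["sibsp"] + out["parch"] + 1
    return out


def survival_correlation(df, target="survived"):
    """Pearson correlation of every numeric column with the target."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values()
```

Columns at the top of missing_fraction(train) are candidates for dropping. Text columns such as sex do not appear in survival_correlation until they are encoded as numbers.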

2) We will start with a decision tree, as it is the most intuitive model
a) train DecisionTreeClassifier (test the depth parameter)
- calculate the accuracy for different tree structures (e.g. depth, number of features)*
- visualise the trees (use meaningful labels)
b) do the training again using RandomForestClassifier

* you can use GridSearchCV here
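
One way to approach part 2 with scikit-learn is sketched below. The feature selection in prepare_features (pclass, sex, age, family size), the median fill for missing ages, the parameter grid and all function names are assumptions made for illustration, not the prescribed solution; the lower-case column names are assumed to match the description above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text


def prepare_features(df):
    """Build a small numeric feature matrix X and the target y."""
    X = pd.DataFrame({
        "pclass": df["pclass"],
        "is_female": (df["sex"] == "female").astype(int),
        # Missing ages are filled with the median (an assumed strategy).
        "age": df["age"].fillna(df["age"].median()),
        "family_size": df["sibsp"] + df["parch"] + 1,
    })
    return X, df["survived"]


def accuracy_by_depth(X, y, depths=(1, 2, 3, 4, 5), cv=5):
    """Mean cross-validated accuracy of a decision tree per max_depth."""
    scores = {}
    for depth in depths:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        scores[depth] = cross_val_score(
            tree, X, y, cv=cv, scoring="accuracy"
        ).mean()
    return scores


def tune_tree(X, y, cv=5):
    """Grid search over depth and number of features tried per split."""
    grid = {"max_depth": [2, 3, 4, 5], "max_features": [None, 2, 3]}
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0), grid, cv=cv, scoring="accuracy"
    )
    return search.fit(X, y)


def describe_tree(tree, feature_names):
    """Text rendering of a fitted tree, labelled with the feature names."""
    return export_text(tree, feature_names=list(feature_names))


def forest_accuracy(X, y, cv=5):
    """Mean cross-validated accuracy of a random forest."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(forest, X, y, cv=cv, scoring="accuracy").mean()
```

After X, y = prepare_features(train), the fitted search object exposes best_params_, best_score_ and best_estimator_. For a graphical tree instead of text, sklearn.tree.plot_tree accepts the same feature_names and also a class_names argument.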


... to be continued next week 

So there is no homework this week (yet)