The importance of (big) data

Exercise 1

Calculate the average protein length and amino acid 
content for different data sets:

a) E. coli, Bacillus subtilis, human, yeast,
A. thaliana, D. melanogaster, C. elegans, mouse,
zebrafish (D. rerio)

b) PDB

c) UniProt
- full UniProt (Swiss-Prot)
- 200 randomly selected Bacteria
- 200 randomly selected Viruses
- 200 randomly selected Archaea
- 200 randomly selected Eukaryota
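
A minimal sketch of the calculation for a single data set, assuming the
sequences are available as FASTA text. The function names (parse_fasta,
average_length, aa_content) and the toy sequences are made up for
illustration; a real run would read a downloaded file instead.

```python
from collections import Counter


def parse_fasta(text):
    """Return the list of sequences found in FASTA-formatted text."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def average_length(sequences):
    """Mean number of residues per protein."""
    return sum(len(s) for s in sequences) / len(sequences)


def aa_content(sequences):
    """Fraction of each amino acid, with all residues pooled together."""
    counts = Counter()
    for s in sequences:
        counts.update(s)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Toy input (hypothetical records, not real proteins).
toy_fasta = """>seq1 hypothetical
MKTAY
IAK
>seq2 hypothetical
MAL
"""

seqs = parse_fasta(toy_fasta)
print(average_length(seqs))
print(aa_content(seqs))
```

The same three functions can be applied to each organism, to PDB and to
each UniProt subset, so that all results are computed in the same way.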

Make plots comparing average protein length between:
- selected organisms (a)
- all taxonomic groups (c)
- PDB vs UniProt

For amino acid content, also calculate an error estimate
(e.g. standard deviation).
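
One possible way to get such an error estimate, shown as a sketch: compute
the fraction of each amino acid in every protein separately, then take the
mean and the sample standard deviation over proteins. This definition is an
assumption (the exercise does not prescribe one), and the name
content_mean_sd and the toy sequences are made up for illustration.

```python
from statistics import mean, stdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def content_mean_sd(sequences):
    """Per amino acid: (mean, sample SD) of its per-protein fraction.

    Needs at least two non-empty sequences, because the sample SD is
    undefined for a single value.
    """
    result = {}
    for aa in AMINO_ACIDS:
        fractions = [s.count(aa) / len(s) for s in sequences]
        result[aa] = (mean(fractions), stdev(fractions))
    return result


# Toy input (hypothetical sequences).
toy = ["MKTAYIAK", "MALA", "MAAK"]
stats = content_mean_sd(toy)
print(stats["A"])
```

Note that the mean of per-protein fractions weights every protein equally,
so it can differ from the fraction obtained by pooling all residues.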

Moreover, check which amino acid is the most frequent
at the N-terminus. Can you explain why?
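
The N-terminus check can be sketched as a count over the first residue of
each sequence. The function name n_terminal_counts and the toy sequences
are made up for illustration; real counts have to come from the data sets
above.

```python
from collections import Counter


def n_terminal_counts(sequences):
    """Count how often each amino acid appears as the first residue."""
    return Counter(s[0] for s in sequences if s)


# Toy input (hypothetical sequences).
toy = ["MKTAY", "MAL", "SKV", "MGG"]
counts = n_terminal_counts(toy)
print(counts.most_common(1))
```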

Additional material:

1) Finish the plots for avg protein content for UniProt.
2) Find on the internet any plot you would like to show at the
next lesson (think about how to improve it).

