==== The importance of (big) data ====

Exercise 2

a) E. coli, human, yeast,
A. thaliana, D. melanogaster, C. elegans, mouse,
zebrafish (D. rerio), Bacillus subtilis

Prepare a bar plot (matplotlib) showing the protein length for all 9 organisms:
- label both the x and y axes
- place bars of the same group together and give each group a different color
- add a legend (upper-left corner)
- add an error bar to each bar
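
The requirements above can be sketched roughly as follows. This is only one
possible layout, under several assumptions that are not part of the exercise:
the bar height is the arithmetic mean, the error bar is the sample standard
deviation, and the organism names, group names and lengths below are invented
toy values. The function names are hypothetical.

```python
from statistics import mean, stdev


def length_stats(lengths):
    """Return (mean, sample standard deviation) for a list of protein lengths."""
    return mean(lengths), stdev(lengths)


def plot_mean_lengths(lengths_by_organism, group_of, outfile="protein_length.png"):
    """One bar per organism, colored by group, with standard-deviation error bars."""
    # Imported here so length_stats() stays usable without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # write to a file; no display required
    import matplotlib.pyplot as plt
    from matplotlib.patches import Patch

    organisms = list(lengths_by_organism)
    groups = sorted(set(group_of[o] for o in organisms))
    palette = dict(zip(groups, plt.cm.tab10.colors))
    stats = [length_stats(lengths_by_organism[o]) for o in organisms]

    fig, ax = plt.subplots(figsize=(9, 4))
    ax.bar(
        range(len(organisms)),
        [m for m, _ in stats],
        yerr=[s for _, s in stats],  # assumption: error bar = standard deviation
        capsize=4,
        color=[palette[group_of[o]] for o in organisms],
    )
    ax.set_xticks(range(len(organisms)))
    ax.set_xticklabels(organisms, rotation=45, ha="right")
    ax.set_xlabel("Organism")
    ax.set_ylabel("Mean protein length (amino acids)")
    # One legend entry per group rather than per bar.
    ax.legend(handles=[Patch(color=palette[g], label=g) for g in groups],
              loc="upper left")
    fig.tight_layout()
    fig.savefig(outfile)
    plt.close(fig)


# Toy lengths, invented for illustration only (not real proteome data).
toy_lengths = {"organism A": [100, 200, 300], "organism B": [150, 450, 600]}
toy_groups = {"organism A": "group 1", "organism B": "group 2"}
```

Calling plot_mean_lengths(toy_lengths, toy_groups) writes the figure to
protein_length.png. Sorting the organisms by group before plotting keeps bars
of the same color next to each other.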

Calculate the percentage content of all amino acids and prepare a table (PrettyTable module).
Additionally, prepare a bar plot of the percentage content of all amino acids for E. coli,
human and yeast (i.e. a group of three bars for each amino acid).
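
A minimal sketch of the composition step, assuming each proteome is already
available as a list of sequence strings. The helper names are hypothetical, and
the choice to ignore non-standard letters (for example X or U) when computing
the total is an assumption, not something the exercise prescribes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_percentages(sequences):
    """Percentage of each standard amino acid over all given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    # Assumption: letters outside the 20 standard residues are not counted.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def composition_table(percent_by_organism):
    """Build a PrettyTable: one row per amino acid, one column per organism."""
    from prettytable import PrettyTable  # third-party: pip install prettytable

    organisms = list(percent_by_organism)
    table = PrettyTable()
    table.field_names = ["Amino acid"] + organisms
    for aa in AMINO_ACIDS:
        table.add_row(
            [aa] + [f"{percent_by_organism[o][aa]:.2f}" for o in organisms]
        )
    return table
```

For example, print(composition_table({"toy": aa_percentages(["AAG", "G"])}))
prints a one-column table. The same dictionaries can feed the grouped bar plot
by offsetting the x positions of the three organisms by one bar width each.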

b) PDB
- calculate the average protein length and the percentage content of all amino acids (just numbers)

Compare the results with those from (a). Can you explain the difference
(hint: open the file in a text editor)?
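
A rough sketch of reading sequences and computing the average length, assuming
the data are in FASTA format (the exercise does not name the file or its
format). The function names and the two records in the demo are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def average_length(records):
    """Arithmetic mean of the sequence lengths in (header, sequence) pairs."""
    lengths = [len(seq) for _, seq in records]
    return sum(lengths) / len(lengths)


# Tiny in-memory example, invented for illustration only.
demo = io.StringIO(">entry1 some description\nMKV\nLL\n>entry2\nMA\n")
demo_records = list(read_fasta(demo))
print(average_length(demo_records))
```

Keeping the header next to each sequence makes it easy to print a few records
and check what each entry actually describes before trusting the averages.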

c) UniProt
- full UniProt (Swiss-Prot)
- 200 randomly selected Bacteria
- 200 randomly selected Viruses
- 200 randomly selected Archaea
- 200 randomly selected Eukaryota

Prepare plots and a table similar to those in (a).
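
The random selection could be done along these lines. This is a sketch under
the assumption that the entries of each group (proteins or organisms, whichever
reading of the task is intended) have already been collected into lists; the
function name, the fixed seed and the toy identifiers are hypothetical.

```python
import random


def sample_per_group(entries_by_group, k=200, seed=0):
    """Draw up to k entries per group without replacement, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the report can be regenerated
    return {
        group: rng.sample(entries, min(k, len(entries)))
        for group, entries in entries_by_group.items()
    }


# Toy identifiers, invented for illustration only.
pool = {"Bacteria": list(range(1000)), "Archaea": list(range(50))}
picked = sample_per_group(pool)
print({group: len(items) for group, items in picked.items()})
```

Groups with fewer than k entries are returned whole (in shuffled order) instead
of raising an error.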

d) data exploration:
- for each organism (a) and kingdom (c), make a separate histogram of protein length
- calculate and plot the median instead of the arithmetic mean
- instead of bar plots, use the "boxplot" function (protein length only)
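
One way to approach the three points above, as a sketch only: it assumes the
lengths are held in a dictionary keyed by organism or kingdom name, and the
function names, bin count and output file names are arbitrary, hypothetical
choices.

```python
from statistics import median


def median_lengths(lengths_by_name):
    """Median protein length for each organism or kingdom."""
    return {name: median(lengths) for name, lengths in lengths_by_name.items()}


def plot_distributions(lengths_by_name, prefix="lengths"):
    """Save one histogram per name and a single box plot comparing all names."""
    # Imported here so median_lengths() stays usable without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # write to files; no display required
    import matplotlib.pyplot as plt

    names = list(lengths_by_name)
    for i, name in enumerate(names):
        fig, ax = plt.subplots()
        ax.hist(lengths_by_name[name], bins=50)
        # Mark the median on the histogram.
        ax.axvline(median(lengths_by_name[name]), color="black",
                   linestyle="--", label="median")
        ax.set_title(name)
        ax.set_xlabel("Protein length (amino acids)")
        ax.set_ylabel("Number of proteins")
        ax.legend()
        fig.savefig(f"{prefix}_hist_{i}.png")
        plt.close(fig)

    fig, ax = plt.subplots(figsize=(9, 4))
    ax.boxplot([lengths_by_name[name] for name in names])
    ax.set_xticklabels(names, rotation=45, ha="right")
    ax.set_ylabel("Protein length (amino acids)")
    fig.tight_layout()
    fig.savefig(f"{prefix}_boxplot.png")
    plt.close(fig)
```

The box plot already draws the median as the line inside each box, so no
separate median bar is needed there.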

Discuss which is better: the median or the arithmetic mean (pros and cons)?
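
As a starting point for the discussion, a small numeric example can show how
the two statistics react to a single extreme value. The lengths below are
invented toy numbers, not measurements.

```python
from statistics import mean, median

# Four ordinary lengths plus one very long protein (toy values).
toy = [250, 300, 320, 350, 35000]

toy_mean = mean(toy)      # pulled far upwards by the single long protein
toy_median = median(toy)  # stays at the middle value

print(toy_mean, toy_median)
```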


Moreover, which amino acid is the most frequent
at the N-terminus? Can you justify why? Is it the
same in every organism?
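
Counting the first residue of every sequence is enough to answer the first
part. A minimal sketch, assuming sequences are plain strings; the function name
and the toy sequences are hypothetical.

```python
from collections import Counter


def n_terminal_frequencies(sequences):
    """Percentage of each residue at position 1, most frequent first."""
    counts = Counter(seq[0].upper() for seq in sequences if seq)
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.most_common()}


# Toy sequences, invented for illustration only.
toy_freq = n_terminal_frequencies(["MKV", "MA", "AG", ""])
print(toy_freq)
```

Running the same function separately for each organism gives the per-organism
comparison asked for in the last question.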


Additional material:

Prepare a short report (pdf) containing all of the above plots and tables, together with answers to the above questions, and send it by 08.03.2020.