Numerical Methods in Chemistry with MATLAB

Final Work

 Problem 1

In this problem you will use experimentally measured protein-DNA binding data from human embryonic stem cells (ESCs). The data come from the paper by Meissner et al.

You are given a list of 10 protein-DNA binding experiments, one for each of 10 proteins. Each experiment measures the coordinates of the protein-DNA binding locations (peaks). The list of proteins and the names of the data files can be found in the file:

ANNOTATION_ChIP_seq_NM.xlsx

 

For example, the data file GSM1505704_Otx2_020912_h64.bed.peak.txt contains 100778 lines, each with the following format:

chr#    Peak Start    Peak End    Peak#    Peak Strength
1       11067         11747       1        66.18

Here chr# is the chromosome number, Peak Start and Peak End represent the genomic coordinates of the peak start and the peak end, respectively.
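The five-column format above can be parsed along the following lines. This is only a minimal sketch: the second sample line, the variable names, and the idea that the chromosome column may appear as either "22" or "chr22" are assumptions made for illustration, not facts about the real data files.

```matlab
% Two illustrative lines in the five-column peak format. The first is the
% example from the text; the second is made up for this sketch.
raw = sprintf('1\t11067\t11747\t1\t66.18\n22\t17000100\t17000400\t2\t12.50\n');

% Read the chromosome column as text, because real files may also
% contain labels such as X or Y (an assumption about the data).
cols = textscan(raw, '%s %f %f %f %f');

chrom     = cols{1};
peakStart = cols{2};
peakEnd   = cols{3};

% Keep only the peaks on chromosome 22, accepting "22" or "chr22".
onChr22    = strcmp(chrom, '22') | strcmp(chrom, 'chr22');
chr22Start = peakStart(onChr22);
chr22End   = peakEnd(onChr22);
```

For a real data file, the same format string can be passed to textscan together with a file identifier obtained from fopen, followed by fclose.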

 

To extract the DNA sequences, you need the sequence of chr22. It is uploaded in Moodle as the file chr22.fa in the folder Final Work Chromosome File. To read this file, use the MATLAB function fastaread().
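As a rough illustration of the extraction step, the sketch below reads a FASTA file with fastaread() and cuts out one peak. It uses a tiny stand-in file instead of chr22.fa so that it runs on its own; the toy sequence, the peak coordinates, and the choice to treat Peak Start and Peak End as 1-based inclusive indices are all assumptions to be checked against the real data.

```matlab
% Write a tiny stand-in FASTA file so the sketch runs without chr22.fa.
fastaFile = [tempname '.fa'];
fid = fopen(fastaFile, 'w');
fprintf(fid, '>toy_chr22\nACGTACGTGGCC\nTTAACCGG\n');
fclose(fid);

% fastaread (Bioinformatics Toolbox) returns a struct with the fields
% Header and Sequence; the sequence lines of a record are joined.
chr = fastaread(fastaFile);

% Assumption: Peak Start and Peak End are used directly as 1-based,
% inclusive MATLAB indices. BED-style files are often 0-based at the
% start, so the convention of the real data should be verified.
peakStart = 5;
peakEnd   = 10;
boundSeq  = upper(chr.Sequence(peakStart:peakEnd));

delete(fastaFile);
```

Each extracted sequence can then be written to the per-protein text file with fprintf, one sequence per line.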

 

1. Write a MATLAB script that reads the file ANNOTATION_ChIP_seq_NM.xlsx, automatically extracts the DNA sequences bound on chr22 by each protein, and saves them in separate text files, one file per protein. (20 points)

2. Continue your MATLAB script and analyze the GC content of the DNA sequences extracted from chr22 for each protein. The GC content is defined for each DNA sequence as GC_CONTENT = (N_G + N_C)/L, where N_G and N_C are the numbers of G and C nucleotides, respectively, and L is the sequence length. Automatically generate a panel with 10 plots, where each plot shows the normalized probability distribution of the GC content of the bound DNA sequences for a given protein. Label the axes and automatically generate a title with the protein name for each graph. (20 points)

3. Plot all 10 graphs from (2) in one combined plot. Automatically generate the legend (showing the protein names) for the curves. (10 points)

4. Plot the results obtained in (2) using the boxplot() function. (10 points)

5. Compute the Kolmogorov-Smirnov p-values between the obtained probability distributions for all possible pairs of proteins. Your code should generate a table containing the resulting p-values. What conclusions can you draw? (10 points)

6. Repeat (2)-(5), analyzing, instead of the GC content, the relative frequency of occurrence of each of the following four sequence patterns:

1:  [CC] or [GG]

2:  [CNC] or [GNG]

3:  [CNNC] or [GNNG]

4:  [CG]

Here ‘N’ stands for any nucleotide type: A, T, C, or G. You should consider each pattern separately. The relative frequency of a pattern is defined for each sequence as the number of occurrences of the pattern in this sequence divided by the sequence length. (20 points)

7. Based on your results, what conclusions can you draw regarding protein-DNA binding specificity? To answer this question, consider which of the four patterns in (6) gives the statistically largest difference between different proteins. From your analysis of the Kolmogorov-Smirnov p-values (see (5)) between the obtained probability distributions for all possible pairs of proteins, you should be able to draw conclusions about the DNA sequence specificity. (10 points)
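The per-sequence quantities in (2) and (6) and the pairwise comparison in (5) can be sketched as follows. This is a hedged illustration, not a reference solution: the toy sequences, the names ProteinA and ProteinB, the helper names, the decision to count overlapping pattern occurrences, and the convention of putting 1 on the diagonal of the p-value table are all assumptions.

```matlab
% Toy sequences standing in for the chr22 sequences of two proteins;
% they are invented for this sketch and are not real data.
seqs.ProteinA = {'ACGTGGCC', 'CCGGCGCG', 'GGGCCCAT', 'CGCGTTGC'};
seqs.ProteinB = {'ATATTTAA', 'TTAACGAT', 'AATTCATA', 'TATAGTTA'};
names = fieldnames(seqs);
nProt = numel(names);

% GC content of one sequence: (N_G + N_C) / L.
gcContent = @(s) sum(s == 'G' | s == 'C') / numel(s);

% Number of C...C or G...G pairs whose two letters are k positions apart
% (k = 1: CC/GG, k = 2: CNC/GNG, k = 3: CNNC/GNNG). Overlapping
% occurrences are counted, which is an assumption of this sketch.
pairCount = @(s, k) sum((s(1:end-k) == 'C' & s(1+k:end) == 'C') | ...
                        (s(1:end-k) == 'G' & s(1+k:end) == 'G'));

% Number of CG dinucleotides (pattern 4).
cgCount = @(s) sum(s(1:end-1) == 'C' & s(2:end) == 'G');

% Relative frequency of pattern 1 for a sequence: count / length.
ccggFreq = @(s) pairCount(s, 1) / numel(s);

% GC content of every sequence, one vector per protein.
gc = cell(nProt, 1);
for i = 1:nProt
    gc{i} = cellfun(gcContent, seqs.(names{i}));
end

% Pairwise two-sample Kolmogorov-Smirnov p-values. kstest2 belongs to
% the Statistics and Machine Learning Toolbox.
pValues = ones(nProt);          % diagonal left at 1 by convention
for i = 1:nProt
    for j = i+1:nProt
        [~, p] = kstest2(gc{i}, gc{j});
        pValues(i, j) = p;
        pValues(j, i) = p;
    end
end
pTable = array2table(pValues, 'VariableNames', names.', 'RowNames', names);
```

Replacing gcContent in the loop with one of the pattern-frequency helpers gives the corresponding table for (6). For the plots in (2), histogram(gc{i}, 'Normalization', 'probability') inside a subplot loop is one possible way to obtain a normalized distribution per protein.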

Write your MATLAB script for all questions in a single .m file! Mark each question clearly in your script! Also submit a Word or PDF file summarizing your results and graphs.
