Harvard University, fas   
Fall 2002   


Biophysics 101

Genomics and Computational Biology

George Church, Ph.D. (HMS)














Project Ideas, 2002

Genie Hainsworth


I work at Harvard Medical School. I live in Davis Square, Somerville, and I pass by MIT and Harvard on my way home, so I can meet in any of those places in the evening, or at lunchtime in the Longwood area.

Protein-Protein Interactions: Network Structures

I would like to look at data from protein-protein interaction experiments, and construct networks showing their connections. By examining the structure and connectivity of these networks, I hope we can deduce something about the roles of particular proteins.

For example, a protein that plays a part in a signal cascade might have a small-world network: connections to first-degree neighbors, which then connect to second-degree neighbors, etc. On the other hand, a protein that is central to a large complex might look like a hub in a hierarchical network.  The goal is to be able to predict these roles from just binary (pairwise) interaction data.

The project would involve finding an appropriate set of binary interaction data (pairs of proteins) and using that to construct networks. Eventually I want to use protein microarray data (see below*), but to start I think yeast two-hybrid data would be appropriate. For validation, we could use large-scale immunoprecipitation data to verify what proteins are involved in complexes.
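As a rough illustration of the first step, the sketch below builds an undirected network from binary interaction pairs and computes two simple structural measures per protein: degree and local clustering coefficient. The protein names and the toy pairs are hypothetical, and these two measures are only one possible way to separate a member of a dense complex from a link in a chain.

```python
from collections import defaultdict

def build_network(pairs):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)

def degree(network, protein):
    """Number of distinct interaction partners."""
    return len(network.get(protein, ()))

def clustering_coefficient(network, protein):
    """Fraction of a protein's partner pairs that also interact with each other."""
    nbrs = sorted(network.get(protein, ()))
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in network[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy data: P1..P4 form a fully connected complex, S1..S4 form a chain.
pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"),
         ("P2", "P4"), ("P3", "P4"), ("S1", "S2"), ("S2", "S3"), ("S3", "S4")]
network = build_network(pairs)
for protein in ("P1", "S2"):
    print(protein, degree(network, protein), clustering_coefficient(network, protein))
```

In this toy example the complex member has a high clustering coefficient and the chain member has none, which is the kind of contrast the project would try to detect in real two-hybrid data.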

* Protein microarray data: Niro Ramachandran, a post-doc in the Institute of Proteomics lab at Harvard Medical School, is developing a protein chip to study protein-protein interactions (http://www.hip.harvard.edu/research/protein_microarray/index.htm).

The protein chip is created by depositing a microarray of different plasmid DNAs on a glass slide, then using in vitro transcription/translation (IVT) to produce the corresponding proteins. The genes in the plasmids all include an affinity tag, so that as the proteins are made, they become immobilized on the slide. Part of my work is already to analyze and quantitate scanned images of these slides. We may have a small amount of real data by early December, which I would like to start analyzing with the method we develop in the project.

Steven Corsello


Aim: To correlate microarray data with the promoter site consensus sequence for a specific transcription factor.

Most transcription factors bind a known consensus sequence in the promoter region of a gene. However, often the sequence contains degeneracies, such as TTNNNNNAA. This project would develop a model in which to score tendencies for a particular base to be incorporated into the site, and then compare this result with the fold of gene induction reported on the microarray.
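One possible starting point, sketched below, is to locate every promoter site that matches a degenerate consensus and then tabulate which bases actually occupy each position. The consensus TTNNNNNAA comes from the text above; the promoter string and the function names are hypothetical, and the per-position counts are only raw material for a scoring model, not the model itself.

```python
import re
from collections import Counter

# IUPAC degenerate nucleotide codes mapped to regular-expression character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]"}

def find_sites(consensus, promoter):
    """Return every (possibly overlapping) site in promoter that matches consensus."""
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    # The lookahead keeps overlapping matches that a plain search would skip.
    return [m.group(1) for m in re.finditer("(?=(" + pattern + "))", promoter.upper())]

def base_counts_by_position(sites):
    """Count which base appears at each position across equally long sites."""
    if not sites:
        return []
    return [Counter(site[i] for site in sites) for i in range(len(sites[0]))]

# Hypothetical promoter fragment containing one match to the consensus.
sites = find_sites("TTNNNNNAA", "GGTTACGTCAACC")
print(sites)
print(base_counts_by_position(sites))
```

The counts at the N positions, collected over many promoters, could then be compared with the fold induction of the corresponding genes.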

This project can be done in either a human or yeast system. Yeast would likely be more straightforward since gene transcription is better characterized in this system, and the promoter region is easier to identify.

My background is in biochemistry and cell biology, so a computer scientist, statistician, or someone with experience interpreting microarray data would be particularly helpful.

Joshua Rene Lacsina


I would like to focus on genomic analysis of parasitic human pathogens, particularly Plasmodium falciparum, one of the causative agents of malaria, and Leishmania major, the causative agent of leishmaniasis, an endemic disease primarily affecting the third world. The genome sequence of Plasmodium falciparum has been completed, with significant annotation published for chromosomes 2 and 3. In contrast, only chromosome 1 of Leishmania has been completed. I developed the following project idea based on a series of articles on malaria genomics published in Molecular and Biochemical Parasitology, available electronically via Hollis:

Finding novel motifs in the P. falciparum genome based on an algorithm that searches for repetitions in DNA sequences. I still haven't thought of a good way to enrich my dataset for biologically-relevant/interesting things, though the complete sequence of the mosquito genome could be useful...I'm open to suggestions...
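The simplest form of a repetition search is k-mer counting, sketched below. This is a generic illustration rather than the algorithm from the cited articles; the function name and the example sequence are hypothetical. In a genome with strongly skewed base composition, raw counts would need to be compared against a background model before any repeat is called interesting.

```python
from collections import Counter

def repeated_kmers(sequence, k, min_count=2):
    """Return {k-mer: count} for every k-mer seen at least min_count times."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# Hypothetical toy sequence: only ATAT occurs twice among the 4-mers.
print(repeated_kmers("ATATATGCGC", 4))
```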

Let me know via e-mail if you are interested. Here are some references--the first reference is the journal issue I referred to above (available on Hollis) containing several articles pertinent to this topic, so take a look at as many of them as you wish:

Molecular and Biochemical Parasitology. (2001) v. 118: pp. 127-302.

GlimmerM. http://www.tigr.org/software/glimmerm/.

Huestis R and A. Saul, An algorithm to predict 3' intron splice sites in Plasmodium falciparum genomic sequences. Mol. Biochem. Parasitol. 112 (2001), pp. 71-77.

Myler PJ, et al., Leishmania major Friedlin chromosome 1 has an unusual distribution of protein-coding genes. PNAS 96 (1999), pp. 2902-2906.

Leo Aristotle S. Hizon


Simulation of the recombination of antibody genes, using Perl to predict the amino acid sequences of the variable region of the antibody.

-modeling of semi-random recombination events in B-cells
-translation of nucleotide sequences to amino acids
-finding the constraints in recombination of Ab genes
-predictive modeling of Ab specificity based on the allowed combinations of amino acids
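The first three items above can be sketched in a few lines. The proposal names Perl; Python is used here purely for illustration. The gene segments are made-up placeholders, the trimming and insertion limits are assumptions, and the "productive" test (in frame, no stop codon) is only the most basic constraint on a rearrangement.

```python
import itertools
import random

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate DNA in frame 0; trailing bases that do not fill a codon are dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def recombine(v_segments, d_segments, j_segments, rng, max_trim=3, max_insert=4):
    """Join one randomly chosen V, D and J segment with random junctional changes."""
    v, d, j = (rng.choice(s) for s in (v_segments, d_segments, j_segments))
    v = v[:len(v) - rng.randint(0, max_trim)]  # trim the 3' end of V
    j = j[rng.randint(0, max_trim):]           # trim the 5' end of J
    n1 = "".join(rng.choice(BASES) for _ in range(rng.randint(0, max_insert)))
    n2 = "".join(rng.choice(BASES) for _ in range(rng.randint(0, max_insert)))
    return v + n1 + d + n2 + j

def is_productive(dna):
    """A rearrangement is kept only if it is in frame and contains no stop codon."""
    return len(dna) % 3 == 0 and "*" not in translate(dna)

# Hypothetical placeholder segments, not real immunoglobulin gene sequences.
rng = random.Random(0)
products = [recombine(["ATGGCCTGT"], ["GGGTAC"], ["TTTGGCCAA"], rng) for _ in range(1000)]
productive = [p for p in products if is_productive(p)]
print(len(productive), "of", len(products), "simulated rearrangements are productive")
```

Counting how often the simulated junctions stay productive gives a first, crude view of the constraints on recombination.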

Dynamic programming analysis of the nucleotide and protein sequences of Th2 chemokine receptors and their ligands.

-the evolutionary relationship between Th2 chemokine receptors
-finding sequence homologies and structural homologies via perl
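For reference, the core dynamic programming recurrence for global alignment (Needleman-Wunsch) fits in a short function. The sketch is in Python rather than Perl, and the match, mismatch and gap scores are placeholder assumptions; a real protein comparison would normally use a substitution matrix instead of a single match score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # previous[j] holds the best score for aligning the current prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

print(global_alignment_score("ACGT", "ACT"))
```

Pairwise scores like this one, computed for every pair of receptors, could feed a distance matrix for studying their evolutionary relationship.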

Bob Brady


The Determination of a General Set of Fine-Grained Selection Criteria for the Discovery of siRNA in Humans

It has been reported that small interfering RNA (siRNA) can induce gene silencing in humans.  The small (~ 21 bp) double-stranded siRNA triggers the degradation of mRNA that matches its sequence.  This effect can be used to determine the function of specific genes.

The problem is that a wide variety of unreported or proprietary results have been used to determine fine-grained selection criteria for siRNA discovery.  Fine-grained selection criteria include, but are not limited to: %GC content, position relative to the start codon, and homology parameters from a BLAST search of the proposed siRNA.
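Two of these criteria, %GC content and position relative to the start codon, are easy to compute directly. The sketch below is in Python for illustration; the thresholds (30-70% GC, at least 75 nt downstream of the start codon) are placeholder assumptions rather than published rules, and the BLAST step is not covered.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a DNA or RNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)

def candidate_sirna_targets(cds, length=21, min_gc=30.0, max_gc=70.0, min_offset=75):
    """List (offset, target, gc) for windows of the coding sequence that pass the filters.

    cds is assumed to begin at the A of the start codon, so offset is the
    distance from the start codon. All default thresholds are assumptions.
    """
    cds = cds.upper()
    hits = []
    for offset in range(min_offset, len(cds) - length + 1):
        target = cds[offset:offset + length]
        gc = gc_percent(target)
        if min_gc <= gc <= max_gc:
            hits.append((offset, target, gc))
    return hits

# Hypothetical toy sequence with small parameters so the output is easy to check.
for hit in candidate_sirna_targets("ATGAAAAGCGCGC", length=4, min_gc=50.0,
                                   max_gc=100.0, min_offset=3):
    print(hit)
```

Running the same calculation on siRNA sequences reported in the literature would give the distribution of each parameter among sequences known to work.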

The goal of this study is to review the siRNA sequences and corresponding gene silencing results reported in the literature, calculate the fine-grained selection parameters, and determine a general set of selection criteria.  Not all published results list the actual siRNA sequences.  Where the sequence is not given, the selection method described in the journal article will have to be reimplemented with a Perl script and BLAST queries.

Additionally, Perl and Mathematica will be used to parse the data, calculate results, and visualize data/results.

David Twomey


Project Idea #1

With the mouse genome and other genome sequences now available, one may ask whether the newly available genomic sequence information can at least partially explain gene expression patterns. There has been some work on whether genes that have similar expression patterns also share upstream regulatory sequences.

Project Idea #2

1) There exists much functional knowledge on biochemical and signaling pathways, also on protein interactions.

2) Novel methods, such as gene network reverse engineering, driven directly by gene expression and molecular activity data, can infer functional, regulatory interactions between any of the genes measured, independent of whether previous functional annotation exists.

3) In some cases, reverse engineered networks will make predictions that are concordant with known functional interactions. In other cases, predictions from reverse engineering will go beyond (e.g. functional predictions on novel genes) or contradict current functional knowledge.

4) The case has been made that integrated network models should include components from reverse engineered networks, and known signaling pathways. HOW DO WE MANAGE CASES IN WHICH CONFLICTING, CONTRADICTING OR "SPECULATIVE" FUNCTIONAL PREDICTIONS ARE CONTRIBUTED BY THE VARIOUS INFORMATION SOURCES USED TO BUILD A NETWORK MODEL?

Julian Bonilla


My project idea deals with developing an HMM for determining nucleosome positioning, based on ratiometric data generated by microarray experiments.  I have access to some raw data generated by a biologist at CGR.  I've also worked on developing a browser that displays and annotates the data, but I think this might be too difficult to complete in the time given.  I'd be interested in working with anyone that has a statistics or math background to further develop the model.
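To make the idea concrete, here is a minimal two-state HMM decoded with the Viterbi algorithm. Everything in it is an assumption for illustration: the state names, the discretization of ratios into "high" and "low" symbols, and all probabilities. A real model would more likely use continuous emissions fitted to the array data.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a sequence of discrete symbols."""
    # Log probabilities avoid underflow on long probe series.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    back = []
    for obs in observations[1:]:
        pointers, new_score = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][obs]))
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters: nucleosomal probes tend to give "high" ratios.
STATES = ("nucleosome", "linker")
START = {"nucleosome": 0.5, "linker": 0.5}
TRANS = {"nucleosome": {"nucleosome": 0.9, "linker": 0.1},
         "linker": {"nucleosome": 0.2, "linker": 0.8}}
EMIT = {"nucleosome": {"high": 0.8, "low": 0.2},
        "linker": {"high": 0.2, "low": 0.8}}
print(viterbi(["high", "high", "high", "low", "low", "low"], STATES, START, TRANS, EMIT))
```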

Faraz Waseem


I have a project idea: I want to develop an engine (or program) that predicts protein function from context (a non-homology-based approach). It will be a rule-based program. I have read an article on this topic, and it seems to be a good computational biology problem.

Richard Xu



TNF Receptor Biomining


The tumour necrosis factor (TNF) superfamily has been identified as playing pivotal roles in the organization and function of the immune system [1].  Although 29 TNF receptors have been identified, more are yet to be found in the human genome.  This project aims to take an innovative approach to finding gene sequences that could encode TNF receptors by exploiting several characteristic features of these receptors.  Many existing tools will be used: BLASTX, HMM, TMAP and PFAM, to name a few.  Computational results will be provided to demonstrate the efficacy of this approach.


The whole process will be divided into two big steps: discovery and validation.  

Discovery Phase:

In the discovery phase, we will use the known TNF receptor sequences and take advantage of the cysteine-rich domains present in TNF receptors to build an HMM.  We will then apply the HMM to find all candidate DNA sequences that could code for proteins conforming to this model.  We will then run BLASTX on these candidate genes to obtain the proteins they encode.

Validation Process:

We will apply several criteria to sift out unqualified proteins and identify the most promising candidates.

Comparing Variable Selection Methods for Microarray Classification Models Based on Logistic Regression

An important application of microarray technology is categorizing tissue or cell types based on their gene expression profiles. There are numerous methods available--clustering, self-organizing maps, and neural networks are just a few. Each of these creates a mathematical model using the gene expression levels from a set of "training" arrays. In the case where the tissue or cell samples are associated with a disease, the models can be used to (1) search for key genes involved in the disease process, (2) identify subclasses of the disease, or (3) diagnose the disease.

My lab has been studying the use of logistic regression for classifying microarrays, and we have shown that it can perform as well as (and often better than) other models, with the added advantage that it requires far fewer genes to make accurate predictions about the identity of "test" arrays. The most difficult part of creating a logistic regression model, though, arises from the fact that the number of genes on an array (measured in the thousands) is much greater than the typical number of training arrays (usually just a few dozen). As a result, many different combinations of genes can be used to create a logistic regression model that perfectly fits the training data; however, these sets of genes vary widely in their ability to fit the test data.

The question I hope to answer with my project is: given only the training data, how do you choose the combinations of genes that will most likely fit the test data well? In other words, is it possible to distinguish genes that happen to correlate well with the training data through random fluctuations in expression level from genes whose high correlation with the training data has a real biological basis? I therefore plan to compare several well-known variable selection methods and design some algorithms of my own to determine which set of genes is best to use in a logistic regression microarray classification model.
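As one example of a well-known variable selection method, the sketch below ranks genes by the absolute value of Welch's t statistic between the two classes (a simple "filter" approach). It is an illustration only, not the method used in the lab; the gene names and expression values are hypothetical. To estimate test performance honestly, a ranking like this would have to be recomputed inside each cross-validation fold rather than once on all arrays.

```python
import math
from statistics import mean, variance

def welch_t(values_a, values_b):
    """Welch's t statistic for two groups (each group needs at least two values)."""
    diff = mean(values_a) - mean(values_b)
    se = math.sqrt(variance(values_a) / len(values_a) + variance(values_b) / len(values_b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def rank_genes(expression, labels, top=10):
    """Return the top gene names ordered by decreasing |t|.

    expression: {gene: [value per training array]}, labels: [0 or 1 per array].
    """
    scores = {}
    for gene, values in expression.items():
        group1 = [v for v, y in zip(values, labels) if y == 1]
        group0 = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(group1, group0))
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical training data: geneA separates the classes, geneB does not.
labels = [0, 0, 0, 1, 1, 1]
expression = {"geneA": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
              "geneB": [2.0, 1.0, 3.0, 2.1, 0.9, 3.2]}
print(rank_genes(expression, labels, top=2))
```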

Griffin Weber   weber@fas.harvard.edu


 Using the Index of Coincidence to identify Open Reading Frames


Every language has what is called an Index of Coincidence.  The Index of Coincidence (IC) is defined as the probability that two randomly chosen elements of a string are identical, and it can be calculated from the frequency histogram of the string.  English has an IC of about 0.065, and uniformly random text over a 26-letter alphabet has an IC of about 0.038 (1/26).  Different languages have different Indices of Coincidence, depending on their particular pattern of alphabet use.

Our project will attempt to evaluate the role of the Index of Coincidence in helping to identify open reading frames and to distinguish them from non-coding sequences.  We suspect that, within species (and perhaps, regardless of species) coding and non-coding sequences will exhibit characteristic Indices of Coincidence much in the way that different languages do.  If we can demonstrate that there is a difference between ICs of ORFs and non-ORFs, then this difference can be used to help identify ORFs in unknown sequences.
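The IC follows directly from symbol counts: with n_i occurrences of symbol i in a string of length N, IC = sum of n_i(n_i - 1) divided by N(N - 1). The sketch below computes it for any sequence of symbols, so it can be applied to single bases or to codons; the helper names are hypothetical. Note that for the four-letter DNA alphabet the uniform-random baseline is about 0.25, not 0.038.

```python
from collections import Counter

def index_of_coincidence(symbols):
    """Probability that two distinct positions drawn at random hold the same symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def codons(sequence, frame=0):
    """Split a DNA sequence into complete codons starting at the given frame."""
    sequence = sequence.upper()
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]

# Hypothetical toy sequence: the IC can be taken over bases or over codons.
seq = "ATGGCCGCCGCCTAA"
print(index_of_coincidence(seq))
print(index_of_coincidence(codons(seq)))
```

Comparing the two values for known ORFs against the same values for intergenic sequence would be a first test of the hypothesis.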


Jeanhee Chung   jachung@attbi.com                  Thomas Lasko  tlasko@mit.edu

Transcriptional control mediated by cleansing of short sequences from gene regulatory regions

Differentiation of cells and their responses to stimuli are in large part made possible by tight regulation of gene expression. This control is executed primarily at the level of transcription initiation. The current theory describes trans-acting transcription factors binding to cis-regulatory sequences within or adjacent to genes as a primary mode of regulation of expression. A combination of many different cis binding sites located close to a gene would explain the complexity of transcriptional responses to stimuli, and co-expressed genes should in principle share similar patterns of transcription factor binding sites.

Computational methods for analysis of transcriptional regulation rely frequently on annotation of regulatory elements located in proximity of the studied genes and comparison of arrangements of these elements between co-expressed genes or between homologous genes among various species.

We speculate that the absence of or negative bias towards specific sequences in the regulatory regions of co-expressed genes might add another degree of regulation. Sequences cleansed from regulatory regions of co-expressed genes might serve as “disruption sites”. Disruption of transcription might be achieved in many ways, e.g. by binding proteins that would make spatial arrangement of other trans-acting factors impossible, by binding short silencing RNA sequences or by changing unfavorably the local conformation of DNA strands.
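A first pass at finding such "cleansed" sequences could simply list the k-mers that never occur in the regulatory regions of a co-expressed gene set but are common in the regulatory regions of other genes. The sketch below does that; the function names, the 50% background threshold and the toy sequences are assumptions, and a real analysis would need a statistical test that accounts for region length and base composition.

```python
from collections import Counter
from itertools import product

def kmer_presence(sequences, k):
    """Count how many of the given regions contain each k-mer at least once."""
    presence = Counter()
    for seq in sequences:
        seq = seq.upper()
        presence.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return presence

def cleansed_kmers(coexpressed, background, k, min_background_fraction=0.5):
    """k-mers absent from every co-expressed region yet common in background regions."""
    in_set = kmer_presence(coexpressed, k)
    in_background = kmer_presence(background, k)
    threshold = min_background_fraction * len(background)
    return ["".join(p) for p in product("ACGT", repeat=k)
            if in_set["".join(p)] == 0 and in_background["".join(p)] >= threshold]

# Hypothetical toy regions: CG is common in the background but absent from the set.
print(cleansed_kmers(["AAAA", "AAAC"], ["ACGT", "ACGA", "TTTT"], k=2))
```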

Pawel Wolkow   Pawel.Wolkow@joslin.harvard.edu                  J Singh   J.Singh@FMR.com

Rolf Hanson


I'm interested in issues dealing with retroelements, viruses and genome evolution - how the study of retroviruses can be used to learn more about the coevolution of retroviruses and their human hosts. This is kind of broad and fuzzy, and I am looking for a specific problem that would be appropriate for a course project. One idea would be to look for retroelements in genome databases using algorithms based on the structure of the retroelements, rather than homology.

That said, I am more of a hacker and unfortunately would probably be more excited about creating a cool piece of software or computer graphics, rather than demystifying the mechanisms of human evolution.  I am good with Perl, Python, C, Java, 3-D graphics (OpenGL), Macintosh programming, Unix, etc. I work at Children's Hospital (www.chip.org) with some guys who wrote a book about microarrays, so I potentially have access to people who know a lot about bioinformatics.

Anna Mallikarjunan


My fairly non-existent biological background is making it hard for me to choose a problem that will be both interesting and relevant.

However, one area I am interested in is to develop a partial software solution (dependent on the time constraints) that provides a visual interface to nucleotide mutations. I am looking to biologists to suggest what they would want to visualize when analyzing nucleotide mutations.

The skills I can bring to a team are experience in a variety of software platforms and programming paradigms.

Matt Paschke



I have an AB in Computer Science and have done a fair bit of programming, including a good bit of programming in perl and related languages.  I am also in the middle of taking a molecular biology class, a genetics class and an organic chemistry class.

My interests probably fall into two broad categories.  The first would be the problem of gene location -- finding where genes are in the vast amount of collected genomic data.  The second would be data representation -- how to represent genetic data so that searching and sequence alignment become more efficient.  Spending hours waiting for a BLAST search is still too slow.  These are just broad interests.  I would love to work with anyone who wants to work at the interface of CS and biology.

Daniel Rosenband


I'm a computer science graduate student at MIT.  My research interests are in supercomputing, computer architecture, and hardware design.  I've only taken introductory undergrad. biology, so I'm looking to be part of a team that consists of at least one or two other people with a strong biology background. The type of project I would like to work on is one that involves some aspect of high-performance computing -- novel algorithms to take advantage of large machines, simulating a complicated biological process, or finding a biological problem that dedicated high-performance hardware could cost-effectively solve. 

My office and apt. are on the MIT campus, so I can easily meet with people to work on the project either at Harvard or MIT.

Atif Khan


I have interests in two directions

1) Machine learning approaches (in particular neural networks) to predict RNA and protein secondary structure.

2) Information theory and evolution - in particular exploring the relation between the theory of error correcting codes, signal/noise propagation and evolutionary constructs like mutation, selection, drift etc. The idea here would be to understand how evolution preserves "informational complexity".

Dan O'Brien 


I have 3 ideas for projects.

1. Looking at mutant p53 in clam leukemia cells for homologs


2a. To attempt to find a correlation between physical cell stress or size and gene expression. There is ongoing work on growing cells on a scaffold or grating that can be expanded to place the cells under tension.  Heart cells grown on micro pegs exhibit electrical properties different from cells grown in a culture medium. I have no idea how I'll get the data.

2b. Cells grow and shrink in size during the cell cycle. Is it possible to correlate this expansion and contraction with up-regulation of genes?

The data for this could come from the paper that we read this week.

3. Is DNA used as a framework or building block to increase cell size in single cell organisms?  Is it possible to correlate cell size and amount of DNA?



Heta Ray 


Skills: Java, Databases, Object Modeling, Machine Learning and Data Analysis

Genomic and proteomic approaches can provide hypotheses concerning function for the large number of genes predicted from genome sequences. Due to the artificial nature of the assays, however, the information from these high-throughput approaches should be considered with caution. Although it is possible that more meaningful hypotheses could be formulated by integrating the data from various functional genomic and proteomic projects, it has yet to be seen to what extent the data can be correlated and how such integration can be achieved.   I would like to correlate mRNA abundance with the presence or absence of proteins.  This correlation can be used to improve the quality of hypotheses based on the information from both approaches.  A training set and a test set would be created, and we could use machine learning (neural networks, Bayesian methods, etc.) to predict the outcome.

Another biological problem could be identifying genes responsible for human diseases by combining information about gene position with clues about biological function.  The recent availability of whole-genome sets of RNA and protein expression data provides powerful new functional insights.  These data sets could be used to expedite disease gene discovery: we could assign a 'score' to each gene, based on the similarity of its RNA expression profile to those of known mitochondrial genes.  Using a large survey of organellar proteomics, genes can be classified according to the likelihood of their protein product being associated with the mitochondria.  The intersection of this information could narrow down the search for possible gene candidates.

Joe Weber



Identification of Potential Transcriptional Regulatory Elements by Comparison of Human and Pufferfish Genomic Sequences.


BIOL E-101 Project Proposal by Joe Weber, 11/5/02



The goal of this project is to identify potential transcriptional regulatory elements that are likely to play a conserved role in vertebrate body patterning.  To accomplish this goal, the amino acid sequences of human genes thought to be important for body patterning will be used to search the pufferfish (Fugu rubripes) genome for closely related genes.  The 5’ flanking and intronic sequences of likely homologs will be compared to identify clusters of conserved transcription factor binding sites that may serve as promoters, enhancers, or silencers.

       There are two main reasons for choosing the pufferfish and human genomes for this project.  First, nearly all of each genome has been sequenced and is publicly available.  Second, these two species are separated by approximately 450 million years of evolution.  Over such a great period of time, the non-coding regions of a gene (promoter and introns) should have undergone extensive mutations.  Therefore, the only sequences that are likely to be conserved are sequences that play a critical  functional role, such as important transcriptional regulatory elements.


Here is a summary of how I will proceed with this project:


Step 1: Create a list of human genes for which there is published evidence indicating that the gene product plays a role in body patterning.  In many cases, the experimental evidence will be from homologous proteins in animals commonly used for embryological studies, such as mouse, Xenopus, and Zebrafish.  This list should include at least a few dozen genes, since part of the project goal is to see how common (or uncommon) it is to have highly similar promoter/enhancer sequences between human and pufferfish homologs.  This list will include well studied transcription factors and signaling molecules such as members of the Zic, Gli, Sox, BMP, Nodal, and Wnt families.


Step 2: Use the human amino acid sequences to conduct BLAST searches against the pufferfish genome (http://genome.jgi-psf.org/fugu6/fugu6.home.html) and download the genomic sequences of likely homologs.


Step 3: Search for open reading frames and intron-exon splice site consensus sequences in order to verify that a genomic sequence found by the BLAST search is likely to code for a real protein.  The pufferfish genome web site has a listing of more than 30,000 predicted protein sequences that will be very helpful.  The predicted amino acid sequence for a pufferfish protein will then be aligned to the human sequence used for the BLAST search in order to determine if it is similar enough to be a likely homolog.
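The open reading frame part of Step 3 can be sketched as a six-frame scan for ATG-to-stop stretches. This is a simplified illustration: the function names and the minimum length are assumptions, minus-strand coordinates refer to the reverse-complemented sequence, and in genomic DNA an ORF found this way will usually cover only part of a gene because introns interrupt the reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Reverse complement of an unambiguous DNA sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_codons=30):
    """Return (strand, start, end, sequence) for each ATG...stop ORF in all six frames.

    min_codons counts the stop codon. Coordinates on the "-" strand are
    positions in the reverse-complemented sequence.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, seq[start:i + 3]))
                    start = None
    return orfs

# Hypothetical toy sequence with one short ORF on the forward strand.
print(find_orfs("CCATGAAATAGCC", min_codons=3))
```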


Step 4: Use a Smith-Waterman local alignment to search for regions of high similarity in the promoter and intron sequences of likely human-fish homologs, and use the MatInspector program to search for potential transcription factor binding sites based on matrices from the TRANSFAC database.
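For readers unfamiliar with it, Smith-Waterman local alignment is a small dynamic programming routine; a minimal version with a linear gap penalty is sketched below. The scoring values are placeholder assumptions, and comparing long introns in practice would call for an existing optimized implementation rather than this quadratic-memory sketch.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Hypothetical toy sequences sharing one conserved block.
print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```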


Step 5: The final product of the steps above would be a set of gene maps and tables listing potential transcription factor binding sites  that appear to be phylogenetically conserved.  It might be possible to confirm at least some of these predictions by doing a thorough search of the literature to see if any of these elements have already been identified by empirical methods such as protein-DNA binding assays and promoter-reporter gene assays.


Reasons to think that this approach will be productive:  I recently worked on a project where I cloned the Xenopus (frog) gene for Zic3, a transcription factor involved in vertebrate body patterning.  When I compared the frog and human Zic3 sequences, I found a 120 bp sequence in the middle of the first intron that had 82% identity between species.  This similarity was quite striking, since the rest of the intron had only about 22% identity.  Because frogs and humans are separated by more than 300 million years of evolution, I thought that this conserved sequence was very likely to be a functional regulatory element.  I tested this hypothesis using a variety of promoter-reporter gene assays  in Xenopus embryos, and found that the conserved sequence was a transcriptional enhancer that responded to the activin/nodal-related signaling pathway, which is known to induce the endogenous Zic3 gene.

       I would like to see how common this kind of conserved regulatory region is between homologous genes in distantly related vertebrate genomes.  Unfortunately, there is very little genomic sequence available for Xenopus.  However, the availability of the nearly complete pufferfish genome sequence, and the 450 million years that separate humans and pufferfish, make fish-human comparisons an attractive approach for studying conservation of transcriptional regulatory elements.


Note:  I currently plan to carry out this project using existing software tools.  However, what might be an interesting related project for someone with a stronger computer science  background than I have is as follows:


The process of taking a human protein sequence and BLASTing it against the pufferfish genome is quite fast.  However, extracting the relevant pufferfish genomic sequence and annotating its complete exon-intron structure can be quite tedious, especially if this process is going to be repeated with a large number of genes.  I suspect that a great deal of this process could be automated, but I don’t know of an available program that does all of what I would like it to do.

       It would be great to have a program that a researcher could just give as input a known protein sequence from one species and a genomic sequence from another species (selected based on a BLAST search), and then have the program map out the best fit homolog it can find in the genomic sequence.  The output of such a program would include the predicted amino acid sequence of the potential homolog and its percent identity with the input protein sequence.  It would also include a table or map of the predicted genomic exon-intron structure with position numbers.  Predicting exon-intron splice sites a priori is usually quite difficult, because there is a great deal of flexibility in the splice site consensus sequences. However, in this case a comparison of the input amino acid sequence with all of the open reading frames of the genomic sequence should narrow down the search space by quite a bit.  Finally, it would be very useful if the program would copy the non-coding sequences (5’ flanking region and introns) to separate files, so that they can then be used in searches for transcriptional regulatory sequences.


Gregory Minevich



I would like to investigate one of the major pieces of evidence for the neutral theory of molecular evolution.  Advocated in part by Motoo Kimura, this idea states that most evolution at the molecular level is not the result of natural selection, but rather the result of random genetic drift.  In other words, most molecular variation in DNA and protein composition has no influence on the selective fitness of the organism: it is selectively neutral.

Molecular evolution and morphological evolution are quite independent of one another, in that variation on the morphological level has been definitively demonstrated to have an influence on the fitness of the phenotype.  Ridley (p173) gives the example of the living fossil shark Heterodontus portusjacksoni, a species that closely resembles its fossil ancestors from 300 million years ago.  Though the rates of molecular evolution in humans and this shark have been roughly equal over the past 300 million years, their rates of morphological evolution have been astonishingly different.  Whereas the shark closely resembles its ancestor from that time, humans have evolved from fish-like ancestors and have passed through amphibian, reptilian and mammalian stages.

According to Ridley (p151), 4 main types of observation have been used to determine whether natural selection or neutral drift drives molecular evolution.  These are:

1) The rate of evolution and the magnitude of polymorphism
2) The constancy of molecular evolution (known as the molecular clock)
3) The relation between functional constraint on molecules and their rates of evolution
4) The relation between polymorphism and evolutionary rate in different molecules (or parts of molecules)

Of these four observations or tests, I would like to single out the constancy of molecular evolution for investigation.  Whereas the clock for protein evolution runs relative to absolute time, the clock for silent base changes in DNA (changes that do not result in an amino acid change in the protein) runs relative to generation time.  According to Ridley, selection likely gives a better explanation for the case of protein evolution, whereas neutral drift better explains silent base changes in DNA.

Overall Goal:

Does Neutral Drift or Natural Selection Better Explain the Molecular Clock?

Steps Along the Way:

* Investigate whether the latest research still has the protein clock running relative to absolute time and the silent-site clock running relative to generation time.

* Ridley (p178) suggests an argument for how natural selection can explain why the protein molecular clock keeps absolute, rather than generational, time.  The argument supposes 2 organisms with different generation times, "both separately evolving in relation to changes in parasites with short generation times".  I would like to model this argument using a combination of Perl and Mathematica, where I could experiment with changing such variables as mutation rates (of organisms and parasites), selection variables, and generation times (of organisms and parasites).  I would then compare the results of this modeling experiment with the latest research findings to achieve the overall goal above.

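As a baseline for such a model, the sketch below covers only the neutral case: under neutral theory the substitution rate per generation equals the mutation rate, so two lineages with the same per-generation mutation rate but different generation times accumulate different numbers of substitutions per year. It is written in Python for illustration rather than Perl or Mathematica, all parameter values are hypothetical, and the parasite-driven selection argument is not modeled.

```python
import random

def expected_neutral_substitutions(years, generation_time, mu, sites):
    """Expected substitutions when the neutral rate is mu per site per generation."""
    return mu * sites * (years / generation_time)

def simulate_neutral_substitutions(years, generation_time, mu, sites, rng):
    """Draw one outcome: each site in each generation fixes a change with probability mu.

    Simplifying assumption: a neutral mutation destined to fix is counted in
    the generation in which it arises.
    """
    generations = int(years / generation_time)
    return sum(1 for _ in range(generations * sites) if rng.random() < mu)

# Hypothetical lineages that differ only in generation time (1 year vs 10 years).
rng = random.Random(1)
for name, generation_time in (("short-lived", 1), ("long-lived", 10)):
    simulated = simulate_neutral_substitutions(2000, generation_time, 1e-3, 50, rng)
    expected = expected_neutral_substitutions(2000, generation_time, 1e-3, 50)
    print(name, simulated, expected)
```

A clock that keeps absolute time would show similar counts for both lineages, so the gap between these two outputs is what the selection-based argument has to explain away.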
Gould, S.J. 2002 The Structure of Evolutionary Theory
Kimura, M. 1968 Evolutionary Rate at the Molecular Level. Nature 217:624-626
Dawkins, R. 1986 The Blind Watchmaker
Ridley, Mark. 1996 Evolution

Some Further Reading:

Bulmer, M 1988 Evolutionary Aspects of Protein Synthesis. Oxford Surv. Evol. Biol. 5:1-40
Bulmer, M 1988 Estimating the Variability of Substitution Rates. Genetics 123:615-619
Gillespie, J.H. 1991 The Causes of Molecular
Gillespie, J.H. 1993 Episodic Evolution of RNA Viruses. Proc. Nat. Acad. Sci. USA 90:10411-10422
Nichol, et al 1993 Punctuated Equilibrium and Positive Darwinian Evolution in Vesicular Stomatitis Virus. Proc. Nat. Acad. Sci. USA 90:10424-10428
Ohta, T. 1992 The Nearly Neutral Theory of Molecular Evolution. Ann. Rev. Ecol. System. 23:263-286
Ohta, T. 1993 An Examination of the Generations-Time Effect on Molecular Evolution. Proc. Nat. Acad. Sci. USA 90:10676-10680



BioPhysics 101

Term Project Proposal

Overlaying Clustering Results from PCA with Clustering Results from Self-Organizing Maps



Group members:

Amy Chang 

Shixin Zhang

Yan Wang





(Note: each group member submitted the same project proposal to his/her TF)

For our class project, we chose to look at “Class and pattern discovery in microarray data”.  Specifically, our problem definition is as follows:


We are given:

1)      DNA microarray data containing the expression profiles for 97% of the known or predicted genes of Saccharomyces cerevisiae. The microarray data measure changes in the concentrations of the RNA transcripts from each gene at seven successive intervals after transfer of wild-type (strain SK1) diploid yeast cells to a nitrogen-deficient medium that induces sporulation.  This dataset comes from a published paper: The Transcriptional Program of Sporulation in Budding Yeast, S. Chu et al.[i]  The dataset can be downloaded from http://cmgm.stanford.edu/pbrown/sporulation.

2)      The results of the analysis by S. Chu et al., who report that “At least seven distinct temporal patterns of induction were observed”.  Their conclusion comes from a clustering analysis of the data using self-organizing maps.


Our challenge:

1)      Perform principal component analysis on the dataset from S. Chu et al

2)      See if