Affymetrix GeneChip
Microarray Data Analysis
Using Microsoft Excel, Spotfire DecisionSite, Partek Pro, and Affymetrix GCOS
Boston University
Microarray Resource
May 2004
Data Pre-processing
Introduction
Following Affymetrix GeneChip microarray hybridization and scanning there are a few things that need to be done to create a data set that is ready for analysis. These include image quantification, normalization, and annotation.
Image Quantification
For Affymetrix GeneChips image quantification is performed using GeneChip Operating System software (GCOS currently v1.1). Starting with a scanned image GCOS determines the intensity of each 25mer probe on the GeneChip. Transcript expression levels are calculated using the intensities of the set of probes for each gene. This procedure is described in detail in the Affymetrix Statistical Algorithms Reference Guide. https://www.affymetrix.com/support/technical/technotes/statistical_reference_guide.pdf
Quality Control
Following image quantification it is advisable to take a look at a few quality control metrics that are generated for each chip. These can be viewed through the Analysis Info tab within the CGOS expression analysis view.
Ideally these quality control metrics will fall within a small range for all of the chips that will be analyzed together as one experiment. Similar quality control metrics indicate that the chips are technically similar, while divergent or outlying quality control metrics indicate technical difficulties that will impact downstream analyses. The most important of these quality control metrics are the SF (scaling factor), RawQ, and the 3’/5’ ratios for control transcripts. A control oligonucleotide is spiked into each hybridization mix. Matching and anti-matching probes for this oligo are arranged along the perimeter and corners of each array in a checkerboard pattern. The ratio of corner+ to corner-, which is usually greater than 100, is also an indication of the quality of the hybridization.
It may additionally be worth looking at the scanned image by opening the .DAT file. It is often possible to observe technical variability and outliers by looking at the raw images. Arrays may have physical blemishes that obscure a large portion of the array surface. In general it is not necessary to worry about blemishes that cover less than 5% of the array surface. The gridding algorithm implemented in GCOS has not been observed to fail to align the grid properly for a quality hybridization. Nevertheless, it is worth checking the grid alignment for each array, though manual alignment should not be done without proper justification.
Occasionally arrays suffer from a technical problem termed “high even background”. These arrays have a high signal on the vast majority of probes, including mismatch sequences. They can be identified by high levels of hybridization to the corner- probes. These arrays have a high RawQ score and a low signal to noise ratio for the corner+ / corner- comparison. It is seldom possible to compare data collected on high even background chips to higher quality data and these samples should be re-hybridized.
Chip to Chip Normalization
Following within-chip image quantification, it is necessary to normalize the data across chips in order to make measurements made on different chips as comparable as possible. There are a number of different normalization methods, but in general more complex methods will do a better job of normalization at the risk of over fitting. Furthermore, as more samples are added to a microarray data-set, individual chip to chip differences become less important. This will hopefully obviate the need for complex normalization. In general the Microarray Resource uses the simplest normalization method, linear scaling.
Linear Scaling
In linear scaling, the intensity of each gene on a chip is multiplied by a constant such that the average intensity of all the genes on that chip is scaled to a predetermined target. This normalization is performed within CCOS. In GCOS the highest 1% and lowest 1% of intensity values are ignored. The default target signal in GCOS is 500 units. The scaling factor reported for each chip is the value that each chip’s intensities were multiplied by to attain the target signal.
Quantile Normalization
In quantile normalization the absolute expression level of a gene is a function of its relative rank within the array. In quantile normalization, the intensity of each gene is ranked within each array. The average intensity of each rank across all arrays is then calculated. Finally, on each chip, the intensity of each gene is normalized to the average intensity of the gene of that rank across all chips. In the microarray core we sometimes use quantile normalization in order to compare chips with very different scaling factors.
It is usually easiest to perform quantile normalization in MS Excel.
1) Create two sheets, each containing the data to be quantile normalized
2) Sort each column of the first sheet individually by intensity in descending order.
3) Take the average of each row in the sorted sheet, generating the average intensity of each transcript of each rank.
4) Replace the intensities in the second sheet with the intensities of the average transcript of that rank. This is done using the RANK and VLOOKUP function. The calculation for the VLOOKUP function can take 5-10 minutes for each chip being normalized.
5) There are faster ways to perform quantile normalization in Excel, but they are more labor intensive.
Loess Normalization
In loess normalization, the intensity of the genes on a chip are normalized based on the local mean of signal intensities. In the microarray core we have not used loess normalization.
- Scatter plots
- Good / Bad Scatter plots
- The effects of normalization on a scatter plot
Gene to Gene Normalization
The normalization methods that have already been discussed make chips comparable. Other normalization techniques are sometimes necessary to make different transcripts comparable on the same scale. These methods are generally used prior to clustering, principal components analysis, and types of analysis. These normalization methods can all be implemented in Spotfire or Excel.
Log Ratio Normalization
In Log Ratio Normalization, the level of a transcript is expressed as the fold change, from baseline, in log scale. The geometric mean intensity of a transcript across all of the samples should be used as the baseline for log ratio normalization because this results in the average log ratio for a transcript across all of the samples being equal to zero.
In order to perform log ratio normalization in Spotfire, in is necessary to calculate the geometric mean of each transcript across all of the chips being normalized in MS Excel. This column needs to be imported into Spotfire along with the rest of the data and will serve as the baseline for normalization. Log ratio normalization can be performed in Spotfire by selecting DecisionSite Navigator, Tools, Analysis, Data Preparation, Normalization, Log Ratio Normalization. The choice of the base for the logarithm is arbitrary, Spotfire uses 10.
Z-Score Normalization
In Z-Score Normalization, the level of a transcript is expressed as the number of standard deviations above (positive) or below (negative) the arithmetic mean of the transcript across all of the samples. The average z-score for a transcript across all of the samples is equal to zero and the standard deviation of z-scores across all of the samples is 1.
Z-score normalization will not affect the statistical significance of differential expression as calculated by a standard ANOVA. It will also not affect the Pearson correlation coefficient calculated between two transcripts.
Z-score normalization is easy to do in Spotfire. Select DecisionSite Navigator, Tools, Analysis, Data Preparation, Normalization, Z-score Normalization.
Annotation
Introduction
In order to effectively analyze microarray data, it is critical for investigators to have access to complete and up-to-date annotation of the genes on the array. At the Microarray Resource we get our annotation information from two primary sources, though there are a few others that are worth mentioning.
NetAffx
Affymetrix maintains the NetAffx database containing information about the genes that are contained on their GeneChip microarrays. This is the best first source of information about Affymetrix probe sets because each probe set has a unique page in the NetAffx database containing a broad range of information including gene and probe sequences, links to other databases, and functional descriptions of the genes.
The full NetAffx data file for each type of array can be downloaded from the Affymetrix Website. The links for these files can be accessed from the main page by clicking on Products and then GeneChip Arrays. Select the chip of interest and then the link on the far right for “All Support Materials for ????? Array”. Select the CSV file from the NetAffx Annotation Files category. Enter your e-mail address and password for Affymetrix.com and the file will begin to download. Unzip the downloaded file and open it directly in MS Excel.
Incyte Proteome Database
The Incyte Proteome BioKnowledge Library [Link] is now available for access by all current Boston University and Boston University Medical Center faculty, staff, and students. This is an excellent database for finding information about genes from microarray experiments. It is well curated and provides Pubmed links for all references. This database is indexed by gene symbol.
The following URL will take you to the Proteome Database page for human VEGF;
https://www.incyte.com/proteome/HumanPD/VEGF.html
You can link to the Incyte Proteome Database in Excel using the formula;
=HYPERLINK("https://www.incyte.com/proteome/desireddatabase/genesymbol.html","L")
Identifying Differentially Expressed Genes
Introduction
With microarray data, biology researchers want to identify genes that are differentially expressed between different growth conditions or different treatments.
Fold Change
The most straightforward method of identifying differentially regulated genes in a microarray experiment is by fold change. Fold change is the multiple by which the expression of a gene changed between two experimental groups.
Fold change can be reported using various scales that convey the same information. For example a 4 fold up-regulation and a 4 fold down-regulation;
Linear: -4, 4
Ratio: ¼, 4
Log Ratio base 2: -2, 2
Log Ratio base 10: -0.602, 0.602
Fold Change is usually calculated using the mean of a set of measurements within an experimental group, but it can also be calculated using the geometric mean, particularly if the original measurements were not converted to logarithmic scale.
While Fold Change is an important descriptor of the behavior of a gene’s expression between two experimental groups, it does not tell the whole story. For example take the expression of one gene measured 4 times in each of two experimental groups.
Group A: 100, 200, 200, 300 Mean = 200
Group B: 100, 100, 200, 2800 Mean = 800
Fold Change = 4
According to Fold Change this is a differentially regulated gene while we can see that Group B is not reproducibly upregulated 4 fold. Furthermore, transcripts that are expressed at a low level are more likely to have a large fold change as a result of experimental error. Consequently, Fold Change should not be used as a first pass method for identifying differentially expressed genes.
Statistical Significance
A better method for identifying differentially regulated genes is provided by statistics. Analysis of Variance (ANOVA) is a technique that assesses whether a set of measurements from two or more experimental groups indicates, given observed variance, that the groups are different. For microarrays the measurements are the expression levels of one transcript and the groups correspond to the experimental sample groups. ANOVA is used to identify transcripts that are differentially expressed in a manner that is reproducible across multiple measurements within each experimental group.
An ANOVA score is calculated by comparing the variance observed between the sample group means to the variance observed within the groups. If the between group variance is high relative to the within group variance this indicates differential expression. The result of an ANOVA is a probability, p, that an observed difference between groups could have been produced by chance if the groups were in fact the same.
Following the use of ANOVA to calculate a p-value for each gene it is useful to choose a p-value cut-off, below which genes will be considered differentially expressed, and above which genes will not be considered differentially expressed. This cutoff will be arbitrary, but its’ choice should be made with an understanding of the trade-offs between sensitivity and selectivity that are inherent to choosing a significance cut-off. In general, choosing a lower significance cut-off will result in fewer genes being identified as differentially expressed, but a smaller portion of those that are selected will be false-positives. Choosing a higher significance cut-off will result in more genes being identified as differentially expressed, but a greater portion of those will be false-positives. At any significance cut-off it is possible to estimate the associated false-positive and false-negative rates. This allows an informed choice of the significance cut-off
ANOVA can take a few different forms depending on the experimental design. The most basic type of ANOVA is a one-way ANOVA. In a one-way ANOVA, the sample groups are stratified along a single experimental variable. The simplest one-way ANOVA, with two sample groups, is equivalent to a T-Test. The result of an ANOVA comparing more than two groups is the probability that any one of the groups is significantly different from the rest. At the Microarray Resource we perform one-way as well as multiple-factor ANOVA. Multiple-factor ANOVA differs from one-way ANOVA in that it generates p-value scores for more than one experimental factor as well as scores for each interaction between factors.
1 – Control / Experimental treatment
2 – Baseline (t=0) / Timepoint 1 / Timepoint 2
3 – Rosenberg cancer with patient affect
Two factor Anova
4 –
Add Figures
The easiest way to perform ANOVA on a microarray dataset is to use the software package Partek Pro. To import a microarray dataset into Partek, save it as a .txt file in Excel. Choose “import a .txt file” under the File menu in Partek. Select the .txt file and follow the import tool’s commands to import the dataset into Partek. To perform ANOVA on this dataset in Partek, transpose the data using the transpose command in the Transform menu in Partek. Insert columns into the transposed dataset that will describe the experimental factors associated with each sample. Name those inserted column and change their variable type to categorical. Enter values into those columns describing what experimental group each sample is in. Choose ANOVA under the Stat menu in Partek. Select the grouping variables, view the model, and run the ANOVA. ANOVA results can be exported from Partek by saving the results as a .csv file, which can then be opened in Excel.
Multiple Factor ANOVA?
Log transform in regard to anova
Normality distributiuon assumpotions
Multiple Hypothesis Testing
Correction of significance results for multiple hypothesis testing is an important concern in microarray data analysis. It is common to use a p-value cut-off of 0.05. In a microarray experiment in which 20,000 genes are measured, even if no genes are truly differentially expressed, 1,000 genes can be expected to meet the p < 0.05 significance cut-off by chance alone. Furthermore, in the same 20,000 gene experiment with no changed genes, one unchanged gene would be expected to have a p-value as low as 0.00005.
A statistic test, like ANOVA, applied to microarray data tells you the probability that the observations made about a single gene could have been made if the null hypothesis, that the gene is not significantly changed, were true. When applied to normally distributed random data, p-values will be evenly distributed between 0 and 1. Thus, when looking at a single gene, a very low p-value is a significant finding, but as you increase the number of genes observed, the chance of finding a single very low p-value increases. If we were to use a microarray to observe the expression of a single gene, we can use p-value cut-off of 0.05 and control false positives at a rate of 5%, but, unfortunately from a statistical standpoint, testing as many hypotheses as there are transcripts on a microarray gives plenty of chances to make a mistake.
There are a few different methods for dealing with multiple hypothesis testing in significance analysis of microarray data. The Bonferroni correction multiplies the significance observed for each hypothesis by the number of hypotheses being tested. The Bonferroni correction is usually overly stringent for microarray data analysis. If we use a Bonferroni corrected p-value cutoff of 0.05 on a real microarray data set, no matter how many genes meet the significance cut-off, there will be a 5% chance that a single false-positive will be among them. If we identify 100 genes that are differentially expressed in an experiment, we would likely be willing to accept a few false-positives among the 100. The Bonferroni criteria that there is only a 5% chance that a single false-positive is among the 100 is more control of false-positives than is usually necessary. Increasing selectivity using the Bonferroni correction reduces sensitivity, so fewer differentially regulated genes will be identified.
Another method for treating the multiple hypothesis problem makes more sense for microarray experiments. The False Discovery Rate (FDR) correction of Benjamini and Hochberg estimates the gene-wise false-positive rate among the genes at a significance cut-off. The FDR is the quotient of the number of unchanged genes expected at a given significance cutoff over the number of genes detected at that significance cutoff.
Add a formula including p and number of transcripts on chip
The assumption that unchanged genes would have p-values evenly distributed between 0 and 1 can be used to estimate the number of false-positives expected at a given significance cut-off. The number of false-positives expected at a given significance cut-off will be equal to the number of unchanged genes (or the number of genes on the microarray) times the p-value of the significance cut-off.
Describe how to calculate FDR in Excel
Estimating the number of changed and unchanged genes
Based on two assumptions, it is possible to estimate the number of changed and unchanged genes in a microarray data set. The first assumption is that unchanged genes will have p-values evenly distributed between 0 and 1. The second assumption is that changed genes will not have p-values greater than a certain p-value threshold.
If there are no changed genes with p greater than the threshold then all of the genes with p greater than the threshold are unchanged. If the unchanged genes have evenly distributed p-values, then the density of unchanged genes above the threshold will be the same as the density of unchanged genes below the threshold. So, we calculate the density of unchanged genes above the threshold, and integrate this constant density from p equals 0 to 1.
Bayesian Statistic Methods in Identification of Differentially Expressed Transcripts
In the ANOVA techniques that have already been described, each transcript is treated independently. A separate ANOVA is performed for each transcript and the significance for a given transcript does not depend on the data for any other transcripts. It is possible to improve the results of ANOVA analysis of a microarray data set by considering the data for every transcript when assessing the statistical significance of each transcript.
The following test was shamelessly lifted from Baldi & Hatfield, “DNA Microarrays and Gene Expression – From Experiments to Data Analysis and Modeling”, pp 97-134.
Many experimental designs and applications of gene expression profiling experiments are possible. However, no matter what the purpose of the experiment, a sufficient number of measurements must be obtained for statistical analysis of the data, either through multiple measurements of homogeneous samples (replication) or multiple sample measurements (e.g., across time or subjects). This is basically because each gene expression profiling experiment results in the measurement of the expression levels of thousands of genes. In such a high-dimensional experiment, many genes [transcripts] will show large changes in expression levels between two experimental conditions without being significant. These false positives arise from chance occurrences caused by uncontrolled biological variance as well as experimental and measurement errors.
Although these is no substitute for experimental replication, your confidence in the interpretation of DNA array data with a low number of replicates can be improved by using a Bayesian statistical approach which employs methods for robust estimate of variance. In this approach, the estimate of the variance of the expression level of a gene is improved by including the variance of additional genes in the neighborhood of the gene under consideration. In the implementation employed here, the neighborhood is defined in terms of genes with similar expression levels. More specifically, the variance of any gene within any given treatment can be estimated by the weighted average of the empirical variance of the gene calculated across experimental replicates and a gene-specific background variance estimated by pooling neighboring genes contained within a window of similar expression levels. Here the weight given to the background variance in a decreasing function of the number of experimental replicates. This leads to the desirable property that as additional replicates are performed, the plain and regularized t-test converge on a common set of differentially expressed genes.
Many of the genes identified by the simple t-test that are excluded by the Bayesian approach are genes that show small fold changes. In general, these genes with small fold changes identified by the simple t-test are associated with “too good to be true” within-treatment variance estimates, reflecting underestimates of the within-treatment variance when the number of observations is small. The elimination of this source of false positives by the Bayesian approach improves the identification of true positives.
There are currently two online tools that allow implementation of an ANOVA with a Bayesian estimate of the within-treatment variance;
Cyber-T (http://visitor.ics.uci.edu/genex/cybert/)
NIA Array Analysis ANOVA Tool (http://lgsun.grc.nia.nih.gov/ANOVA/)
Cyber-T is limited to a two group comparison, while the NIA ANOVA Tool can perform multiple group comparisons that are stratified along a single factor. Each of these online tools requires uploading the dataset as a tab or coma delimited file. Excel can save a dataset as either tab or coma delimited for easy upload. Each of these tools is relatively easy to use and has instructions online.
Affymetrix Present / Marginal / Absent Calls
The Affymetrix software GCOS generates a signal intensity for each transcript. It also generates a signal detection call ( P - Present, M - Marginal, or A – Absent ) for each transcript. This score describes the confidence with which a sequence specific signal could be detected for each transcript. Since the Affymetrix GeneChips have both perfect match (PM) and single-base mismatch (MM) probes for each transcript, the ratio of PM to MM signal gives an indication of the sequence specificity of the PM signal. The Affymetrix Statistical Algorithms Reference Guide (https://www.affymetrix.com/support/technical/technotes/statistical_reference_guide.pdf) describes in more detail the algorithms used to calculate the Present / Absent calls.
The Present / Absent calls generated by the GCOS software can be put to an interesting use. Discarding the transcripts that are called absent in all of the samples in an experiment reduces the number of hypotheses that are subsequently tested via ANOVA and can increase the power to detect differential expression of transcripts that are not discarded. In order to decide whether it makes sense to eliminate the transcripts that are absent in all samples prior to statistical analysis, it is important to understand the advantages and disadvantages that come with this decision.
Pros of removing transcripts called absent in all samples;
- In experiments with low statistical power, eliminating Absent transcripts can result in more transcripts being identified as differentially expressed at a given level of significance.
- Transcripts that are differentially expressed, but are called absent in all samples are likely to have high levels of non-specific hybridization, making these results difficult to interpret.
Cons of removing transcripts called absent in all samples;
- Transcripts that are eliminated cannot be used in downstream analysis. This may result in the elimination of interesting transcripts.
- Generally 25%-50% of the transcripts in an experiment are discarded; this will result in at most a 33% to 100% increase in power.
- If differentially expressed transcripts are equally frequent among Present and Absent transcripts, then eliminating Absent transcripts will not result in increased statistical power.
It is possible to identify and eliminate transcripts that are called absent in all samples using either Excel or Spotfire. In Excel, the AND function can be used to check if a number of cells are all equal to “A”. In Spotfire select DecisionSite for Functional Genomics, Data Analysis, Analyze Affymetrix absence / presence calls.
Volcano Plot
In a Volcano Plot, the fold change and significance for each gene are displayed as a scatter plot. Both fold change and significance are generally plotted in log scale. The spots take a characteristic volcano form because absolute fold change is correlated with significance.
Volcano plots can be used to demonstrate fold change and significance cut-offs. Volcano plots are also an excellent way to visualize the changes that occur in a group of genes.
Talk about making volcano plots comparing more than two groups?
Heat Map
Principal Components Analysis
Technique
Principal Components Analysis is a mathematical transformation that can be applied to microarray data sets allowing data compression and dimensionality reduction. The primary objective is to transform the data into a new space where data analysis is easier. Principal components analysis transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible while retaining orthogonality.
The mathematical technique used in PCA requires solving for the eigenvalues and eigenvectors of a microarray data-set in matrix form. The eigenvector associated with the largest eigenvalue has the same direction as the first principal component. The eigenvector associated with the second largest eigenvalue determines the direction of the second principal component, etc.. The maximum number of eigenvectors equals the number of columns (samples) of the microarray data-set.
At the Microarray Resource we use principal components to view distribution of variability within the various samples that make up an experiment.
Figure Showing PCA
Looking at samples using PCA
Looking at genes using PCA
Hierarchical Clustering
Clustering Transcripts
Hierarchical clustering is a technique that is used to group transcripts with similar expression profiles from a microarray experiment.
At the Microarray Resource we use hierarchical clustering to visualize the expression profiles of a group of genes that have been selected using other statistical methods. Hierarchical clustering is easy to do in Spotfire. It is important to use a normalization methods that makes transcripts comparable (such as Z-score or log ratio normalization) prior to hierarchical clustering.
Gene Clustering example (data before clustering / data after clustering)
Clustering samples
Hierarchical clustering can also be used to cluster samples based on the expression profile.
Sample Clustering example (data before clustering / data after clustering)
K-Means Clustering
K-Means clustering is a technique that is used to divide genes into discrete groups. The person analyzing the data decides how many groups are desired and the software breaks the transcripts into that many groups. K-means clustering can be performed in Spotfire. It is important to use a normalization method that makes transcripts comparable (such as Z-score or log ratio normalization) prior to K-means clustering.
The plot below shows a collection of transcripts broken up into 2 groups using K-means clustering. The transcript profiles are Z-score normalized. The bold lines are the average expression of the transcripts in that group and were generated by exporting the results of K-means clustering from Spotfire into Excel, calculating the cluster averages, and importing back into Spotfire.
Biological Data Mining and Pathway Analysis
EASE
Expression Analysis Systematic Explorer (EASE) is a software tool that can be used to identify biologically coherent groups of gene from a list of differentially expressed transcripts from a microarray experiment. EASE uses the data from a variety of public gene-information databases such as SwissProt, PFAM, and GO, to create an internal database of gene categories. EASE allows the user to feed it a list of Affymetrix Probe Set ID numbers or other gene identifiers and it distinguishes categories of genes that are over represented within the user’s list. It is important to understand the gene “categories” that are used by EASE in order to understand what the software does. In EASE, a category is a group of genes that share a common functional description. For example, Homeobox is an EASE category built from the PFAM database Homeobox domain entry. The Homeobox category contains all of the genes that contain homeobox domains according to the PFAM database.
EASE is a freeware program and it is available to all users of the Microarray Resource. (http://david.niaid.nih.gov/david/)
GeneChip probe design
Patient/Uninteresting affects for statistics section
Gene Filtering
Filtering Samples / Bad Samples
Genmapp
Hierarchical clustering and limitation of hierarchical clustering and how to do HC