

Biological networks to the analysis of microarray data
- Journal: Progress in Natural Science (English edition)
- File size: 863 KB
- Authors: FANG Zhuo, LUO Qingming, ZHANG G
- Affiliations: Hubei Bioinformatics and Molecular Imaging Key Laboratory, Shanghai Center for Bioinformation and Technology
- Updated: 2020-11-22
PROGRESS IN NATURAL SCIENCE Vol. 16, No. 12, December 2006

FANG Zhuo, LUO Qingming**, ZHANG Guoqing and LI Yixue
(1. Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan 430074, China; 2. Shanghai Center for Bioinformation and Technology, Shanghai 200235, China)

Received April 10, 2006; revised July 3, 2006

Abstract  ... a large amount of data. How to mine the biological meaning of these data is one of the main challenges in bioinformatics. Compared with purely mathematical techniques, methods that incorporate prior biological knowledge generally give better interpretations. Recently, a new kind of analysis, which introduces knowledge of biological networks such as metabolic networks and protein interaction networks, has been widely applied to microarray data. Microarray data analysis based on biological networks contains two main research aspects: identification of active components in biological networks and assessment of gene sets significance. In this paper, we briefly review the progress of these two categories of analyses, especially some representative methods.

Keywords: biological networks, microarray, data analysis, subnetwork, gene set.

Genomic expression, which has a profound impact on identifying regulation relationships, gene function prediction, investigation of pathogenic mechanisms, drug discovery and so on, is one of the research focuses of the post-genomic era. High-throughput methodologies such as oligonucleotide and cDNA microarrays, which can monitor global expression changes of thousands of genes, are thus of great significance to modern life science research and have attracted extensive studies [1]. Because the information in expression data is valuable yet complicated, how to manage, integrate and interpret these data correctly is becoming the main challenge.

Originally, purely mathematical approaches were introduced to help microarray data analysis. In general, they cover two aspects: (1) identification of differentially expressed genes according to sample classes; (2) classification of samples or genes according to similar expression patterns. Several statistical methods have been adopted to determine differentially expressed genes, including the t-test, non-parametric tests, Bayesian models and so on [2]. These approaches produce a list of significant genes, which is however difficult to interpret without a unifying biological theme. Cellular processes are often carried out through interactions among many genes, so the analysis of single genes may miss important information. Accordingly, clustering or classification processes [3-6] were introduced to describe the global change in the expression of many genes. Like differentially expressed gene identification, this global change cannot reflect the cellular response directly either. Moreover, both processes depend very much on the particular algorithm design, and different designs often lead to different results with little overlap. So far, it is difficult to understand the biological meaning of microarray data by purely mathematical approaches alone.

Many subsequent studies illustrate that incorporating prior biological knowledge into microarray data analysis can effectively avoid the above problems of purely mathematical methods [7-9]. Biological knowledge includes all aspects covering sequence alignments, protein structures and biological functions, and can be either generic or species-specific. Recently, one kind of prior knowledge, the biological network, which characterizes the annotation relationships among genes as a network structure, has become very popular. Microarray data analysis based on biological networks integrates expression profiles with networks such as metabolic networks and protein interaction networks. According to the well-accepted assumption that co-expression among genes is attributable to common biological functions, microarray data analysis based on biological networks may identify the genes that perform a certain function through their expression profiles. Furthermore, it can be used to evaluate the significance of a certain biological function by checking the expression state of all related genes.

* Supported by the National Program on Key Basic Research Projects (No. 2004CB518606), the Fundamental Research Program of Shanghai Municipal Commission of Science and Technology (No. 04DZ14003), and the National Key Technologies R&D Program of China (No. 2005BA711A04)
** To whom correspondence should be addressed. E-mails: qluo@mail.hust.edu.cn, yxli@sibs.ac.cn

Progress in Natural Science Vol. 16 No. 12 2006  www.tandf.co.uk/journals

1 Biological networks

As mentioned in the previous section, a biological network represents the annotation relationships among genes or gene products, such as proteins. Generally, there are three main sources of biological networks: metabolic networks, molecular interaction networks and Gene Ontology. Knowledge of these biological networks can be obtained from public databases. Metabolic networks as well as chemical reactions can be found in the KEGG database. The Kyoto Encyclopaedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/) [10] provides a reference knowledge base for linking genomes to biological systems, and wiring diagrams of interaction networks and reaction networks, which can be used for modeling and simulation as well as for browsing and retrieval. The Biomolecular Interaction Network Database (BIND) (http://bind.ca) [11] and the Database of Interacting Proteins (DIP) (http://dip.doe-mbi.ucla.edu) [12] provide molecular interaction information. The BIND database stores interactions and reactions arising from biopolymers (protein, RNA and DNA), as well as small molecules, lipids and carbohydrates. The DIP database aims to supply binary protein-protein interactions. Gene Ontology (GO) (http://www.geneontology.org) [13] is a cross-species, controlled vocabulary describing three domains of molecular biology: molecular function, cellular component and biological process. It is specifically intended for annotating gene products and is independent of any biological species. Besides the databases of these three biological networks, GenMapp (http://www.genemapp.org) [14] allows visualization of gene expression data on Alliance for Cellular Signaling, BioCarta, EcoCyc, MetaCyc, KEGG and PathDB pathways, together with analysis tools such as MAPPFinder [15]. The TRANSFAC database (http://www.gene-regulation.de/) [16] provides transcription factors and their DNA-binding sites and profiles. Pfam (http://www.sanger.ac.uk/Software/Pfam/) [17] is a database of protein families represented by multiple sequence alignments and hidden Markov models.

2 Assessment of gene sets significance

A preliminary stage of analyzing microarray data on biological networks is analysis at the gene set level. Gene sets denote genes that are correlated in biological networks, for example, by participating in the same metabolic pathway or by sharing the same biological function, chromosomal location or regulation. The main goal of gene set analysis is to determine whether a large number of genes from a gene set are significantly regulated. By assessing the overall differential expression level of the genes, these analyses capture the fact that cellular processes often affect sets of genes acting in concert, thus avoiding the shortcomings of single-gene analyses. Moreover, analyses at the gene set level can detect consistent but subtle changes in gene expression, which is also worth noting.

Curtis et al. [18] have reviewed some methods of gene set analysis, including hypergeometric probability (binomial distribution [19], Fisher exact test [20] and $\chi^2$ test [21]), Z scores [15] and odds ratio [22], gene set enrichment analysis [23] and so on, which are not discussed further in this paper. In the following sections, we summarize some recent progress on methods for gene sets significance assessment; additional information about them is listed in Table 1.

2.1 Pathway scores

A pathway is an intricate network consisting of the chemical reactions and interacting molecules that perform specified biological functions. It is the key to understanding how an organism reacts to perturbations from its environment or to internal changes. Zien et al. have presented an approach that evaluates pathways by scoring gene expression in terms of conspicuousness, synchrony and a combined effect [24]. In the conspicuousness score, a normal distribution is used to model the expression change. The conspicuousness score of a gene is determined by its expression levels under all conditions, while the conspicuousness score of a pathway is [...] the scores of the genes included in it. Pearson's correlation coefficient [...] expression similarity in the synchrony score. The synchrony score of a gene is calculated as its average correlation coefficient to the other genes in the pathway it belongs to, while the synchrony score of a pathway is the average of the gene scores. The combined scoring function is a modified form of the synchrony score in which the standard deviation of particular genes is replaced with the holistic standard deviation, which scales the covariance between genes in a union.

Kurhekar et al. have made some improvements on Zien et al.'s pathway scoring [25]. Unlike the conspicuousness score of Zien et al., the activity score of Kurhekar et al. counts the number of active genes in a pathway. The score is higher if more genes in the pathway are over-expressed or under-expressed. The coregulation score measures the slopes among genes in a pathway, avoiding the problem that pairwise correlations cannot capture the simultaneous co-expression of all genes in a pathway. To incorporate the structure of a pathway into the analysis, a cascade score is introduced to measure the interaction levels among active genes in a pathway. Each path in a pathway is scored by the number of active genes occurring in it, and the highest score is assigned as the cascade score of the pathway.

The analysis of pathway scores considers pathways from three aspects. Nevertheless, previous approaches do not synchronize these three aspects well. The authors always computed the three scores separately, so different pathways might be ranked as significant under different aspects. For example, pathways A and B may be the most significant pathways by conspicuousness score, pathways C and D by synchrony score, and pathways E and F by combined score. Hence the interpretation of the results is challenging. Because of such inconsistency, a uniform criterion involving all three aspects is definitely needed.

2.2 Function module analysis

It is well known that in most cases only a subset of a whole gene set may contribute to its expression signature, and different sets may have similar signatures in the same experiment. To extract the core parts of each set, function module analysis was introduced. Segal et al. first adopted this method to identify conditionally active expression modules in different types of cancer [26]. There, the significance of a function module (a set of genes) is determined by the fraction of active genes in it. The significance of a module for a specified cancer type is determined by the fraction of experiments in which the module is active. Function module analysis combines function modules (gene sets) with cancer types (experiment sets) through the concepts of gene enrichment and gene set enrichment. As the reference points out, the method can characterize both the modules shared across multiple tumor types, which may be related to general tumorigenic processes, and the modules specific to particular tumors. Also, each cancer type can be described as a particular combination of modules.

2.3 Gene set enrichment analysis

Another important aspect of gene set analysis is gene set enrichment analysis (GSEA), which was proposed by Mootha et al. in 2003 [23]. GSEA determines whether predefined gene sets are enriched at the top of a list of genes ordered by the expression difference (signal-to-noise ratio, SNR) between two classes. A normalized Kolmogorov-Smirnov statistic is used to define the enrichment score. For a gene set $S$ containing $G$ members and a gene list $R_1, \ldots, R_N$ ordered by differential expression level, the score is $X_i = -\sqrt{G/(N-G)}$ if $R_i$ does not belong to $S$, and $X_i = \sqrt{(N-G)/G}$ if $R_i$ belongs to $S$. The enrichment score of a pathway is the value of the running sum of $X_i$ over the $N$ genes with the maximum absolute value.

In subsequent studies, some statistical problems with the GSEA procedure were raised by Damian et al. [27]. The most interesting one is that the enrichment score is influenced by the size of the gene set. Another limitation is that the GSEA process cannot treat the top genes in the gene list with the same significance as the bottom genes, although the top and bottom genes represent up-regulated and down-regulated genes respectively and should be equally significant for analysis. To avoid these problems, Subramanian et al. introduced an enhanced GSEA, in which a weighted score is applied at each step instead of an equal score [28]. Here $X_i = |r_i|^p / N_R$ is the score for genes included in set $S$, while $X_i = -1/(N-G)$ is the score for genes not in $S$, where $r_i$ is the rank for gene $i$ and $N_R = \sum_{j \in S} |r_j|^p$ is taken over all genes in $S$. The enrichment score is still the maximum deviation of the running sum from zero. The improved version of GSEA can identify gene sets that have enriched subsets at both the top and the bottom of the gene list, which are ignored in the original GSEA algorithm because bottom genes are over-penalized.

Kim et al. developed a parametric gene set enrichment analysis (PAGE) based on the normal distribution [29]. Its theoretical basis is the Central Limit Theorem: the distribution of the average of n randomly sampled observations tends to a normal distribution as the sample size n becomes larger, whether or not the parent distribution is normal. As a result, the statistical significance of gene sets can be assessed with a normal distribution if the number of genes in the sets is sufficiently large (at least 10). Compared with GSEA, PAGE can obtain more significant gene sets, with p-values lower than those of GSEA. This may be because GSEA uses permutations of the original data set (1000 times) to obtain the background distribution of each enrichment score and evaluates the significance of each enrichment score from the permuted data, so the best p-value in GSEA cannot be smaller than 0.001 (which is 1 over 1000).

Table 1. Summary of some methods for gene sets significance assessment

Methods: Statistical tests [7, 15, 20, 21]
  Annotations: Gene Ontology; Metabolic network; GenMapp; Biocarta; Pfam; GEPAS; MIPS; SMART
  Databases: GO [13]; KEGG [10]; GenMapp [14]; Pfam [17]; MIPS [30]; SMART [31]
  Data sets: Arabidopsis; Yeast; Mouse; Human

Methods: Pathway scores [24]
  Annotations: Glycolysis and gluconeogenesis pathway in Saccharomyces cerevisiae described in [32]; Metabolic pathways
  Databases: KEGG [10]
  Data sets: 6178 Saccharomyces cerevisiae ORFs during 18 time points [32-34]; [...] cerevisiae ORFs during 4 time courses [32-34]

Methods: Function module analysis [26]
  Annotations: Metabolic pathway; Tissue specific expressed gene set [35]; P-clusters [36]
  Databases: GenMapp [14]
  Data sets: 14145 genes in 1975 arrays spanning 17 cancer categories

Methods: Gene set enrichment analysis [23, 28, 29]
  Annotations: Gene Ontology; Signaling pathway; Signaling gateway; Transduction information; Protein reference; Sigma-Aldrich pathways; Human cancer genome anatomy information; Gene arrays; Regulatory-motifs in the promoter region [37]; Neighborhoods around cancer associated genes
  Databases: GenMapp [14]; SPAD [40]; AfCS-Nature Signaling Gateway [41]; STKE [42]; Human protein reference database [43]; Sigma-Aldrich [44]; SuperArray [45]; CAGE Cancer [46]; Novartis normal tissue compendium; Novartis carcinoma compendium [38]; Global cancer map [39]
  Data sets: 22000 genes in skeletal muscle biopsy samples from 43 males [23]; 12625 probes on HGU95Av2 chip for 50 NCI60 cell lines, 17 normal and 33 p53 mutations [47]; 12625 probes on HGU95Av2 chip for 24 acute lymphoid leukemia and 24 acute myeloid leukemia [48]; 12625 probes on HGU95Av2 chip for 62 lung adenocarcinomas; 7129 probes on HU6800 chip for 86 lung adenocarcinoma [50]

3 Identification of active components in biological networks

Recently, another direction of microarray data analysis based on biological networks, namely identifying active components in biological networks, has become very popular. In these analyses, gene expression data are [...] and mapped to biological networks. Under the assumption that genes with shared function will be activated together and thus show correlated expression profiles, the connected regions (subnetworks) inside the network that show significant changes over particular conditions can be selected by applying certain optimization algorithms. In this sense, the main effort in this topic focuses on the optimization algorithms, as the following sections illustrate. Identification of active components in biological networks is expected to help in understanding the underlying mechanisms governing the observed changes in gene expression. Some methods for identifying active components in biological networks, with additional information about them, are listed in Table 2.

Table 2. Summary of some methods for identification of active components in biological networks

Methods: Simulated annealing [51]
  Annotations: Protein-protein interactions; Protein-DNA interactions
  Databases: BIND [11]; TRANSFAC [16]
  Data sets: 997 mRNAs responding to 20 systematic perturbations of the yeast galactose-utilization pathway [51]

Methods: Expectation maximization [53]
  Annotations: Protein-protein interactions
  Databases: DIP [12]
  Data sets: 3589 Saccharomyces cerevisiae genes with 173 microarrays [65]; 3589 Saccharomyces cerevisiae [...] microarrays [...]

Methods: Kernel function analysis [56]
  Annotations: Pathways
  Databases: KEGG [10]
  Data sets: 6178 Saccharomyces cerevisiae ORFs during 18 time points [32-34]

Methods: Graph-iterative group analysis [63]
  Annotations: Gene Ontology; Metabolic network
  Databases: GO [13]; [...]Prot [66]
  Data sets: 6178 Saccharomyces cerevisiae ORFs during 18 time points [32-34]

Methods: Wavelet transform [64]
  Annotations: Metabolic network
  Databases: EcoCyc [67]
  Data sets: 4345 E. coli ORFs of 43 samples [68]

3.1 Simulated annealing

An algorithm based on simulated annealing was introduced by Ideker et al. [51]. In their algorithm, the genes are first scored by their differential expression levels. The score of a subnetwork is defined as an adjusted average of the scores of the genes inside it. The score of each pathway is then normalized using a Monte Carlo random approach. A high score indicates an active biological subnetwork. Because the problem of finding the maximal-scoring subnetwork is NP-hard, simulated annealing is used to look for high-scoring subnetworks. The initialization randomly sets each gene node in the network to an active or inactive state, and the active genes form an initial subnetwork. In each step, the state of a randomly chosen gene in the network is toggled and the score of the resulting subnetwork is calculated. After sufficient iterations, the subnetwork with the highest score is exported as one of the results.

Patil et al. [52] have applied this approach to metabolic networks. The complete metabolic network is represented as a bipartite graph. In this graph, metabolites and enzymes are represented as nodes, while interactions between them are represented as edges. Another, unipartite graph is formed by the enzymes, in which any two enzymes sharing a common substrate in the corresponding reactions are connected to each other. Scoring and sorting genes by their expression data generates reporter metabolites from the metabolic graph. Simulated annealing similar to Ideker's process then identifies the significantly correlated subnetworks.

The simulated annealing method has several drawbacks. Firstly, it cannot guarantee finding the optimal subnetwork. Theoretically, if the number of iterations is large enough, the final solution will be the global optimum. Thus the number of iterations is always set very large (for example, 100000 by Ideker et al.) to assure the quality of the subnetworks, which leads to great computational demand. Secondly, simulated annealing requires relatively complex parameter estimation. It is difficult to obtain appropriate parameters, and currently they are mainly set empirically.

3.2 Expectation maximization

Segal et al. have proposed another approach to detect gene groups whose expression profiles are correlated and whose protein products interact [53]. The partitions are modeled with relational Markov networks, which contain two components: one for the expression data and the other for the protein interaction data. Gene expression profiles are modeled using Naive Bayes models. In this model, genes are clustered into disjoint classes. The conditional probability of the expression value in one experiment, for a gene belonging to a certain cluster, is assumed to follow a Gaussian distribution. The probabilistic model for protein interaction data is based on the assumption that interacting proteins [...] in the same pathway. [...] fields are used to model the [...] network. Finally, a unified model integrating the gene expression model and the protein interaction model can be naturally defined as their product.

Some parameters, such as those of the probability distributions in the model, need to be estimated in the unified model. They are estimated by the Expectation Maximization (EM) algorithm. The EM procedure iterates between Expectation (E) and Maximization (M) steps. In the E step, the unified probability is calculated with the current parameters. In the M step, the parameters are estimated so as to maximize the probability obtained in the E step. When the EM algorithm converges, each gene is assigned to the group with the maximum conditional probability.

Despite the success of this method in searching for functional gene groups, some limitations remain. The main one is that the method is based on probabilistic models and thus relies heavily on the assumption that the data set fits a particular distribution. This may not be true in many practical cases. For example, Yeung et al. [54] studied three gene expression data sets with several data transformations and found that the data sets all fit the Gaussian model poorly. Another limitation is that the model constrains each gene to be in exactly one group, which cannot capture the biological fact that many gene products participate in more than one biological process [55].

3.3 Kernel function analysis

The kernel function analysis method was introduced by Vert et al. [56]. There, the gene network and the expression data are transformed into two kernel functions, and active pathways are then extracted by performing a regularized form of canonical correlation analysis. The correlation between two different elements, nodes in the pathway graph and expression profiles, is used to assess the relationship between gene expression and pathway function information. The main goal of this algorithm can be summarized as finding vectors for which the variation among gene expression profiles is as large as possible and the expression features among adjacent genes in a pathway are as continuous as possible. These vectors are characterized as linear combinations of expression profiles. By encoding the expression data and pathways into two kernel functions, the problem is solved by canonical component analysis.

Kernel function analysis has been generalized to many kinds of data, including amino acid sequences [57], phylogenetic profiles [58] and promoter regions [59]. These data are represented as kernel functions and subjected to correlation analysis. Furthermore, the form of correlation is not limited to two variables [60]. Multiple kernel functions have been applied to operon detection in bacterial genomes [61].

Kernel function analysis is efficient but really complicated. As the number of factors increases, the dimension and complexity of the analysis rise greatly. Besides, the integration of multiple kernel functions is also a problem. Currently the solution is to form a convex combination of kernels by assigning a nonnegative weight to each kernel. Lanckriet et al. gave an optimized algorithm to estimate the kernel weights by semidefinite programming, yet the process is seriously time and space consuming [62].

3.4 Graph-iterative group analysis

The graph-iterative group analysis (GiGA) by Breitling et al. also provides a statistical way to identify active subgraphs in the biological knowledge graph [63]. Genes that share a Gene Ontology annotation or participate in the same metabolic pathway are connected to build the graph. At the same time, a rank list of genes sorted by differential expression is provided, and each node is ranked according to the gene allocated to it. Local minima are then identified in the graph as the nodes with a lower rank than all of their direct neighbors. The local minimum nodes are considered to be significant centers of the subsequent subnetworks. From each local minimum, a subnetwork is extended by including the neighboring node with the next highest rank (m) and, if present, all adjacent nodes of ranks equal to or smaller than m. To assess the significance of each extension, a p-value is calculated as the probability of observing n genes whose ranks are equal to or better than m among all N genes. The extension process ends when no more nodes can be included. The subnetwork with the highest score is output as the ideal relevant region.

In GiGA, the calculation can only deal with one condition at a time, which means that it cannot [...] of gene expression [...]. The method will not perform well [...] experiments, especially in the case of time series datasets. Another disadvantage of GiGA is that the genes are ranked by the log value of expression levels, from positive to negative. In fact, however, a biological pathway contains both induced and repressed genes, and there are both up-regulating and down-regulating relationships among the genes of a pathway. Thus ranking genes from positive to negative cannot detect pathways that include both remarkably induced genes and remarkably repressed genes, whereas these genes and their regulatory relationships are usually extremely important.

3.5 Wavelet transform

Early this year, Konig et al. published a novel method to discover central components of metabolic networks [64], which identified distinct expression patterns from E. coli under both aerobic and anaerobic conditions using the wavelet transform. In this method, the metabolic network is represented as a graph, with the enzymes as edges and the metabolites as nodes. After applying a clustering method, the metabolic graph is grouped into several sub-graphs. Expression data of all samples are mapped, and features for every sample are obtained by performing Haar wavelet transformations on the gene expression patterns of the sub-graphs. The most significant features are extracted by a modified t-test and an SVM process. Finally, the sub-graphs containing the most significant features are considered the relevant ones that can represent the adaptation of the cells to changing environmental conditions.

This wavelet transform method combines some machine learning approaches and elucidates relevant sub-graphs by testing all possible patterns within the metabolic network. However, the sub-graphs are extracted in advance by a graph clustering method, which is based only on the topology of the metabolic network. This kind of initialization limits the selection of sub-graphs, as many other important factors besides topological connections need to be considered when extracting sub-graphs.

4 Validation problem

Different methods may lead to different results. How to assess the significance of the results is a serious task. For the first category of analysis, gene sets significance assessment, the most widely used validation method is to calculate a p-value from multiple randomized data sets, which is used to verify whether the analysis result is more significant than expected. The false discovery rate (FDR) is another guideline for gene sets significance evaluation. FDR can predict the number of falsely discovered results, for example, the number of pathways that would be expected by chance to have a notable p-value. Usually, both a false-positive and a false-negative rate should be calculated separately.

For the second category of analysis, active components identification, there are no standard validation criteria. The common treatment compares the resulting active components with existing evidence, for example, known regulatory circuits, gene clusters in other literature and so on. Besides, there are also other validation criteria for some special approaches. In the expectation maximization method of Segal et al. [53], three principles are used to evaluate the learned model: prediction of removed interactions, functional annotation enrichment of pathways and coverage of protein complexes. It is generally agreed that a good molecular pathway should have three properties: (1) it is stable when a small portion of interactions is taken away; (2) it is coherent in functional annotations; and (3) as many protein complexes as possible are assigned to the same pathway [53]. These validation criteria have only been applied in the expectation maximization method and might be adapted to other procedures.

5 Comparison of methods

In order to examine the overlaps and differences in the results of various methods, comparisons among different methods are necessary. Curtis et al. have compared some methods for gene sets assessment [18]. They found that the binomial distribution and z-scores give similar results, while more gene sets are shown to be downregulated by GSEA [18]. Here, we try to compare two recent methods of active components identification: GiGA [63] and wavelet transform [64]. We used the microarray dataset of the wavelet transform study. The dataset is from Covert et al. [68] and describes E. coli genome expression under aerobic and anaerobic conditions. Konig et al. normalized the dataset and selected 43 hybridizations of one wild-type sample and six strains with knockouts of key transcriptional regulators in the oxygen response ([...], ΔoxyR, ΔsoxS and a double knockout [...]). The signal-to-noise ratio (SNR) of GSEA [28] is used to produce the ranked list of reactions for GiGA (the genes are represented as reactions in the wavelet transform, thus reactions instead of genes are analyzed in GiGA for comparison purposes). We also used other ranking procedures for GiGA, such as standard deviation, and found that the results from different ranking procedures agree closely, especially in the most significant subgraphs.

We compared the subgraphs extracted by GiGA and by the wavelet transform. Possibly because the wavelet transform allows overlapping subgraphs while GiGA does not, many more subgraphs resulted from the wavelet transform than from GiGA. Only one subgraph (formate metabolism) exists in the results of both GiGA and the wavelet transform, and it is the most significant subgraph in both methods. However, there are many overlaps between the results of GiGA and the 40 first-ranking reactions listed by the wavelet transform method. The results of the comparison are summarized in Fig. 1. This may support the concept that the wavelet transform is designed to identify complex expression patterns that cannot be found straightforwardly [64].

[Figure 1. Panels: Downregulated, Upregulated; sets compared: GiGA, first ranking reactions]

Fig. 1. Comparison between the results of GiGA and the 40 first-ranking reactions listed by the wavelet transform method. The figure shows the number of either downregulated or upregulated reactions in the subgraphs found by GiGA and in the first-ranking reactions list of the wavelet transform, as well as their overlaps. GiGA was performed on a gene list ranked by SNR.

6 Discussions and prospects

With the development of microarray technology, more and more algorithms for genome-wide data are becoming available. How to mine the knowledge in expression data efficiently and thoroughly is the most interesting challenge to date. Incorporating biological knowledge seems to be the trend in expression data analysis, as biological knowledge can provide guidance for connecting expression profiles with biological functions. In this review, we have summarized one form of biological knowledge, biological networks, and two main research aspects of microarray data analysis based on biological networks: identification of active components in biological networks and assessment of gene sets significance.

Microarray data analysis based on biological networks integrates biological annotation knowledge with expression profiles, which shows more potential for obtaining biologically significant results. The structure of biological networks intuitively reflects the annotation relationships among genes. For example, neighboring nodes in a biological network (nodes with connections) represent genes with annotation correlations, which is convenient for subsequent analysis. Furthermore, microarray data analysis based on biological networks involves the contributions of multiple genes, which captures the biological fact that cellular processes often act through interactions of many genes.

A possible improvement deserving attention is that all kinds of clues should be integrated systematically. For example, in pathway scores, the expression activity level, the expression similarity and the structural character of pathways are all considered, yet no criterion combining all three properties is provided. In the subnetwork identification methods, either expression activity or expression similarity is judged, while no approach incorporates both properties. As an attempt, we have incorporated gene expression synchrony assessment into GSEA, which shows better results than the current GSEA method (manuscript in preparation).

Another potential improvement is to add some details of biological annotations. For both categories of analysis, detailed biological information is usually ignored. The active biological subnetworks and functional gene sets are considered as a whole, and the gene relationships inside the subnetworks and gene sets are not exhibited. Subnetwork or gene set analysis is designed to avoid the limitations of single-gene analysis. However, it is obvious that detailed information will help to analyze and interpret results. This information may include the degree of nodes (the number of edges connected to each node) in biological networks, the structure of subnetworks, the connections among active genes in subnetworks, the co-regulated regions in a subnetwork and so on. Such information may bring in some heuristic ideas. For example, in graph theory, the degree of a node represents how central the node is in the network, thus integrating this character may obtain gene significance [...] structure in addition to expression [...].

In addition, regarding the biological annotations, there is so far no consistent standard for different annotation sources. Situations will probably arise in which some expression data tend to show more significance for Gene Ontology gene sets, while others show more tendency toward KEGG gene sets. The function module refining process by Segal et al. [26] is perhaps the initial attempt at this problem. Moreover, in the identification of subnetworks, annotations from multiple sources could be integrated together. That is to say, one might build a network involving all or some of protein interactions, metabolic information, Gene Ontology categories, regulatory relationships and so on. Using this large network, one can imagine not only the discovery of better subnetworks but also correlation analysis among different annotations.

For further consideration, the two categories of analysis can be integrated to some extent.

References

5 [...] Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB). 2000, 93-103.
6 Jiang D., Tang C. and Zhang A. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(11): 1370-1386.
7 Adryan B. and Schuh R. Gene Ontology based clustering of gene expression data. Bioinformatics, 2004, 20(16): 2851-2852.
8 Cheng J., Cline M., Marin J. et al. A knowledge-based clustering algorithm driven by Gene Ontology. Journal of Biopharmaceutical Statistics, 2004, 14(3): 687-700.
9 Fang Z., Yang J., Li Y. X. et al. Knowledge guided analysis of microarray data. Journal of Biomedical Informatics, accepted.
10 Kanehisa M., Goto S., Hattori M. et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 2006, 34(Database issue): D354-357.
11 Bader G. D., Betel D. and Hogue C. W. BIND: the biomolecular interaction network database. Nucleic Acids Res., 2003, 31(1):
The princi-248- -250.ples used in two categories such as expression scores12 Xenarios I. ,Salwinski L. ,DuanX. J. et al. DIP ,the database ofcan be adapted to each other. Moreover , the resultsinteracting proteins ;a research tool for studying cellular networksof protein interactions. Nucleic Acids Res. , 2002 ,30( 1 ): 303- -from the second category of analysis can be validated305.by the first category of analysis , that is to say,if13 AshburmerM. ,Ball C. A. , BlakeJ. A. et al. Gene Ontology :some active subnetworks are obtained from certain ac-tool for the unification of biology. The Gene Ontology Consortium.Nat. Genet. ,2000 ,25( 1 ):25- -29.tive components identification methods ,then these14 Dahlquist K. D. , SalomonisN. , Vranizan K. et al. GenMAPP ,asubnetworks can be assessed by some gene sets assess-new tool for viewing and analyzing microarray data on biologicalment methods to verify whether the resulted subnet-pathways. Nat. Genet. ,2002 ,31(1 ):19- -20.15 Doniger S. W.,Salomonis N.,Dahlquist K. D. et al.works are significant or not.MAPPFinder: using Gene Ontology and GenMAPP to create aglobal gene-expression profile from microarray data. Genome Biolo-Last but not the least ,there are some exteriorgy ,2003 ,4 :R7.facts that will affect the analyses. As mentioned in16 Wingender E. ,ChenX. ,FrickeE. et al. The TRANSFAC sys-[ 18 ], there is no single repository for various gene i-tem on gene expression regulation. Nucleic Acids Res. ,2001 .29(1):281- -283.dentifiers , such as Affymetrix probe IDs , gene sym-17 FinnR. D. ,Mistry J. ,Schuster-Bockler B. et al. Pfam : clans , .bols , accession numbers and so on. Some of them areweb tools and services. Nucleic Acids Res. , 2006 , 34( Database is-too obsolete or redundant to map an enzyme or pro-sue):D247-251.tein to a microarray probe ID. Besides,there are18 Curtis R. K. ,Oresic M. and Vidal Puig A. Pathways to the anal-ysis of microarray data. Trends Biotechnol. 
, 2005 ,23( 8 ):429-large proportions of genes that have no function anno-435.tations yet , thus all the analyses are processed on a19 TavazoieS. ,HughesJ. D. ,Campbell M.J. et al. Systematic de-small subset of genes. Hopefully , the analysis of mi-trmination of genetic network architecture. Nat. Genet. , 19992X 3): 281- 285.croarray data with biological networks will be more20 ZeebergB. R. ,Feng W. , WangG. et al. GoMiner :a resourepowerful when the annotations become more complete.for biological interpretation of genomic and proteomic data.Genome Biology ,2003 ,4 :R28.AcknowledgementThe authors would like to thank21 KhatriP. , BhavsarP. ,BawaG. et al. Onto tools ian ensemble ofDr. Lu Qiang for helpful comments and suggestions.web-accessible,ontology-based tools for the functional design andinterpretation of high-throughput gene expression experiments.ReferencesNucleic Acids Res. ,2004 ,32 : W449- -W456.22 ChoiJ. K. ,ChoiJ. Y. KimD. G. et al. Integrative analysis ofDugganD. J. , Michael B. , Chen Y. et al. Expression profilingmuliple gene expresson profiles applied to liver cancer study.using cDNA microarrays. Nature Genetics, 1999 ,21( 1 Suppl):FEBS Lett. ,2004 ,565<1- -3):93-100.23r EriksonK., F. etal. P6C-! Efron B. and Tibshirani R. Empirical bayes methods and fale dis-中国煤化工oidaive poporylation arecovery rates for microarrays. Genet. Epidemiol. ,2002 ,23( 1 ):diabetes. Nat. Genet..70- -86.THCNMHG"3 EisenM. B. ,SpellmanP. T. ,PatrickO. B. et al. Cluster analy-sis and display of genome wide expression patterms. Proc. Natl. A-sion data with pathway scores. In :Proc. Int. Conf. Intell. Syst.cad. Sci. USA , 1998 , 95( 25 ): 14863-14868.Mol. Biol. ,2000 ,8 :407-417.GetzG. ,Levine E. and Domany E. Coupled two- way clustering25 Kurhekar M. P. ,Adak S. ,Jhunjhunwala S. et al. Genome wideanalysisg 号数icroarray data. Proc. Natl. Acad. Sci. USA,pathway analysis and visualzation using gene expression data. Pac2000 ,无互教579- -12084.Symp. Bicomput. 
,2002 :462- -473.Progress in Natural Science Vol.16 No. 12 2006 www. tandf. co. uk/ journals125 126 Segal E. , FriedmanN. ,Koller D. et al. A module map showing49 Bhattacharjee A. , Richards W. G. , StauntonJ. et al. Cassifica-conditional activity of expression modules in cancer. Nat. Genet. ,tion of human lung carcinomas by mRNA expression profiling re-2004 ,36( 10 ): 1090- -1098.veals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci.27 Damian D. and Gorfine M. Statistial concerns about the GSEAUSA , 2001 ,98 24 ): 13790- -13795. .procedure. Nat. Genet. ,2004 ,36( 7 ):663.50 BeerD. G. ,KardiaS. L. ,HuangC. C. et al. Gene-expression28 Subramanian A. Tamayo P. , Mootha V. K. et al. Gene set en-profiles predict survival of patients with lung adenocarcinoma. Nat.richment analysis: a knowledge-based approach for interpretingMed. ,2002 ,88):816- -824.genomewide expression profiles. Proc. Natl. Acad. Sci. USA ,51 Ideker T. ,Orier O. ,Schwikowski B. et al. Discovering regulato-2005 , 102 43 ): 15545-15550.ry and signalling circuits in molecular interaction networks. Bioin-29 KimS. Y. and Volsky D. J. PAGE : parametric analysis of geneformatics ,2002 ,18( Suppl 1 ): S233- -240.set enrichment. BMC Bioinformatics , 2005 ,6 :144.52 Patil K. R. and Nielsen J. Uncovering transcriptional regulation of30 MewesH. W. , Frishman D. ,Mayer K. F. X. et al. MIPS :metabolism by using metabolic network topology. Proc. Natl. A-analysis and annotation of proteins from whole genomes in 2005.cad. Sei. USA , 2005 ,102 8):2685- -2689.Nucleic Acids Res. ,2006 ,34 :D169- -172.53 Segal E. , Wang H. and Koller D. Discovering molecular pathways31 LetunicI. ,CopleyR. R. ,PilsB. et al. SMART 5 : domains infrom protein interaction and gene expression data. Bioinformatics ,the context of genomes and networks. Nucleic Acids Res , 2006 ,2003 ,19(Suppl 1):264- -271.34( Database isue ):D257- -260.54 Yeung K. Y. ,Fraley C. ,Murua A. et al. Model: based clustering32 Spellman P. T. , SherlockG. 
, Zhang M.Q. et al. Comprehensiveand data transformations for gene expression data. Bioinformatics ,identification of cell cycle-regulated genes of the yeast Saccha-2001 ,17 10):977- -987.romyces cerevisiae by microarray hybridiation. Mol. Biol. Cell. ,55 Hvidsten T. R. , Lagreid A. and Komorowski J. Learmning rule-1998 ,9 12):3273- -3297 .based models of biological process from gene expression time pro-33 DeRisiJ. L. ,Iyer V. R. , Brown P. O. Exploring the metabolicfiles using Gene Ontology. Bioinformatics ,2003 , 199 ): 1116-and genetic control of gene expression on a genomic scale. Science ,11231997 ,278( 5338 ): 680- -686.56 VertJ. P. and Kanehisa M. Extracting active pathways from gene34 ChuS. ,DeRisiJ. , Eisen M. et al. The transcriptional program ofexpression data. Bioinformatics, 2003,19 ( Suppl 2 ): 11238-sporulation in budding yeast. Science , 1998 , 282( 5389 ): 699-705.57 Jakkola T.,Diekhans M. and Haussler D. A discriminative35 SuA.I. ,CookeM. P. ,Ching K. A. et al. Large- scale analysisframework for detecting remote protein homologies. J. Comput.of the human and mouse transcriptomes. Proc. Natl. Acad. Sci.Biol. ,2000,71- -2):95- -114.USA ,2002 ,99 7):4465- -4470.58 VertJ. P. A tree kernel to analyze phylogenetic profiles. Bioinfor-36 Segal E. , Shapira M. , Regev A. et al. Module networks : identi-matics , 2002 , 18 :S276- -S284.fying regulatory modules and their condition-specific regulators from59 Pavlidis P.,Furey T. S.,Liberto M. et al. Promoter region-gene expression data. Nat. Genet. ,2003 ,34 2 ): 166-167.based assfiationo of genes. In :Procedings of the Pacifie Sympo-37 SuA. I. , Wiltshire。, Batalov S. et al. A gene atlas of thesium on Biocomputing , 2001 ,151-163.mouse and human protein- encoding transcriptomes. Proc. Natl. A-50 FrancisR. B. and Jordan M. I. Kemel independent componentcad. Sci. USA , 2004 , 101( 16 ):6062- -6067.analysis. J. Machine Learning Res. ,2002 ,3 :1-48.38 SuA. I. ,WelshJ. B. ,SapinosoL. M. et al. 
Molecular casifi61 Yamanishi Y. ,VertJ. P. ,NakayaA. et al. Extraction of corre-cation of human carcinomas by use of gene expression signatures.lated gene clusters from multiple genomic data by generalized kernelCancer Res. ,2001 ,61 20 ): 7388- -7393.canonical correlation analysis. Bioinformatics, 2003,19 ( Suppl .39 RamaswamyS.,Tamayo P. , Rifkin R. et al. Multiclass cancer1 ):323- -330.diagnosis using tumor gene expression signatures. Proc. Natl. A-52 LanckrietG. R. ,DeB. T. , Cristianini N. et al. A statisticalcad. Sci. USA , 2001 ,98( 26):15149- -15154.framework for genomic data fusion. Bioinformatics, 2004 , 2040 Higashiku H. Signaling Pathway Database. http: //www. grt.( 16 ):2626- -2635.kyushu- u. ac. jp/ spad/ menu. html[ 1998- 10-16 ]63 Breiting R. , Amntmann A. and Herzyk P. Graph based iterative41 LiJ. ,Ning Y. ,Hedley W. et al. The molecule pages database.Group Analysis enbances microrray interpretation. BMC Bioinfor-Nature , 2002 , 4200 6916):716- -717.matics ,2004 ,5 : 100.42 Signal transduction knowledge environment. http: //stke. sci-54 Konig R. , SchrammG. ,Oswald M. et al. Discovering functionalencemag. org[ 2006 ] .gene expression patterns in the metabolic network of Escherichia13 Human protein reference database. www. hprd. org[ 2006 ]coli with wavelets transforms. BMC Bioinformatics, 2006,7:14 Sigma Aldrich. http ://www. sigmaldrich. com/ Area of Interest/Biochemicals/ Enzyme Explorer/ Key Resources. html[ 2006 ]65 Gasch A. P. , Wermer- Washburme M. The genomics of yeast re-45 SupperArray. www. superarray. com[ 2006 ]sponses to environmental stress and starvation. Funct. Integr. Ge46 Brentani H. CaballeroO. L. ,CamargoA. A. etal. The genera-nomics ,2002 ,2(4- -5):181-192.66 Bairoch A. ,Bockmann B. ,FerroS. et al. Swiss Prot : jgglingtranscriptome by using expressed sequence tags. Proc. Natl. Acad. .between evolution and stability. Brief Bioinform. , 2004 ,5( 1 ):Sei. USA ,2003 , 10023): 13418-13423.39- -5547 Olivier M. , Eeles R. 
, Hollstein M. et al. The IARC TP5357中国煤化工ama-CastroS. etal. EcaCye:database: new online mutation analysis and recommendations to三for Escherichia coli. Nucleicusers. Hum. Mutat. , 2002 ,196):607- -614.fHCN M H Gue):D334- -37.48 ArmstrongS. A. , StauntonJ. E. , Silverman L.B. et al. MLL58 CovertM. w. ,Knight E. M. ,ReedJ. L. et al. Integratingtranslocations specify a distinct gene expression profile that distin-high-throughput and computational data elucidates bacterial net-guishes a unique leukemia. Nat. Genet. ,2002 ,30( 1 ):41- -47.works. Nature , 2004 , 4296987):92- -96.