Research Article

Identification of Differentially Expressed Genes and Prognostic Biomarkers of Breast Cancer Based on RNA-Seq and KEGG Pathway Network  

S.M. Zhang1 , Y. Gu1 , S.Y. Wu2 , Y. Kang3 , S. Liu3 , D. Zhang3
1. College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
2. Software College, East China University of Technology, Nanchang, 330013, China
3. The 2nd Affiliated Hospital, Harbin Medical University, Harbin, 150081, China
Author    Correspondence author
Cancer Genetics and Epigenetics, 2016, Vol. 4, No. 2   doi: 10.5376/cge.2016.04.0002
Received: 25 Jul., 2016    Accepted: 26 Jul., 2016    Published: 18 Oct., 2016
© 2016 BioPublisher Publishing Platform
This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article:

Zhang S.M., Gu Y., Wu S.Y., Kang Y., Liu S. and Zhang D., 2016, Identification of Differentially Expressed Genes and Prognostic Biomarkers of Breast Cancer Based on RNA-Seq and KEGG Pathway Network, Cancer Genetics and Epigenetics, 4(2): 1-9 (doi: 10.5376/cge.2016.04.0002)


The incidence of breast cancer is a complex biological process and multiple genes involved in the regulation. The gene expression differences of tumor cells between different patients’ determine the different treatment and prognosis. Therefore investigate the characteristics changes of breast cancer from a genetic level include identification of differentially expressed genes and prognostic markers will facilitate the development of appropriate and effective treatment. This subject obtained RNA-Seq Level 3 gene expression data from TCGA database, SAM algorithm was used to find differentially expressed genes. Next, the DAVID bioinformatics tool was employed to analyze the function of these genes, and obtained the significantly enriched pathways of these genes. Then gene interaction information was extracted from the pathways, KEGG pathway network was built by integrating these information, and the network topology were analyzed. The hub nodes extracted from the network were as candidate genes. Then the genes which have a significant impact on the survival were identified by using Cox proportional hazards regression model. And these genes were introduced into a multivariate analysis, the sample risk scores were calculated, according to which samples were divided into a high risk group and a low risk group. The survival difference between these two groups was analyzed using Kaplan Meier method, and logrank test was used to assess the statistical significant. By analyzing the gene expression dataset of TCGA database, a total of 5880 differentially expressed genes were found. Eight significant pathways were obtained by enrichment analysis. Then we used the interaction information of genes extracted from the pathways to build a KEGG pathway network, and 32 candidate genes were obtained from the network. Three significant genes (AARS, ADK, and ADORA2A) which have significant impact on the prognosis of breast cancer were identified by Cox proportional hazards. These three genes can be used as new prognostic biomarkers in breast cancer, provide guidance for the treatment of breast cancer. Wherein AARS has been proven associated with breast cancer risk. By multivariate analysis, this subject divided breast cancer into a high risk group and a low risk group, and there exits significant difference between them.

Breast Cancer; Differentially Expressed; KEGG Pathway Network; Gene; Prognosis


Breast cancer is a heterogeneous disease originated in the breast cancer tissue. It is the most common cancer in women, with up to 25 percent. The main risk factors for breast cancer are female gender and age (Reeder and Vogel, 2008). Other potential risk factors include genetic factors, infertility or lack of breastfeeding (Collaborative Group on Hormonal Factors in Breast, 2002), high levels of the hormone (Yager and Davidson, 2006), diet and obesity. Recent studies have shown that exposure to environmental pollution is also a risk factor for breast cancer (Clapp et al., 2008).


Minor role played by genetic susceptibility in most cases, however, in general, genetic factors are thought to be the main reason for 5-10% of patients (Gage et al., 2012). In less than 5% of patients, genetic factor plays a significant role by causing hereditary breast cancer - ovarian cancer syndrome. This includes patients who carry the BRCA1 and BRCA2 gene mutations. These mutations accounted for 90% of genetic factors and patients affected have a 60-80% risk of breast cancer (Gage et al., 2012). Other notable mutations also include P53, PTEN, ATM, PALB2 and CHEK2. Mutations that cause breast cancer have been confirmed by experiments association with estrogen exposures (Cavalieri et al., 2006). Abnormal growth factor signal of interaction between stromal cells and epithelial cells can promote the growth of malignant cells (Haslam and Woodward, 2003; Wiseman and Werb, 2002). And overexpression of lepton can promote cell proliferation and cancer in breast adipose tissue (Jarde et al., 2011).


The development, recurrence and metastasis of breast cancer is a multi-gene participation, multi-step process of synergy, inactivation of oncogenes or activation of tumor suppressor genes are important in tumor genesis (Kretschmer et al., 2011). Thus, differences in gene expression profile changes has been the hot of pathogenesis, recurrence and metastasis research of breast cancer (Manjili et al., 2012). Using microarray technology can effectively filter out the breast cancer-specific differentially expressed genes (Hayashida et al., 2011). Currently, there are some studies have screened differentially expressed genes in breast cancer patients. Grb14 was found highly expressed in 23.1% breast cancer, and this over-expression is associated with better disease-free and overall survival time, and can be as independent prognostic factor (Huang et al., 2013). Gasoline, a major actin-binding protein is widely expressed in normal cells, and is down regulated in a variety of cell types comprising the mammary epithelial (Baig et al., 2013).


In recent years, many studies committed to the research of breast cancer prognosis, the discovery of new prognostic markers of breast cancer will provide guidance for treatment. Staub et al have demonstrated in three tumor types, including colorectal cancer, breast cancer and gliomas, patients with low expression modules which were co-expression with WIPF1 generally have a good prognosis (Staub et al., 2009). Proteases as biomarkers in breast cancer prognosis and diagnosis has also been studied (Jarde et al., 2011). Song et al confirmed the conversion acidic coiled-coil protein (TACC3) gene overexpression was significantly correlated with poor prognosis in breast cancer, particularly in patients with tumor grade 2-4. And multivariate analysis showed that the expression of TACC3 are independent prognostic factor in breast cancer patients (Song et al., 2016). Zhu et al demonstrated that the histone methyltransferase enzyme hSETD1A was associated with triple negative breast cancer prognosis. They proved hSETD1A positive was correlated with short-term survival in patients with triple-negative breast cancer, suggesting that it can be used as prognostic biomarker of triple negative breast cancer (Zhu et al., 2016).


This study aims to use high-throughput gene sequencing data, discovery new breast cancer prognostic markers by screening differential expression gene and taking advantage of network. Firstly RNA-Seq data from The Cancer Genome Atlas (TCGA) database were used to find breast cancer prognostic biomarkers. Significance analysis of microarrays (SAM) was used to find the differentially expressed genes between tumors and normal breast samples. Then we obtained the significant Go terms and KEGG pathways the genes enriched. Next KEGG pathway network was constructed using the interaction between genes extracted from pathways and the topological properties of network were analyzed. The hub nodes in this network were selected as candidate genes. Next, the association between these candidate genes and patient survival was tested by Cox proportional hazards regression analysis and obtained the genes which have significant impact on the survival. Then these genes were introduced into the multivariate analysis, risk scores of samples were calculated and according to which samples were divided into high and low risk groups, whether there are significant differences on survival between the two groups were analyzed.


Materials and Method

Gene expression dataset of breast cancer

In recent years, the development of high-throughput sequencing technology makes it possible for researchers to understand the mechanism of diseases including cancer at the whole genome level. RNA-SeqV2 level 3 gene expression data of breast cancer were downloaded from TCGA database. It was processing according to the following steps: (1) The 0 value entries were instead using the minimum value of the data; (2) gene expression data were log2 standardization; (3) Because the samples contain multiple batches, sva R package was employed to remove batch effects using Combat function. Finally 1029 samples and 20,531 genes were included in this dataset, including 1099 breast tumor samples and 110 normal control samples.


Screening of differentially expressed genes

In this study, significance analysis of microarrays (SAM) was used to find the differentially expressed genes using samr R package. SAM method is a kind of statistical methods for analysis of microarray gene expression data and screening differentially expressed genes. It commonly uses permutation algorithm to estimate the false discovery rate (FDR), so as to control the error rate for multiple testing purposes. The data in this study included 1099 breast tumor samples, 110 adjacent normal samples and 20,531 genes. The threshold for differentially expressed genes selection was q < 1, foldchange > 2 or foldchange < 0.5. Obtain up- and down-regulated genes in breast tumors respectively.


Functional enrichment analysis for gene lists

Functional annotation and pathway enrichment analysis of differentially expressed genes were conducted using DAVID (The Database for Annotation, Visualization and Integrated Discovery) bioinformatics tools (Huang da et al., 2009b; Huang da et al., 2009a). DAVID provides a comprehensive set of functional annotation tools; it can be used for researchers to understand biological significance of a large set of genes. Here DAVID online tools were used for enrichment analysis of all the differential expression genes on the GO (Gene Ontology) function and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, control significant level FDR < 0.05.


Construction of KEGG pathway network

Genes and Genomes Kyoto Encyclopedia (KEGG) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. We downloaded KEGG pathway XML files from the KEGG website, using the XML R package to extract the interactions between these genes, integrated these interactions information and built a KEGG pathway network. Cytoscape (Smoot et al., 2011)software was used for analysis and visualization of this network. Cytoscape is an open source software platform that provides users to visualize molecular interaction networks and biological pathways.


Prognostic analysis

The correlation between gene expression and prognosis of breast cancer was tested by using the Cox proportional hazards regression model, adjusting for age, stage due to they have a certain degree of correlation with survival, they were used as covariates. Genes significantly affected survival were used in the subsequent multivariate analysis, the risk scores were calculated for each sample, namely the regression coefficients for each gene * the gene expression values in the sample and plus them. Survival analysis was conducted using the Kaplan Meier method, logrank method for testing the statistically significant. All statistical analyzes were carried out in the R open-source software, p < 0.05 was considered significant.



The differential expression genes in breast cancer and their function analysis

To find potential breast cancer-related genes, SAM algorithm was employed to find genes that were differentially expressed between 1099 breast tumor samples and adjacent 110 normal samples from TCGA database. Genes with q < 1, foldchange > 2 or foldchange < 0.5 were selected as the differential between cancer and normal. A total of 5880 differentially expressed genes were obtained, including 1715 upregulated and 4165 downregulated genes in cancer samples. As can be seen, most of the differential expression genes in breast cancer were changed downward.


DAVID online bioinformatics tool then was used for GO functional analysis of this 5880 differentially expressed genes, control FDR < 0.05. 227 significantly enriched Go terms were obtained, these genes mainly enriched in cell adhesion, biological adhesion, cell-cell signaling, behavior, regulation of system process, ion transport and many other biological processes. In addition, we also conducted KEGG pathway enrichment analysis for these genes, and obtained eight significant KEGG pathways, including neuroactive ligand-receptor interaction, cytokine - cytokine receptor interaction, retinol metabolism, drug metabolism, complement and coagulation cascades, metabolism of xenobiotics by cytochrome P450, steroid hormone biosynthesis and ECM-receptor interaction (Figure 1). A graph in figure 1 illustrated the results of GO functional analysis; we show only the most significant eight GO terms, B graph was the results of KEGG pathway analysis, showing all eight significant pathways. In order to obtain information on the interaction between genes in these pathways, we download the XML files of these eight pathways from the KEGG website, in which including information on gene interaction. The information of these eight pathways will be used for further analysis.


Figure 1 Gene Ontology function and KEGG pathway enrichment analysis. A. Gene Ontology function enrichment analysis, only the most significant eight go terms were shown in this figure. B. KEGG pathway enrichment analysis


Figure 2 KEGG pathway network


Figure 3 The probability distribution of nodes in the network


Construction and analysis of KEGG pathway network

Eight KEGG pathways which differentially expressed genes enriched were downloaded from KEGG website. XML R package was used to extract information of gene interactions. These relationships between gene pairs were integrated and a KEGG network was built. The network consists of 317 nodes, 384 edges (Figure 2). Degrees of nodes in the graph were represented by different node sizes. Next, the Cytoscape software was used for network analysis. The diameter of network was 11, the clustering coefficient was 0.016 and the distribution of node degrees was in line with power-law distribution (Figure 3).


Since the hub nodes in network tend to be more important, because their change may affect more genes which have interaction with them. Therefore, we selected degree ranked in the top 10% of all the nodes in the network as the hub node, contain 32 genes, these genes were seen as candidate genes associated with breast cancer.


Identification of prognosis molecular markers of breast cancer

Then we extracted the gene expression profiles of these candidate genes from the gene expression data. Finally we got the gene expression information of 23 in 32 genes; there are nine genes were missing in gene expression data. Next, the clinical information of each breast cancer sample was obtained from TCGA database, including the survival time and status, age, and stage information. Then Cox proportional hazards regression model was used to detect the association between genes and survival, adjusting for age and stage. In these 23 genes, there are three (AARS, ADK, ADORA2A) were significantly associated with prognosis of breast cancer (p < 0.05, Table 1).


Then these three potential candidate genes for breast cancer prognostic marker were introduced into the multivariate analysis, by Cox regression coefficient of the three genes and their expression values in each sample we obtained the risk score of each sample. According to the risk score samples were divided into a high risk group and a low risk group. The threshold for grouping was the median of risk scores. Then the Kaplan Meier method was used for survival analysis of these two groups, and draws their survival curve. The significant of difference between them was tested by logrank method and the result showed that the survival difference between these two groups was significant (p = 2.95e-05, Figure 4). The red curve represented the high risk group; and green curve is the low risk group. This indicated that these three genes can be used as prognostic marker for breast cancer and their further verification test is necessary.


Figure 4 Survival curves of high risk group and low risk group



Breast cancer is a heterogeneous cancer which is the world's highest incidence of women; it is a serious threat to women's health. The occurrence of breast cancer is a complex biological process which a number of genes involved and regulated. The various levels of gene expression in tumor cells of different individuals determined the difference in treatment and prognosis (Saad et al., 2010). So investigate characteristic changes of breast cancer from the gene level and detection of breast cancer prognostic biomarkers will play important roles in guiding breast cancer therapy.


With the development of high-throughput sequencing technology, making it easier for researchers understand the mechanism of occurrence and development of disease from the genome-wide level. RNA-Seq transcriptome sequencing technology is to sequence mRNA, smallRNA noncoding RNA and the like, to reflect their level of expression. TCGA is a free public database resource includes many types of cancers and a variety of data types. From which we can obtain a large number of RNA-Seq samples of breast cancer. We obtained 1099 breast cancer tumor samples and 110 normal control samples of RNA-SeqV2 Level 3 gene expression data from TCGA database, and analyzed the genetic characteristics of breast cancer on a genome-wide level to find differentially expressed genes in breast cancer as well as molecular markers for cancer prognosis.


Table 1 The result of Cox proportional hazards regression


SAM algorithm was used to screen differentially expressed genes between breast tumors and normal samples and 5880 genes were found, including 1715 upregulated and 4165 down-regulated. By GO functional enrichment analysis we found that these genes were mainly enriched in cell adhesion, biological adhesion, cell-cell signaling, behavior, regulation of system process, ion transport and many other biological processes. In addition, KEGG pathway enrichment analysis found that they are significantly enriched in neuroactive ligand-receptor interaction, cytokine - cytokine receptor interaction, retinol metabolism, drug metabolism, complement and coagulation cascades, metabolism of xenobiotics by cytochrome P450, steroid hormone biosynthesis and ECM-receptor interaction. By integrating gene interaction information from these pathways a KEGG pathway network was built and then the hub nodes of the network were extracted, 32 candidate genes were obtained.


The expression level of 23 genes were obtained from gene expression profiles, by Cox proportional hazards regression analysis, adjusted for age and stage, 3 genes were found had a significant effect on survival (p < 0.05) including AARS, ADK, ADORA2A. Wherein, AARS, alanyl-tRNA synthetase was responsible for protein synthesis and cell viability in a variety of processes involved in tumor genesis. It has been shown to play an important role in the development of breast cancer; it can modify individual susceptibility of Chinese patients and was associated with risk of breast cancer (He et al., 2015). This confirms the reliability of our results. However, association between the other two genes with breast cancer risk has not yet been studied; they may be new prognostic factors or risk genes of breast cancer, so the further analysis and experimental verification of them is necessary.


Furthermore, these three genes were introduced into multivariate analysis to calculate a risk score for each sample. According to the value of risk score sample was divided into a high risk group and a low risk group. Survival analysis was conducted between these two groups and we found that the two groups are different significantly in outcome. This indicates our subject can divide TCGA breast cancer patients into different prognostic groups; the method can also be used on other patient data to guide breast cancer treatment.



This work was supported by the Innovation and Technology special Fund for excellent academic leader of Harbin (grant number 2015RAXYJ051).



Baig R.M., Mahjabeen I., Sabir M., Masood N., Ali K., Malik F.A., and Kayani M.A., 2013, Mutational spectrum of Gelsolin and its down regulation is associated with breast cancer, Dis Markers, 34: 71-80


Cavalieri E., Chakravarti D., Guttenplan J., Hart E., Ingle J., Jankowiak R., Muti P., Rogan E., Russo J., Santen R., and Sutter T., 2006, Catechol estrogen quinones as initiators of breast and other human cancers: implications for biomarkers of susceptibility and cancer prevention, Biochim Biophys Acta, 1766: 63-78


Clapp R.W., Jacobs M.M., and Loechler E.L., 2008, Environmental and occupational causes of cancer: new evidence 2005-2007, Rev Environ Health, 23: 1-37


Collaborative Group on Hormonal Factors in Breast C., 2002, Breast cancer and breastfeeding: collaborative reanalysis of individual data from 47 epidemiological studies in 30 countries, including 50302 women with breast cancer and 96973 women without the disease, Lancet, 360: 187-195


Gage M., Wattendorf D., and Henry L.R., 2012, Translational advances regarding hereditary breast cancer syndromes, J Surg Oncol, 105: 444-451


Haslam S.Z., and Woodward T.L., 2003, Host microenvironment in breast cancer development: epithelial-cell-stromal-cell interactions and steroid hormone action in normal and cancerous mammary gland, Breast Cancer Res, 5: 208-215


Hayashida T., Jinno H., Kitagawa Y., and Kitajima M., 2011, Cooperation of cancer stem cell properties and epithelial-mesenchymal transition in the establishment of breast cancer metastasis, J Oncol, 2011: 591427


He Y., Gong J., Wang Y., Qin Z., Jiang Y., Ma H., Jin G., Chen J., Hu Z., Guan X., and Shen H., 2015, Potentially functional polymorphisms in aminoacyl-tRNA synthetases genes are associated with breast cancer risk in a Chinese population, Mol Carcinog, 54: 577-583


Huang Da W., Sherman B.T., and Lempicki R.A., 2009a, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Res, 37: 1-13


Huang Da W., Sherman B.T., and Lempicki R.A., 2009b, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources, Nat Protoc, 4: 44-57


Huang O., Jiang M., Zhang X., Xie Z., Chen X., Wu J., Liu H., and Shen K., 2013, Grb14 as an independent good prognosis factor for breast cancer patients treated with neoadjuvant chemotherapy, Jpn J Clin Oncol, 43: 1064-1072


Jarde T., Perrier S., Vasson M.P., and Caldefie-Chezet F., 2011, Molecular mechanisms of leptin and adiponectin in breast cancer, Eur J Cancer, 47: 33-43


Kretschmer C., Sterner-Kock A., Siedentopf F., Schoenegg W., Schlag P.M., and Kemmner W., 2011, Identification of early molecular markers for breast cancer, Mol Cancer, 10: 15


Manjili M.H., Najarian K., and Wang X.Y., 2012, Signatures of tumor-immune interactions as biomarkers for breast cancer prognosis, Future Oncol, 8: 703-711


Reeder J.G., and Vogel V.G., 2008, Breast cancer prevention, Cancer Treat Res, 141: 149-164


Saad E.D., Katz A., and Buyse M., 2010, Overall survival and post-progression survival in advanced breast cancer: a review of recent randomized clinical trials, J Clin Oncol, 28: 1958-1962


Smoot M.E., Ono K., Ruscheinski J., Wang P.L., and Ideker T., 2011, Cytoscape 2.8: new features for data integration and network visualization, Bioinformatics, 27: 431-432


Song H., Liu C., Shen N., Yi P., Dong F., Li X., Zhang N., and Huang T., 2016, Overexpression of TACC3 in Breast Cancer Associates With Poor Prognosis, Appl Immunohistochem Mol Morphol,


Staub E., Groene J., Heinze M., Mennerich D., Roepcke S., Klaman I., Hinzmann B., Castanos-Velez E., Pilarsky C., Mann B., Brummendorf T., Weber B., Buhr H.J., and Rosenthal A., 2009, An expression module of WIPF1-coexpressed genes identifies patients with favorable prognosis in three tumor types, J Mol Med (Berl), 87: 633-644


Wiseman B.S., and Werb Z., 2002, Stromal effects on mammary gland development and breast cancer, Science, 296: 1046-1049


Yager J.D., and Davidson N.E., 2006, Estrogen carcinogenesis in breast cancer, N Engl J Med, 354: 270-282


Zhu Y., Bai K., Yu J., and Guo M., 2016, Association Between Histone Methyltransferase hSETD1A and Prognosis in Patients With Triple-Negative Breast Cancer After Surgery: A Retrospective Study in the Chinese Female Population, Medicine (Baltimore), 95: e3783

Cancer Genetics and Epigenetics
• Volume 4
View Options
. PDF(408KB)
. Online fPDF
Associated material
. Readers' comments
Other articles by authors
pornliz suckporn sex videos bbw mom xxx big fucking arabin porn videos teen gril sex video riding hard cock woman hard vagina . S.M. Zhang
. Y. Gu
. S.Y. Wu
. Y. Kang
. S. Liu
. D. Zhang
Related articles
. Breast Cancer
. Differentially Expressed
. KEGG Pathway Network
. Gene
. Prognosis
. Email to a friend
. Post a comment