TCMGeneDIT: a database for associated traditional Chinese medicine, gene and disease information using text mining
© Fang et al. 2008
Received: 10 May 2008
Accepted: 14 October 2008
Published: 14 October 2008
Skip to main content
© Fang et al. 2008
Received: 10 May 2008
Accepted: 14 October 2008
Published: 14 October 2008
Traditional Chinese Medicine (TCM), a complementary and alternative medical system in Western countries, has been used to treat various diseases over thousands of years in East Asian countries. In recent years, many herbal medicines were found to exhibit a variety of effects through regulating a wide range of gene expressions or protein activities. As available TCM data continue to accumulate rapidly, an urgent need for exploring these resources systematically is imperative, so as to effectively utilize the large volume of literature.
TCM, gene, disease, biological pathway and protein-protein interaction information were collected from public databases. For association discovery, the TCM names, gene names, disease names, TCM ingredients and effects were used to annotate the literature corpus obtained from PubMed. The concept to mine entity associations was based on hypothesis testing and collocation analysis. The annotated corpus was processed with natural language processing tools and rule-based approaches were applied to the sentences for extracting the relations between TCM effecters and effects.
We developed a database, TCMGeneDIT, to provide association information about TCMs, genes, diseases, TCM effects and TCM ingredients mined from vast amount of biomedical literature. Integrated protein-protein interaction and biological pathways information are also available for exploring the regulations of genes associated with TCM curative effects. In addition, the transitive relationships among genes, TCMs and diseases could be inferred through the shared intermediates. Furthermore, TCMGeneDIT is useful in understanding the possible therapeutic mechanisms of TCMs via gene regulations and deducing synergistic or antagonistic contributions of the prescription components to the overall therapeutic effects. The database is now available at http://tcm.lifescience.ntu.edu.tw/.
TCMGeneDIT is a unique database that offers diverse association information on TCMs. This database integrates TCMs with biomedical studies that would facilitate clinical research and elucidate the possible therapeutic mechanisms of TCMs and gene regulations.
Traditional Chinese Medicine (TCM), a complementary and alternative medical system in Western countries, has been used to treat various diseases over thousands of years in East Asian countries. The fundamental principles of TCM are based on the Yin-Yang doctrine, the symbolic way of designating opposing forces, and the five element theory that everything in the Universe is dominated and balanced by the five elements, wood, fire, earth, metal and water . The therapeutic mechanism of TCM focuses on enhancing human body's resistance to diseases by improving the inter-connections among self-controlled systems and integrating the human body with the environment . The practice of TCM involves physical therapy such as acupuncture and chemical therapy using materials originating from plants, minerals and animals, while TCM natural products may comprise one or more herbs in the form of decoctions [1, 3]. In recent years, many herbal medicines have been reported to be associated with various symptoms or diseases and may exhibit a variety of effects through regulating a wide range of gene expressions or protein activities, such as the anti-tumor effects through the induction of the DR3 and DR4/5 death receptors in human monocytic leukemia cells , the anti-inflammatory potential through the inhibition of cytokine, iNOS and COX-2 expression via the NF-kappaB pathway , the hepatoprotective effect through the activation of pregnane X receptor in rats , the expression of bcl-2 oncogene in gastric precancerous lesions , the up-regulation of proapoptotic Bax expression in human bladder carcinoma T24 cells , the anti-aging effect on the cerebral gene expression levels in the senescence-accelerated mouse prone 8/Ta , and the alteration of gene expressions of hepatic stellate cells against hepatic fibrosis . Therefore, studies about the connections between the TCMs, genes and diseases are emerging and important. In this study, TCM primarily means the natural products developed from plants and animals.
The primary goal of text mining is to retrieve knowledge hidden in text and to present the distilled knowledge such as associations, patterns to users in a concise form . With the rapid increase of available TCM data, there is an urgent need to explore these resources effectively from the large quantity of literature . Over the past few years, several studies have been made on the mining or extraction of TCM information from literatures, such as understanding ZHENG (syndrome) in TCM in the context of neuro-endocrine-immune network , extraction of clinical TCM formula data from literature , finding functional community of related genes using TCM knowledge, symptom complex , and extracting knowledge of drugs and formulae from semi-structured text . Many databases were developed to provide researchers with different aspects of TCM information, but few of them are published, presented in English, and freely accessible. For instance, TCM-ID contains information on prescriptions, herbs, herbal ingredients and 3D structures of partial herbal ingredients , the Chinese herbal medicines toxicology database holds herb monographs with a full toxicity profile and grading plus scientific evidences , the 3D structure database consists of chemical compounds isolated from Chinese traditional herbs , and the TCM database includes basic herb data, chemical components, molecular structures, and bioactive data . However, to date, there has been no attempt to develop a database integrating information about TCMs, genes and diseases using text mining. Furthermore, little attention has been given to build a knowledge base containing associations between TCMs and genes.
The purpose of this paper is to present a database, TCMGeneDIT, providing associations about TCMs, genes, diseases, effects and ingredients, and relations between TCM effects and effecters from vast amount of biomedical literature. Information about protein-protein interactions and biological pathways from public databases is also accessible. We illustrated the value of TCMGeneDIT by its utility in many aspects and two use cases about the possible therapeutic mechanisms of TCMs and synergistic or antagonistic contribution of the prescription component. The integrated genomics, proteomics, and text mining data, and user-friendly web interfaces in TCMGeneDIT make it a one-stop site for TCM and modern biomedical researchers. Thus, the database would facilitate clinical research and offer a better understanding for researchers of the TCM and gene regulation-involved therapeutic mechanisms.
The TCM names, gene names, MeSH disease terms, TCM ingredients and effects were used to annotate the literature corpus. All words were queried against documents in a case-insensitive and exact string matching manner.
We used the generic names and specific epithets such as Ganoderma lucidum of the TCM scientific names to match text. In addition, the abbreviated forms, for example, G. lucidum and G lucidum were considered.
After initial gene name identification, the most recent 100 gene-annotated documents were manually examined to reduce ambiguous gene symbols that may represent genes, diseases, cell lines, experiment techniques or other terms simultaneously. If any ambiguous terms were frequently matched in the text and do not mean genes, these gene symbols would be regarded as stop words and excluded from the gene name dictionary. We then annotated gene names in the literature again with the improved dictionary. In other words, the process can be executed after every update of the literature corpus so that the dictionary could be more specific and precise to the research domain. The approach mainly followed our pervious work . Since term ambiguity may occur between gene symbols (for example, p38 may represent MAPK1 or MAPK14), the NCBI Entrez gene IDs of the official gene symbols was assigned to the symbols or their aliases, if any, for the following association discovery.
In the disease name lexicon, if a term contains comma such as 'Tumors, Breast', all words of this term were individually queried against documents and the word sequence was ignored. If these words can be matched in a document, we assumed the document was identified by the term. Other terms without comma were directly queried against text.
The putative TCM effect list was generated by rule-based approach mentioned below. Synonyms such as 'anti-tumor' and 'anti-tumour' were manually grouped before annotation.
Annotated literature corpus was then used to find various associations including (TCM, gene), (TCM, disease), (TCM, gene, disease), (TCM, ingredient), (TCM, effect) and (gene, ingredient). The concept to discover associations was based on hypothesis testing and collocation, an expression consisting of two or more words that tend to occur together . The statistical methods based on co-occurrences can be very fast and useful for getting precise information . One can observe the co-occurrence of these entities in abstracts and calculate how improbable it is to observe a certain level of co-occurrences by chance. The null hypothesis (H0) is that the co-occurrences between TCM and entities are by chance, and so the probability of the terms appearing concurrently is given by: P(term1term2) = P(term1)P(term2). Since a term may include multiple synonyms, the occurrence probability of the term has to be represented by all its synonyms to reflect the real weight of this term. For example, if gene A has three synonyms, gene A1, gene A2 and itself and each of them occurs one, two and three times in text, respectively, then the total occurrence number of gene A is six in fact. The t test that looks at the mean and variance of a sample of measurements was then used to discover collocations. The t value was calculated with the following formula: where is the observed probability, μ is the expected probability, s 2 is the sample variance and N is the sample size, the total number of tokens in the corpus. If the t value is larger than 2.576, 2.326, 1.960 or 1.645, we can reject the null hypothesis with 99.5%, 99%, 97.5% or 95% confidence, respectively. In other words, associations discovered by chance would have low statistical values and not be shown or rank low in the results based on the threshold set by users.
To extract the relations between effecters and effects from literature, sentences were part-of-speech (POS)-tagged by Genia Tagger  and noun-phrase (NP) chunks were identified by NPchunker . The effecter and effect relations were extracted by matching preprocessed sentences with manually designed four rule patterns for high precision. For example, "[The/DT anti-androgenic/JJ activity/NN] of/IN [the/DT ethanol/NN extract/NN] of/IN [the/DT fruiting/VBG body/NN] of/IN [Ganoderma/NN lucidum/NN] has/VBZ been/VBN previously/RB reported/VBN./." is a POS-tagged and NP chunked sentence where each square bracket represents a NP.
Rule for effect extracts terms POS-tagged as adjectives and followed by effect-related keywords (for example, 'anti-tumor activity'). The corresponding pattern is as follows: [? (effect term/JJ) (selected keyword)/NN] where the selected keywords for effect identification include 'effect', 'activity', 'potential', 'property' and 'function', parenthesis represents priority, | denotes 'or' and ? denotes optional argument. The extracted terms without biological significances such as 'important' or 'great' were manually excluded.
Rule 1 extracts effect and effecter NP chunks connected by of. The effecter names could be composed of multiple chunks. The corresponding pattern is as follows: [? (effect) (keyword)/NN] of/IN [effecter 1] (((of | from)/IN [effecter 2] of/IN [effecter 3])|nil) where effecters 1, 2 and 3 represent the first, second and third NPs in the effecter name, respectively. For example, the above-mentioned example sentence can be matched by this rule and the effecter, 'the ethanol extract of the fruiting body of Ganoderma lucidum' and effect, 'anti-androgenic' are able to be extracted.
Rule 2.1 extracts effecter and effect NP chunks connected by active verbs. The effecter names are composed of two NP chunks connected by of or in. The following is the corresponding pattern: [effecter 2] (of | in)/IN [effecter 1] (modal/MD|nil) verb/VB [? (effect) (keyword)/NN]. For instance, a sentence matched by this rule can be 'the polysaccharides fraction of G. lucidum exhibited significant anti-tumor effect' and effecter, 'the polysaccharides fraction of G. lucidum', and effect, 'anti-tumor', can be extracted.
Rule 2.2 is similar to Rule 2.1, but extracts one or more effecters connected by and. The following is the corresponding pattern: [1st effecter] (((n-2 effecters) and/CC [nth effecter])|nil) (modal/MD|nil) verb/VB [? (effect) (keyword)/NN] where n is greater than or equal to two if there are multiple effecters mentioned. The sentence, for example, 'Lingzhi and San-Miao-San capsules might exert a beneficial immunomodulatory effect' is able to be matched by this rule and effecters, 'Lingzhi and San-Miao-San capsules', and effect, 'immunomodulatory', can be extracted.
Rule 3.1 extracts effecter and effect NP chunks connected by past participle verbs that are frequently observed in English for describing research discoveries. The corresponding pattern is as follows: [effecter 2] (of | in)/IN [effecter 1] (was | were)/VBD keyword/VBN to/TO verb/VB [? (effect) (keyword)/NN] where the keywords for VBN include 'found', 'shown', 'demonstrated', 'known', 'concluded', 'reported', 'proved', 'seen', 'suggested' and 'observed'. For example, this rule can match the sentence, 'hot water extract of the mushroom Ganoderma lucidum was found to exhibit antioxidative effect' and 'hot water extract of the mushroom Ganoderma lucidum' and 'antioxidative' can be extracted as effecter and effect, respectively.
Rule 3.2 is similar to Rule 3.1, but extracts one or more effecters connected by and. The corresponding pattern is as follows: [1st effecter] (((n-2 effecters) and/CC [nth effecter])|nil) (was | were)/VBD keyword/VBN to/TO verb/VB [? (effect) (keyword)/NN]. For instance, effecter, 'APBP', and effect, 'antiviral', are able to be extracted from the sentence, 'APBP was shown to have potent antiviral activity'.
Rule 4 extracts effecter and effect NP chunks connected by appositives modifying effecters, and verb phrases. The following is the corresponding pattern: [effecter],/, appositive,/, (modal/MD|nil) verb-phrase [(effect) (keyword)/NN]. For example, the sentence, 'LZ-8, a new and recently discovered immunomodulator from Ganoderma lucidum, has been shown to have immunosuppressive activity' can be matched and effecter, 'LZ-8', and effect, 'immunosuppressive', are able to be extracted, respectively.
The candidate relations having following conditions were excluded: (i) the extracted effects can't be identified in the effect set; (ii) the effecters are non-specific terms such as compound and sample; (iii) the start words of the effect NP chunks are 'no'.
We inferred transitive associations about TCMs according to Swanson's ABC model where if A and B are related, and B and C are related, then A and C might be indirectly related . For TCM and gene association inference, we assumed that gene (A) activities may be regulated by various ingredients (B) isolated from the TCMs (C) and estimated the generated relationships by the formula: , where n is the number of ingredients involved, denotes the t value for gene (A) and ingredient (Bi) and denotes the t value for ingredient (Bi) and TCM (C). As the work we referred, if the relationships between A and B, and B and C are more significant (higher t AB and t BC), the inferred associations between A and C could be more reliable (higher score calculated). Similar concept could be applied to correlate TCMs (A) with diseases (C) via known disease candidate genes (B). If the relationships between TCMs and these genes are more significant, the associations between TCMs and diseases contributed by these genes could be more promising.
In order to retrieve association information regarding Ganoderma lucidum, for example, users can select 'TCM Herb' search type, input Ganoderma lucidum as the search term and then click on the "Search" button. If the search is matched, the TCM information, including three links for TCM-Gene, TCM-Disease and TCM-Gene-Disease, ingredients collected from other databases and mined from literature, potential effects from text mining, and putative relations between effecters and effects extracted from the text will be shown. All association information is presented with t values in different colors, red (>2.576), green (>2.326) and blue (>1.960). Users can then follow the links for evidences in sentences from the literature, highlighted keywords, identified genes, journals and publication years. To examine the relationships between Ganoderma lucidum and genes, users can click on the TCM-Genes link and the returned web page will contain associations ranked by the t values and vote number. The vote features were presented for users with domain knowledge to recommend correct associations. Users are able to examine associations with preferred confidence levels. The default value is 97.5%. Evidence extracted from the literature can be accessed by following the link showing the number of paper. Users can then follow the links of gene names for general information and associations between specific genes and different TCMs. Furthermore, users are able to find which genes and diseases may be related to Ganoderma lucidum via the link TCM-Gene-Disease. The default results are associations purely generated from collocation analysis. The links of diseases are for users to examine disease descriptions from MeSH and which TCMs may be related to the disease. Additional TCM, gene and disease association information could be accessed by transitive inference, as the relationships between genes and diseases collected from PharmGKB are integrated to the text mining results of TCMs and genes.
To search for association information using gene names, users can select 'Gene Symbol' search type, for example, input 'TNF-alpha' as search term and select the gene of interest in the returned page. If users follow the link of TNF gene ID, a variety of gene information are shown, including gene description, function, two links for TCM-Gene and TCM-Gene-Disease, Gene Ontology annotation, involved biological pathways and interacting protein partners. For those who would like to find the associations of specific disease, such as Alzheimer's disease, with TCMs or both the TCMs and genes, can simply select the disease name in the main search interface. Disease description and three links for TCM-Disease, TCM-Gene-Disease and TCM-Gene-Disease (transitive) will be presented. The transitive disease, TCM and gene associations are inferred via common TCMs and preferred confidence levels for disease-TCM and TCM-gene associations can also be set. The associations between TCMs and particular effects can be accessed in the third section of the main search interface. For example, Ganoderma lucidum may be most associated with 'antitumor' effect.
In addition to providing associations between TCMs and genes by their co-occurrences, TCMGeneDIT could infer their transitive relationships through the shared intermediates, ingredients. For example, in the initial result page of Ganoderma lucidum and gene associations, users can follow the link 'by transitive inference' and select one or more ingredients of interests to perform relationship inference. Then, our system would try to find putative associations between TCMs and genes according to the sorted scores calculated by the above-mentioned formula: . In this example, TNF (Tumor Necrosis Factor) is suggested to be the gene highly associated with Ganoderma lucidum if users select polysaccharide and triterpene, the main ingredients of the TCM, to infer the relationships. Potential transitive associations with high scores but not explicitly found in the literature may be regarded as candidates for hypotheses generation.
Evaluation of the associations between TCMs and various entities from collocation analysis
(TCM, Gene, Disease)
Number of associations
Minimum confidence (%)
Number of TCMs contributed to the associations
TCMGeneDIT could be used to understand the possible therapeutic mechanisms of TCMs via gene regulations. Take Ganoderma lucidum and anti-tumor for example. From the database TCM search, the mining results indicated Ganoderma lucidum may be highly associated with anti-tumor effect (t-value>6.1). In addition, the TCM and gene association information showed IFNG (interferon, gamma), IL1B (interleukin 1, beta), TNF, NFKB1 (nuclear factor of kappa light polypeptide gene enhancer in B-cells 1) and IL2 (interleukin 2) genes were significantly related to Ganoderma lucidum (t-values >3.3). Furthermore, the functional classifications of Gene Ontology available in the gene search results revealed that most these genes are involved in apoptosis (80%) and cell proliferation (60%), in which both are correlated with tumorigenesis. The findings suggested Ganoderma lucidum may have therapeutic effects on cancers through the regulations of apoptosis and cell proliferation. Recent studies have shown that Ganoderma lucidum can induce apoptosis of prostate cancer cells (PC-3) via decreasing the expression of NFKB1-regulated genes  and Ganoderma lucidum is able to inhibit TNF-induced proliferation of human breast cancer cells through modulation of the NFKB1 signaling . On the other hand, several active ingredients such as polysaccharides, triterpenes, ganoderic acids highly associated with Ganoderma lucidum (t-values>5) may contribute to the putative mechanisms. As our expected, Ganoderma lucidum polysaccharides have been shown that can modulate the concentrations of IL2, IFNG, TNF, and NFKB1 in patients with advanced colorectal cancer , and ganoderic acid Me can inhibit tumor growth and lung metastasis through increasing the expressions of IL2, IFNG and NFKB1 . These results indicated that TCMGeneDIT could suggest the possible therapeutic mechanisms involved by TCMs, genes and TCM ingredients, and provide a one-stop site for TCM and modern biomedical researchers.
TCMGeneDIT may also be useful to determine which components of the prescription composite formulae will produce a synergistic effect or an antagonistic action. We applied TCMGeneDIT to 3 immune-related prescriptions, Dang Guei Bu Syue Tang composed of Astragalus membranaceus (Huang Ci) and Angelica sinensis (Dang Guei), Sheng Mai San composed of Panax ginseng (Ren Shen Ye), Ophiopogon japonicus (Cun Dong) and Schisandra chinensis (Bei Wu Wei), and Sih Jyun Zih Tang composed of Codonopsis pilosula (Dang Shen), Atractylodes macrocephala (Bai Jhu), Poria cocos (Bai Fu Ling) and Radix Glycyrrhizae (Wu Jhu Yu). From the database TCM effect search, we could find all herbs to be immune-related such as immunopotentiating, immunological, immunomodulatory or anti-inflammatory. In the case of Dang Guei Bu Syue Tang, both herbs had high t-values, 3.303 and 2.221, respectively, for antitumor effect and were both significantly associated with IL-2 and TNF. Both IL-2, a secreted cytokine important for the proliferation of T and B lymphocytes, and TNF, a multifunctional proinflammatory cytokine secreted by macrophages, are also known to be involved in cytokine-cytokine receptor interaction, which can be found from our database gene search. Thus, Astragalus membranaceus and Angelica sinensis may synergistically promote the immune effect of Dang Guei Bu Syue Tang. On the other hand, in Sheng Mai San, Panax ginseng is highly related to immune or immunomodulatory effects but Ophiopogon japonicus and Schisandra chinensis are only slightly immune-related (confidence < 95%). Both Panax ginseng and Ophiopogon japonicus are associated with TNF, while Panax ginseng and Schisandra chinensis are both related to IL-6 (interleukin 6), IL-4 (interleukin 4) and IL-10 (interleukin 10). In addition to cytokine-cytokine receptor interaction, IL-6, IL-4 and IL-10 are all involved in the regulation of the Jak-STAT signaling pathway and cytokines and inflammatory responses, which are supported by the KEGG and CGAP -integrated gene search results. Therefore, Panax ginseng may very possibly be the major component of Sheng Mai San, in which the immune function is produced by synergistic herb pairs. Although more investigation is required to find the underlying therapeutic mechanisms of the various prescriptions, TCMGeneDIT would certainly facilitate the understanding of the roles played by herb components in producing prescription effects.
The TCMGeneDIT differs from the TCM-ID in several ways. First, TCMGeneDIT provides association information about TCMs mined and extracted from a large amount of literature based on collocation analysis, as it joins TCMs with biomedical studies and modern life sciences, especially genomics and proteomics. In addition, the transitive relationships between TCMs and genes and/or diseases could be inferred through TCM ingredients and information integration. Furthermore, integrated protein-protein interaction and biological pathway could help to suggest the underlying therapeutic mechanisms of TCMs and genes.
Future research includes expanding database contents by adding information on prescription composite formulae, TCM ingredient structures, as well as integrating the relationships among Western drugs, genes and diseases, and creating TCM Ontology to describe TCMs and their properties and attributes. We are also planning on developing new rules and extend existing patterns for extracting more relations about TCMs. Furthermore, we intend to offer query functions for network visualization, thus enhancing the study of therapeutic pathways involving TCMs and genes.
We developed a unique database, TCMGeneDIT, which provides association information about TCMs, genes and diseases using scientific text mining, integrating TCMs with biomedical studies. It will not only help facilitate the understanding of therapeutic mechanisms involving TCM and gene interactions, but also be constructive to modern clinical research.
This research supported by National Science Council, Taiwan and NTU Frontier and Innovative Research Projects (NTUPFIR-96R0107). We thank Jason Lee for proofreading the manuscript and Jia-Wei Hsu for testing the database.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.