Background
Pathway data are important for understanding the relationship between genes, proteins
and many other molecules in living organisms. Pathway gene relationships are crucial
information for guidance, prediction, reference and assessment in biochemistry, computational
biology, and medicine. Many well-established databases--e.g., KEGG, WikiPathways,
and BioCyc--are dedicated to collecting pathway data for public access. However, the
effectiveness of these databases is hindered by issues such as incompatible data formats,
inconsistent molecular representations, inconsistent molecular relationship representations,
inconsistent references to pathway names, and incomplete data across different databases.
Results
In this paper, we overcome these issues through extraction, normalization and integration
of pathway data from several major public databases (KEGG, WikiPathways, BioCyc, etc.).
We build a database that not only hosts our integrated pathway gene relationship data
for public access but also maintains the necessary updates in the long run. This public
repository is named IntPath (Integrated Pathway gene relationship database for model organisms and important pathogens). Four
organisms--S. cerevisiae, M. tuberculosis H37Rv, H. sapiens and M. musculus--are included in this version (V2.0) of IntPath. IntPath uses the "full unification"
approach to ensure no deletion and no introduced noise in this process. Therefore,
IntPath contains much richer pathway-gene and pathway-gene pair relationships and
a much larger number of non-redundant genes and gene pairs than any of the single-source
databases. The gene relationships of each gene (measured by average node degree) per
pathway are significantly richer. The gene relationships in each pathway (measured
by average number of gene pairs per pathway) are also considerably richer in the integrated
pathways. Moderate manual curation is involved to remove errors and noise from
source data (e.g., the gene ID errors in WikiPathways and relationship errors in KEGG).
We turn complicated and incompatible XML data formats and inconsistent gene and gene
relationship representations from different source databases into normalized and unified
pathway-gene and pathway-gene pair relationships neatly recorded in simple tab-delimited
text format and MySQL tables, which facilitates convenient automatic computation and
large-scale referencing in many related studies. IntPath data can be downloaded in
text format or MySQL dump. IntPath data can also be retrieved and analyzed conveniently
through web service by local programs or through web interface by mouse clicks. Several
useful analysis tools are also provided in IntPath.
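As a rough illustration of how tab-delimited pathway-gene pair records could be unified and measured, the sketch below merges gene pairs from several sources and computes the two richness measures mentioned above (average node degree and number of gene pairs per pathway). The four-column layout (pathway, gene A, gene B, source), the sample rows, the function names, and the choice to treat gene pairs as undirected are all assumptions made for this example; they are not taken from the actual IntPath file format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: pathway, gene A, gene B, source database.
# This column layout is an assumption, not the real IntPath schema.
SAMPLE_TSV = (
    "glycolysis\tHK1\tGPI\tKEGG\n"
    "glycolysis\tGPI\tHK1\tWikiPathways\n"  # same pair, reversed order
    "glycolysis\tGPI\tPFKM\tWikiPathways\n"
    "tca cycle\tCS\tACO2\tBioCyc\n"
)


def merge_gene_pairs(handle):
    """Union gene pairs per pathway across sources.

    Pairs are stored as frozensets, so (A, B) and (B, A) collapse into
    one non-redundant pair (assumes relationships are undirected).
    """
    pairs = defaultdict(set)
    for pathway, gene_a, gene_b, _source in csv.reader(handle, delimiter="\t"):
        pairs[pathway].add(frozenset((gene_a, gene_b)))
    return pairs


def average_node_degree(pair_set):
    """Mean number of distinct gene pairs each gene takes part in."""
    degree = defaultdict(int)
    for pair in pair_set:
        for gene in pair:
            degree[gene] += 1
    return sum(degree.values()) / len(degree) if degree else 0.0


merged = merge_gene_pairs(io.StringIO(SAMPLE_TSV))
avg_pairs_per_pathway = sum(len(p) for p in merged.values()) / len(merged)

for pathway, pair_set in sorted(merged.items()):
    print(pathway, len(pair_set), round(average_node_degree(pair_set), 2))
print("average gene pairs per pathway:", avg_pairs_per_pathway)
```

Taking the union of pairs, rather than keeping only those shared by all sources, mirrors the "no deletion" idea: nothing reported by any single source is dropped, and duplicates are collapsed instead of counted twice.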