Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the EMU (Extractor of MUtations) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder, a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 manually verified mutations related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a two-fold improvement in the number of annotated mutations for PCa and BCa.
Availability: Freely available at http://bioinf.umbc.edu/EMU/ftp.
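
The extraction and filtering steps summarized under Results can be illustrated with a minimal Python sketch. It is not EMU's actual implementation: the regular expression, the function names `find_point_mutations` and `passes_sequence_filter`, and the toy protein sequence are all assumptions made for illustration. The sketch also assumes that the sequence-based filter checks whether the reported wild-type residue matches the gene's protein sequence at the stated position, which the abstract does not spell out.

```python
import re

# The 20 standard amino acids in one-letter code.
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern for point mutations written as
# <wild-type residue><1-based position><mutant residue>, e.g. "R175H".
POINT_MUTATION = re.compile(rf"\b([{AA1}])([1-9]\d*)([{AA1}])\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    return [
        (wt, int(pos), mt)
        for wt, pos, mt in POINT_MUTATION.findall(text)
        if wt != mt  # a substitution must change the residue
    ]


def passes_sequence_filter(mutation, protein_sequence):
    """Assumed filter: keep a candidate only if the wild-type residue
    really occurs at the stated position of the protein sequence."""
    wt, pos, _ = mutation
    return 1 <= pos <= len(protein_sequence) and protein_sequence[pos - 1] == wt


# Toy example: "T47D" is a cell-line name that looks like a mutation.
abstract = "The R3H and K2E substitutions were reported in T47D cells."
toy_sequence = "MKRTA"  # made-up sequence, not a real protein

candidates = find_point_mutations(abstract)
kept = [m for m in candidates if passes_sequence_filter(m, toy_sequence)]
print("candidates:", candidates)
print("kept after sequence filter:", kept)
```

In this toy case the pattern alone picks up the cell-line name as a candidate, and the sequence check discards it because position 47 lies outside the five-residue sequence. This is one plausible way such a filter could raise precision, in the spirit of the gains reported above.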