Original author(s) | David Perkins and Darryl Pappin |
---|---|
Initial release | 1999 |
Stable release | 2.6.00
/ December 2016 |
Operating system | Linux or Windows |
Available in | C |
Type | Protein identification Bioinformatics |
License | proprietary, free for online use |
Website | http://www.matrixscience.com/ |
Mascot is a software search engine that uses mass spectrometry data to identify proteins from peptide sequence databases.[1][2] Mascot is widely used by research facilities around the world. Mascot uses a probabilistic scoring algorithm for protein identification that was adapted from the MOWSE algorithm. Mascot is freely available to use on the website of Matrix Science.[3] A license is required for in-house use where more features can be incorporated.
MOWSE was one of the first algorithms developed for protein identification using peptide mass fingerprinting.[4] It was originally developed in 1993 as a collaboration between Darryl Pappin of the Imperial Cancer Research Fund (ICRF) and Alan Bleasby of the Science and Engineering Research Council (SERC). MOWSE stood apart from other protein identification algorithms in that it produced a probability-based score for identification. It was also the first to take into account the non-uniform distribution of peptide sizes, caused by the enzymatic digestion of a protein that is needed for mass spectrometry analysis. However, MOWSE was only applicable to peptide mass fingerprint searches and was dependent on pre-compiled databases which were inflexible with regard to post-translational modifications and enzymes other than trypsin. To overcome these limitations, to take advantage of multi-processor systems and to add non-enzymatic search functionality, development was begun again from scratch by David Perkins at the Imperial Cancer Research Fund. The first versions were developed for Silicon Graphics Irix and Digital Unix systems. Eventually this software was named Mascot and to reach a wider audience, an external bioinformatics company named Matrix Science was created by David Creasy and John Cottrell to develop and distribute Mascot. Legacy software versions exist for Tru64, Irix, AIX, Solaris, Microsoft Windows NT4 and Microsoft Windows 2000. Mascot has been available as a free service on the Matrix Science website since 1999 and has been cited in scientific literature over 5,000 times. Matrix Science still continues to work on improving Mascot’s functionality.
Mascot identifies proteins by interpreting mass spectrometry data. The prevailing experimental method for protein identification is a bottom-up approach, where a protein sample is typically digested with Trypsin to form smaller peptides. While most proteins are too big, peptides usually fall within the limited mass range that a typical mass spectrometer can measure. Mass spectrometers measure the molecular weights of peptides in a sample. Mascot then compares these molecular weights against a database of known peptides. The program cleaves every protein in the specified search database in silico according to specific rules depending on the cleavage enzyme used for digestion and calculates the theoretical mass for each peptide. Mascot then computes a score based on the probability that the peptides from a sample match those in the selected protein database. The more peptides Mascot identifies from a particular protein, the higher the Mascot score for that protein.
The software processes data from mass spectrometers of the following companies:
Mascot’s fundamental approach to identifying peptides is to calculate the probability whether an observed match between experimental data and peptide sequences found in a reference database has occurred by chance. The match with the lowest probability of occurring by chance is returned as the most significant match. The significance of the match depends on the size of the database that is being queried. Mascot employs the widely used significance level of 0.05, meaning that in a single test the probability of observing an event at random is less than or equal to 1 in 20. In this light, a score of 10−5 might seem very promising. However, if the database being searched contains 106 sequences several scores of this magnitude would be expected by chance alone because the algorithm carried out 106 individual comparisons. For a database of that size, by applying a Bonferroni correction to account for multiple comparisons, the significance threshold drops to 5*10−8.[1]
In addition to the calculated peptide scores, Mascot also estimates the False Discovery Rate (FDR) by searching against a decoy database. When performing a decoy search, Mascot generates a randomized sequence of the same length for every sequence in the target database. The decoy sequence is generated such that it has the same average amino acid composition as the target database. The FDR is estimated as the ratio of decoy database matches to target database matches. This relates to the standard formula FDR = FP / (FP + TP), where FP are false positives and TP are true positives. The decoy matches are certain to be spurious identifications, but we can't discriminate between true and false positives identified in the target database. FDR estimation was added in response to journals' guidelines on protein identification reports like the ones from Molecular and Cellular Proteomics.[5] Mascot's FDR calculation incorporates ideas from different publications.[6][7]
The most common alternative database search programs are listed in the Mass spectrometry software article. The performance of a variety of mass spectrometry software, including Mascot, can be observed in the 2011 iPRG study. Genome-based peptide fingerprint scanning is another method that compares the peptide fingerprints to the entire genome instead of only annotated genes.