Short description: File format for sequence data
| Pileup | |
|---|---|
| Filename extensions | .msf, .pup, .pileup |
| Developed by | Tony Cox and Zemin Ning |
| Type of format | Bioinformatics |
| Extended from | Tab-separated values |
| Website | www.htslib.org/doc/samtools-mpileup.html |
Pileup format is a text-based format for summarizing the base calls of reads aligned to a reference sequence. It facilitates visual inspection of alignments and of SNP and indel calls. It was first used by Tony Cox and Zemin Ning at the Wellcome Trust Sanger Institute, and became widely known through its implementation within the SAMtools software suite.[1]
Format
Example
| Sequence | Position | Reference Base | Read Count | Read Results | Quality |
|---|---|---|---|---|---|
| seq1 | 272 | T | 24 | ,.$.....,,.,.,...,,,.,..^+. | <<<+;<<<<<<<<<<<=<;<;7<& |
| seq1 | 273 | T | 23 | ,.....,,.,.,...,,,.,..A | <<<;<<<<<<<<<3<=<<<;<<+ |
| seq1 | 274 | T | 23 | ,.$....,,.,.,...,,,.,... | 7<7;<;<<<<<<<<<=<;<;<<6 |
| seq1 | 275 | A | 23 | ,$....,,.,.,...,,,.,...^l. | <+;9*<<<<<<<<<=<<:;<<<< |
| seq1 | 276 | G | 22 | ...T,,.,.,...,,,.,.... | 33;+<<7=7<<7<&<<1;<<6< |
| seq1 | 277 | T | 22 | ....,,.,.,.C.,,,.,..G. | +7<;<<<<<<<&<=<<:;<<&< |
| seq1 | 278 | G | 23 | ....,,.,.,...,,,.,....^k. | 8*<<;<7<<7<=<<<;<<<<< |
| seq1 | 279 | C | 23 | A..T,,.,.,...,,,.,..... | 75&<<<<<<<<<=<<<9<<:<<< |
The columns
Each line consists of 5 (or optionally 6) tab-separated columns:
- Sequence identifier
- Position in sequence (starting from 1)
- Reference nucleotide at that position
- Number of aligned reads covering that position (depth of coverage)
- Bases at that position from aligned reads
- Phred quality of those bases, encoded as ASCII characters with an offset of 33 (optional)
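The column layout above can be sketched as a small parser. This is a minimal illustration, not the SAMtools implementation: the names `PileupRecord` and `parse_pileup_line` are hypothetical, and the sketch assumes a single-sample pileup with exactly 5 or 6 tab-separated columns.

```python
from typing import NamedTuple, Optional


class PileupRecord(NamedTuple):
    """One pileup line. The class and field names are illustrative only."""
    seq_id: str
    position: int          # 1-based position in the sequence
    ref_base: str
    depth: int             # number of reads covering the position
    bases: str             # column 5, still encoded
    quals: Optional[str]   # column 6, or None when it is absent


def parse_pileup_line(line: str) -> PileupRecord:
    """Split one tab-separated pileup line into its 5 or 6 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    seq_id, position, ref_base, depth, bases = fields[:5]
    quals = fields[5] if len(fields) == 6 else None
    return PileupRecord(seq_id, int(position), ref_base, int(depth), bases, quals)


# Position 276 from the example table, written with real tab separators.
record = parse_pileup_line(
    "seq1\t276\tG\t22\t...T,,.,.,...,,,.,....\t33;+<<7=7<<7<&<<1;<<6<"
)
print(record.position, record.ref_base, record.depth)
```

Keeping columns 5 and 6 as raw strings at this stage is a deliberate choice: both need their own decoding step, which depends on the symbols described in the following sections.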
Column 5: The bases string
- . (dot) means a base that matched the reference on the forward strand
- , (comma) means a base that matched the reference on the reverse strand
- < or > (less-than or greater-than sign) denotes a reference skip. This occurs, for example, when a base in the reference genome is intronic and a read maps to the two flanking exons. If quality scores are given in a sixth column, the score for this symbol refers to the quality of the read and not of a specific base.
- AGTCN (upper case) denotes a base that did not match the reference on the forward strand
- agtcn (lower case) denotes a base that did not match the reference on the reverse strand
- A sequence matching the regular expression \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position. The number gives the length of the inserted sequence. For example, +2AG means an insertion of AG on the forward strand
- A sequence matching the regular expression \-[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position. The number gives the length of the deleted sequence. For example, -2ct means a deletion of CT on the reverse strand
- ^ (caret) marks the start of a read segment; the ASCII value of the character following ^, minus 33, gives the mapping quality
- $ (dollar) marks the end of a read segment
- * (asterisk) is a placeholder for a deleted base in a multiple-base-pair deletion that was announced on a previous line by the \-[0-9]+[ACGTNacgtn]+ notation
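The symbols above can be decoded with a single left-to-right scan. The following is a minimal sketch under stated assumptions: the function name `tokenize_bases` and the token labels are hypothetical, and input is assumed to be well formed. Note that the scan reads the indel length and then consumes exactly that many letters, because a greedy regular expression would also swallow a mismatch base that directly follows the indel (as in `+2AGT`).

```python
def tokenize_bases(bases):
    """Return a list of (kind, value) tokens for a column-5 string.

    The token names are illustrative, not part of the format.
    """
    tokens = []
    i, n = 0, len(bases)
    while i < n:
        c = bases[i]
        if c == "^":
            # The character after '^' encodes mapping quality (ASCII - 33).
            tokens.append(("read_start", ord(bases[i + 1]) - 33))
            i += 2
        elif c == "$":
            tokens.append(("read_end", None))
            i += 1
        elif c in "+-":
            # Read the decimal length, then exactly that many bases.
            j = i + 1
            while j < n and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            kind = "insertion" if c == "+" else "deletion"
            tokens.append((kind, bases[j:j + length]))
            i = j + length
        elif c in ".,":
            tokens.append(("match", "forward" if c == "." else "reverse"))
            i += 1
        elif c in "<>":
            tokens.append(("ref_skip", c))
            i += 1
        elif c == "*":
            tokens.append(("deleted", c))
            i += 1
        else:
            # Upper case: forward-strand mismatch; lower case: reverse strand.
            tokens.append(("mismatch", c))
            i += 1
    return tokens


# Kinds that represent one read covering the position.
BASE_KINDS = {"match", "mismatch", "ref_skip", "deleted"}

# Column 5 of position 275 in the example table (read count 23).
tokens = tokenize_bases(",$....,,.,.,...,,,.,...^l.")
covering = sum(1 for kind, _ in tokens if kind in BASE_KINDS)
print(covering)
```

Only the tokens in `BASE_KINDS` contribute to the read count in column 4; read-start, read-end and indel annotations attach to a neighbouring base rather than counting as reads themselves.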
Column 6: The base quality string
This is an optional column. If present, the ASCII value of each character minus 33 gives the Phred base quality of the corresponding base in column 5. This is the same encoding used for quality scores in the FASTQ format.
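As a worked illustration of the offset-33 encoding, the sketch below decodes a quality string into Phred scores and converts a score to an error probability using the standard Phred relation Q = -10 log10(P). The function names are hypothetical and chosen for this example only.

```python
from typing import List


def decode_qualities(quals: str, offset: int = 33) -> List[int]:
    """Convert each ASCII character to its Phred score (ASCII value - offset)."""
    return [ord(ch) - offset for ch in quals]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# '<' is ASCII 60, '+' is ASCII 43 and '&' is ASCII 38.
scores = decode_qualities("<+&")
print(scores)                   # [27, 10, 5]
print(error_probability(10))
```

In the example table, the frequent `<` character therefore corresponds to a base quality of 27.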
File extension
There is no standard file extension for a Pileup file, but .msf (multiple sequence file), .pup[2] and .pileup[3][4] are used.
See also
- Variant Call Format
- FASTQ format
- List of file formats for molecular biology
References
- [1] Li H.; Handsaker B.; Wysoker A.; Fennell T.; Ruan J.; Homer N.; Marth G.; Abecasis G. et al. (2009). "The Sequence alignment/map (SAM) format and SAMtools". Bioinformatics 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. PMID 19505943.
- [2] Accelrys (1998-10-02). "QUANTA: Protein Design. 3. Reading and Writing Sequence Data Files". Université de Montréal. http://www.esi.umontreal.ca/accelrys/life/quanta2K/protein/03_Sequence_data_files.html.
- [3] Glez-Peña, Daniel; Gómez-López, Gonzalo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino; Pisano, David G (2011-01-24). "PileLine: a toolbox to handle genome position information in next-generation sequencing studies". BMC Bioinformatics 12: 31. doi:10.1186/1471-2105-12-31. ISSN 1471-2105. PMID 21261974.
- [4] Chisom, Halimat (2023-03-31). "File Formats Every Bioinformatician — Established or Upcoming — Must Know (and then some)". https://medium.com/@gearthdexter/bioinformatics-file-formats-3919a26b7679.
External links
- SAMtools pileup description
- bioruby-pileup_iterator (A Ruby pileup parser)
- pysam (A Python pileup parser)
Original source: https://en.wikipedia.org/wiki/Pileup format.