Common File Formats in DNA/RNA Sequencing


This module introduces key file formats used throughout the sequencing data pipeline, from raw reads to variant calling. Understanding these formats is essential for anyone working with next-generation sequencing (NGS) data.


1. FASTQ - Raw Read Format

  • Description: Stores raw sequencing reads along with quality scores.
  • Type: Text-based
  • Structure: Four lines per read.
  • Fields:
    • @ReadID: Identifier for the sequencing read.
    • Sequence line: Nucleotide sequence.
    • + separator (optionally followed by a description).
    • Quality scores (ASCII-encoded, one character per base).
  • Used In: Raw data from sequencing platforms, input for aligners (e.g., BWA, Bowtie2).
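
The four-line record layout can be read with a few lines of Python. This is a minimal sketch, assuming strictly four lines per record (no wrapped sequences) and Phred+33 quality encoding; the names `parse_fastq` and `phred33` and the example read are illustrative and not part of any library.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def phred33(quality: str) -> List[int]:
    """Decode an ASCII quality string, assuming the Phred+33 offset."""
    return [ord(ch) - 33 for ch in quality]


def parse_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (a blank line also stops parsing)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, phred33(quality)


# Made-up record, for illustration only.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
print(records)
```

For real datasets, an established parser is a safer choice than a hand-written loop, since FASTQ files are often gzip-compressed and very large.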

2. SAM - Sequence Alignment/Map

  • Description: Human-readable format for aligned reads.
  • Type: Text-based
  • Structure:
    • Header lines start with @
    • Alignment lines are tab-separated and include:
      • Read name
      • Flag
      • Reference name
      • Position
      • Mapping quality
      • CIGAR string
      • Mate information (mate reference name, mate position, template length)
      • Sequence
      • Quality
  • Used In: Intermediate results; inspection/debugging of alignments.
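
Because alignment lines are tab-separated, they can be split into named fields directly. The sketch below assumes the eleven mandatory columns appear in the order given by the SAM specification and leaves optional tags unparsed; `SamRecord`, `parse_sam_line`, `describe_flag`, and the example line are hypothetical names and data chosen for illustration.

```python
from typing import Dict, List, NamedTuple


class SamRecord(NamedTuple):
    qname: str        # read name
    flag: int         # bitwise flag
    rname: str        # reference name
    pos: int          # 1-based leftmost mapping position
    mapq: int         # mapping quality
    cigar: str        # CIGAR string
    rnext: str        # mate reference name
    pnext: int        # mate position
    tlen: int         # observed template length
    seq: str          # read sequence
    qual: str         # ASCII-encoded base qualities
    tags: List[str]   # optional TAG:TYPE:VALUE fields, left unparsed


# A subset of the flag bits defined by the SAM specification.
FLAG_BITS: Dict[int, str] = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x100: "secondary",
    0x400: "duplicate",
}


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line) into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("@") or len(cols) < 11:
        raise ValueError("not an alignment line with 11 mandatory columns")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], cols[11:],
    )


def describe_flag(flag: int) -> List[str]:
    """List the names of the known bits that are set in a flag value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Made-up alignment line, for illustration only.
record = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
print(record.rname, record.pos, describe_flag(record.flag))
```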

3. BAM - Binary Alignment/Map

  • Description: Binary, compressed version of SAM.
  • Type: Binary
  • Benefits:
    • Smaller file size than the equivalent SAM
    • Faster to process; when sorted and indexed, supports random access to genomic regions
  • Used In: Standard format for storing aligned NGS data.
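
A BAM file cannot be inspected in a text editor, but it can be recognised programmatically. BAM is compressed with BGZF, a gzip-compatible block format, and its decompressed content begins with the four bytes `BAM\1`. The sketch below relies on that fact; `looks_like_bam` is a hypothetical helper, and the demonstration files are stand-ins written on the fly, not real alignments.

```python
import gzip
import os
import tempfile

BAM_MAGIC = b"BAM\x01"


def looks_like_bam(path: str) -> bool:
    """Return True if the gzip-decompressed stream starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        return False  # not gzip-compatible (or truncated), so not a BAM file


# Self-contained demonstration using stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    fake_bam = os.path.join(tmp, "fake.bam")
    with gzip.open(fake_bam, "wb") as out:
        out.write(BAM_MAGIC + b"\x00" * 8)  # magic bytes plus filler
    plain_sam = os.path.join(tmp, "reads.sam")
    with open(plain_sam, "w") as out:
        out.write("@HD\tVN:1.6\n")
    bam_result = looks_like_bam(fake_bam)
    sam_result = looks_like_bam(plain_sam)

print(bam_result, sam_result)
```

This only checks the file signature. Reading actual alignments from a BAM file is better left to dedicated tools such as samtools.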

4. CRAM - Compressed Reference-oriented Alignment Map

  • Description: Highly compressed alternative to BAM.
  • Type: Binary (uses reference-based compression)
  • Benefits:
    • More storage-efficient than BAM
  • Limitation:
    • By default, decoding requires access to the same reference genome that was used for compression
  • Used In: Large-scale sequencing projects; long-term storage.
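
To see why a reference-compressed file cannot be decoded without its reference, consider the deliberately simplified toy below. It is not the CRAM codec; it only illustrates the idea of storing a read as an alignment position plus its differences from the reference. All names and sequences are made up, and the sketch assumes substitution-only reads (no indels or clipping).

```python
from typing import List, Tuple

# (0-based start on the reference, read length, [(offset, base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, start: int, read: str) -> Encoded:
    """Keep only the start, the read length, and the mismatching bases."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return start, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by copying the reference and applying mismatches."""
    start, length, mismatches = encoded
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
read = "TACCTA"  # aligns at offset 3 with one substitution
encoded = encode_read(reference, 3, read)
restored = decode_read(reference, encoded)
wrong = decode_read("A" * len(reference), encoded)  # different reference

print(encoded, restored, wrong)
```

The encoded form holds a single base of the read; everything else comes from the reference, which is why decoding against a different reference silently yields a different sequence.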

5. VCF - Variant Call Format

  • Description: Stores genetic variants (e.g., SNPs, indels) relative to a reference genome.
  • Type: Text-based
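  • Structure: Meta-information lines start with ##, followed by a column header line starting with #CHROM and one tab-separated line per variant.
  • Fields:
    • Chromosome
    • Position
    • Reference/alternate alleles
    • Quality metrics
    • Genotype information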
  • Used In: Output of variant callers (e.g., GATK, FreeBayes).
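
The variant fields map onto tab-separated columns, so a single data line can be unpacked as follows. This is a sketch under stated assumptions: the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) come first, followed by an optional FORMAT column and one column per sample. The helper names, the sample name, and the example line are hypothetical.

```python
from typing import Any, Dict, List


def parse_info(info: str) -> Dict[str, Any]:
    """Turn 'DP=14;DB' into {'DP': '14', 'DB': True}; '.' means no INFO."""
    result: Dict[str, Any] = {}
    if info == ".":
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True  # bare keys are flags
    return result


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Unpack one VCF data line (not a '#' header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) < 8:
        raise ValueError("not a VCF data line with 8 fixed columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = cols[:8]
    samples: Dict[str, Dict[str, str]] = {}
    if len(cols) > 9:
        keys = cols[8].split(":")  # FORMAT column, e.g. GT:DP
        for name, value in zip(sample_names, cols[9:]):
            samples[name] = dict(zip(keys, value.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
        "samples": samples,
    }


# Made-up variant line, for illustration only.
variant = parse_vcf_line(
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=14\tGT:DP\t0/1:14", ["sample1"]
)
print(variant["chrom"], variant["pos"], variant["samples"]["sample1"]["GT"])
```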

6. BCF - Binary Call Format

  • Description: Binary version of VCF.
  • Type: Binary
  • Benefits:
    • More compact than text VCF
    • Faster for tools to read and write, since no text parsing is needed
  • Used In: Efficient analysis and storage of variant data.

Summary Table

Format  Type    Purpose                          Input/Output for
FASTQ   Text    Raw sequencing reads             Aligners
SAM     Text    Aligned reads (human-readable)   Alignment tools
BAM     Binary  Compressed aligned reads         Analysis tools
CRAM    Binary  Reference-compressed alignments  Archival storage
VCF     Text    Genetic variants                 Variant callers
BCF     Binary  Compressed variant format        Genomic analysis

Further Reading