Common File Formats in DNA/RNA Sequencing
This module introduces key file formats used throughout the sequencing data pipeline, from raw reads to variant calling. Understanding these formats is essential for anyone working with next-generation sequencing (NGS) data.
1. FASTQ - Raw Read Format
- Description: Stores raw sequencing reads along with quality scores.
- Type: Text-based
- Structure: Each record spans four lines, in the order listed under Fields.
- Fields:
  - @ReadID: Identifier for the sequencing read.
  - Sequence line: Nucleotide sequence.
  - + separator (optional description).
  - Quality scores (ASCII-encoded), one character per base.
- Used In: Raw data from sequencing platforms, input for aligners (e.g., BWA, Bowtie2).
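
The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch, not a full FASTQ parser: the names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical, the record `read1` is made up for illustration, and the quality offset of 33 (Phred+33) is an assumption that holds for most current data but not for some older files.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode ASCII-encoded qualities; offset 33 is assumed (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# Made-up record for illustration only.
example = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(read_fastq(io.StringIO(example)))
scores = phred_scores(records[0].quality)
print(records[0].read_id, records[0].sequence, scores)
```

Here `I` decodes to a quality of 40 and `#` to 2, so the last base of the example read is the least reliable one.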
2. SAM - Sequence Alignment/Map
- Description: Human-readable format for aligned reads.
- Type: Text-based
- Structure:
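  - Header lines start with `@`
  - Alignment lines are tab-separated and include:
    - Read name (QNAME)
    - Flag (FLAG)
    - Reference name (RNAME)
    - Position (POS, 1-based)
    - Mapping quality (MAPQ)
    - CIGAR string (CIGAR)
    - Mate information (RNEXT, PNEXT, TLEN)
    - Sequence (SEQ)
    - Quality (QUAL)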
- Used In: Intermediate results; inspection/debugging of alignments.
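
To make the alignment-line fields concrete, the sketch below splits one SAM line into its eleven mandatory columns, decodes two flag bits, and breaks the CIGAR string into operations. The helper names (`parse_sam_line`, `parse_cigar`, `is_reverse_strand`, `is_unmapped`) and the example line are hypothetical; in practice a library such as pysam would normally handle this.

```python
import re
from typing import Dict, List, Tuple

# The eleven mandatory SAM columns, in file order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line (not a header line) into named fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, object] = dict(zip(SAM_COLUMNS, fields[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(fields[SAM_COLUMNS.index(name)])
    record["TAGS"] = fields[11:]  # optional TAG:TYPE:VALUE fields
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]; '*' means unavailable."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def is_reverse_strand(flag: int) -> bool:
    return bool(flag & 0x10)  # bit 0x10: read maps to the reverse strand


def is_unmapped(flag: int) -> bool:
    return bool(flag & 0x4)  # bit 0x4: read is unmapped


# Made-up alignment line for illustration only.
line = "read1\t16\tchr1\t100\t60\t5M1I4M\t*\t0\t0\tGATTACAGAT\tIIIIIIIIII\tNM:i:1"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], parse_cigar(rec["CIGAR"]), is_reverse_strand(rec["FLAG"]))
```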
3. BAM - Binary Alignment/Map
- Description: Binary, compressed version of SAM.
- Type: Binary
- Benefits:
  - Smaller file size than the equivalent SAM
  - Faster to process; when coordinate-sorted and indexed, supports random access to a genomic region
- Used In: Standard format for storing aligned NGS data.
4. CRAM - Compressed Reference-based Format
- Description: Highly compressed alternative to BAM.
- Type: Binary (uses reference-based compression)
- Benefits:
  - More storage-efficient than BAM
- Trade-off: Decoding requires access to the reference genome used for compression
- Used In: Large-scale sequencing projects; long-term storage.
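
The idea behind reference-based compression can be illustrated with a toy example: instead of storing every base of a read, store where it aligns and only the bases that differ from the reference. This is a deliberately simplified sketch, not the CRAM codec. It handles substitutions only, uses 0-based positions, and the names `encode_read` and `decode_read` and the sequences are hypothetical.

```python
from typing import List, Tuple

# (start position, read length, list of (offset, base) substitutions)
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, pos: int, read: str) -> Encoded:
    """Store a read as its differences from the reference (substitutions only)."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base)
             for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read; this is impossible without the reference sequence."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Made-up sequences for illustration only.
reference = "ACGTACGTACGT"
read = "TACCTA"  # aligns at 0-based position 3 with one mismatch
encoded = encode_read(reference, 3, read)
print(encoded, decode_read(reference, encoded))
```

The encoded form holds one substitution rather than six bases, and `decode_read` needs `reference` to recover the read, which mirrors why the reference genome must be available when reading CRAM.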
5. VCF - Variant Call Format
- Description: Stores genetic variants (e.g., SNPs, indels) called against a reference genome.
- Type: Text-based
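- Structure: Meta-information lines start with `##`, followed by a header line starting with `#CHROM` and one tab-separated data line per variant.
- Fields:
  - Chromosome (CHROM)
  - Position (POS, 1-based)
  - Reference/alternate alleles (REF, ALT)
  - Quality metrics (QUAL, FILTER)
  - Genotype information (FORMAT and per-sample columns)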
- Used In: Output of variant callers (e.g., GATK, FreeBayes).
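
The fields above map directly onto the tab-separated columns of a VCF data line. The sketch below reads a small in-memory VCF and extracts them. The function name `read_vcf`, the sample name `sample1`, and the variant itself are hypothetical, and the code skips details such as INFO parsing; a library such as pysam or cyvcf2 is the usual choice for real data.

```python
import io
from typing import Dict, List, TextIO


def read_vcf(handle: TextIO) -> List[Dict[str, object]]:
    """Parse data lines of a plain-text VCF into dictionaries."""
    samples: List[str] = []
    variants: List[Dict[str, object]] = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank and meta-information lines
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]  # sample names follow FORMAT
            continue
        cols = line.split("\t")
        genotypes: Dict[str, Dict[str, str]] = {}
        if len(cols) > 9:
            keys = cols[8].split(":")  # FORMAT column, e.g. GT:DP
            for sample, value in zip(samples, cols[9:]):
                genotypes[sample] = dict(zip(keys, value.split(":")))
        variants.append({
            "chrom": cols[0],
            "pos": int(cols[1]),
            "id": cols[2],
            "ref": cols[3],
            "alts": cols[4].split(","),
            "qual": None if cols[5] == "." else float(cols[5]),
            "filter": cols[6],
            "info": cols[7],
            "genotypes": genotypes,
        })
    return variants


# Made-up single-variant file for illustration only.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30\n"
)
variants = read_vcf(io.StringIO(example))
print(variants[0]["chrom"], variants[0]["pos"], variants[0]["genotypes"])
```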
6. BCF - Binary Call Format
- Description: Binary version of VCF.
- Type: Binary
- Benefits:
  - More compact than text VCF
  - Faster to read and write, since fields do not need to be parsed from text
- Used In: Efficient analysis and storage of variant data.
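
Because BAM, CRAM, and BCF are binary, they cannot be identified by eye, but each begins with a recognizable signature. The sketch below checks those leading bytes: CRAM files start with `CRAM`, while BAM and BCF are normally BGZF-compressed (a gzip-compatible format) and start with `BAM\x01` or `BCF` once decompressed. The function name `sniff_format` is hypothetical, and the code assumes it is given the complete file contents, or at least one complete compressed block.

```python
import gzip
import io


def sniff_format(data: bytes) -> str:
    """Guess a binary sequencing format from its leading bytes."""
    if data[:4] == b"CRAM":
        return "CRAM"
    if data[:2] == b"\x1f\x8b":  # gzip magic number, shared by BGZF
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
            head = fh.read(4)
    else:
        head = data[:4]  # e.g. an uncompressed BCF stream
    if head == b"BAM\x01":
        return "BAM"
    if head[:3] == b"BCF":
        return "BCF"
    return "unknown (possibly a text format such as FASTQ, SAM, or VCF)"


# Synthetic byte strings that carry only the signatures, for illustration only.
fake_bam = gzip.compress(b"BAM\x01" + b"\x00" * 8)
fake_bcf = gzip.compress(b"BCF\x02\x02" + b"\x00" * 8)
fake_cram = b"CRAM\x03\x00"

print(sniff_format(fake_bam), sniff_format(fake_bcf), sniff_format(fake_cram))
```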
Summary Table
| Format | Type | Purpose | Typically used with |
|---|---|---|---|
| FASTQ | Text | Raw sequencing reads | Aligners |
| SAM | Text | Aligned reads (human-readable) | Alignment tools |
| BAM | Binary | Compressed aligned reads | Analysis tools |
| CRAM | Binary | Reference-compressed alignments | Archival storage |
| VCF | Text | Genetic variants | Variant callers |
| BCF | Binary | Compressed variant format | Genomic analysis |