DNA Compression

Description

This project applies compression algorithms to reduce the storage required for DNA sequences. Mitochondrial DNA (mtDNA) was used as a model for compressing full genome sequences, and chromosome 1 of the HapMap data has also been tried. The downloadable software lets you run the same analysis on your own data set.


Data

Software

  • Should work on most Unix-based operating systems (it was written on a Mac).
  • Requires Perl and the Perl modules listed in the documentation.
  • Download the software, then gunzip, untar, and see the README.
  • SQLite is required only if working with HapMap data.
  • The SQLite chromosome 1 database (1.2 GB) is also required only if working with HapMap data. Gunzip and untar this file in the same directory that you used for the compression software.
  • Contact Marty Brandon if you have difficulty.

Results

mtDNA

Encodings

HapMap

Encodings

  • Position-variant
    • analysis
    • consensus - A list of the consensus haplotypes at each variable position in chromosome 1.
    • Assumes one position-variant pair per line followed by a one-character newline.
    • For each position, a consensus haplotype was computed and data matching the consensus haplotype was removed.
    • File identifiers are not included in the size calculation.
    • Positions are the absolute DNA position as reported in the HapMap data file.
    • Use the average number of variants per sequence reported here together with the expected costs per runlength reported in the encoding schemes below to get the expected cost per sequence.
  • Huffman
  • Golomb
  • Elias-Gamma
  • Unary
    • analysis - Not a practical encoding; it was done only as an exercise.
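
The position-variant and runlength ideas listed above can be illustrated with a small Python sketch. This is not the project's Perl software, and every function name in it is hypothetical. It assumes that a "runlength" is the gap between consecutive variant positions, which this page does not state explicitly. The Golomb code is shown only for power-of-two divisors (the Rice special case), and Huffman coding is left out to keep the example short.

```python
from collections import Counter

def consensus(samples):
    """Most common variant at each position (ties go to the first one seen)."""
    columns = {}
    for sample in samples:
        for pos, variant in sample.items():
            columns.setdefault(pos, Counter())[variant] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in columns.items()}

def position_variants(sample, cons):
    """Sorted (position, variant) pairs where the sample differs from the consensus."""
    return sorted((p, v) for p, v in sample.items() if cons.get(p) != v)

def run_lengths(pairs):
    """Gaps between consecutive variant positions (assumption: first gap counts from 0)."""
    gaps, prev = [], 0
    for pos, _ in pairs:
        gaps.append(pos - prev)
        prev = pos
    return gaps

def unary(n):
    """Unary code for n >= 1: (n - 1) ones, then a terminating zero."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias-gamma code for n >= 1: binary form of n, prefixed by (length - 1) zeros."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def golomb_rice(n, k):
    """Golomb code for n >= 1 with divisor 2**k (Rice special case only)."""
    q, r = divmod(n - 1, 1 << k)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")

def expected_bits(freqs, encode):
    """Expected code length in bits, given {run length: relative frequency}."""
    return sum(f * len(encode(n)) for n, f in freqs.items())

if __name__ == "__main__":
    # Toy data: each sample maps an absolute position to its variant.
    samples = [
        {100: "A", 250: "C", 900: "G"},
        {100: "A", 250: "T", 900: "G"},
        {100: "G", 250: "C", 900: "A"},
    ]
    cons = consensus(samples)
    for sample in samples:
        pairs = position_variants(sample, cons)
        gaps = run_lengths(pairs)
        print(pairs, gaps, [elias_gamma(g) for g in gaps])
```

A function such as `expected_bits` could be applied to a runlength frequency table like the ones attached below, which gives an expected cost per runlength in the sense of the note above. How the original analysis combined the runlength and variant costs is not shown here.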

Topic attachments
| Attachment | Size | Date | Who | Comment |
| ASHG08_HapMapTutorial.ppt | 7082.0 K | 03 Feb 2009 - 19:55 | MartyBrandon | HapMap PowerPoint tutorial downloaded from the HapMap site. |
| accession_list.txt | 35.3 K | 01 Dec 2008 - 16:22 | MartyBrandon | List of GenBank accession numbers for the sequences used. |
| chr1.tar.gz | 1093891.7 K | 13 Apr 2009 - 17:04 | MartyBrandon | SQLite database file for chromosome 1 |
| chr1_hapmap_consensus.txt | 3663.3 K | 07 Mar 2009 - 17:53 | MartyBrandon | HapMap chromosome 1 consensus variants. |
| chr1_runlength_Elias-Gamma.txt | 2801.6 K | 07 Mar 2009 - 15:36 | MartyBrandon | Chromosome 1 runlength encoding using Elias-Gamma |
| chr1_runlength_Golomb.txt | 2005.6 K | 07 Mar 2009 - 15:37 | MartyBrandon | Chromosome 1 runlength encoding using Golomb |
| chr1_runlength_Huffman.txt | 2331.2 K | 07 Mar 2009 - 15:38 | MartyBrandon | Chromosome 1 runlength encoding using Huffman |
| chr1_runlength_counts.txt | 1719.5 K | 07 Mar 2009 - 15:22 | MartyBrandon | HapMap chromosome 1 runlength counts |
| chr1_runlength_freqs.txt | 1592.1 K | 07 Mar 2009 - 15:28 | MartyBrandon | HapMap chromosome 1 runlength frequencies |
| chr1_variant_Elias-Gamma.txt | 0.1 K | 07 Mar 2009 - 15:39 | MartyBrandon | Chromosome 1 variant encoding using Elias-Gamma |
| chr1_variant_Golomb.txt | 0.1 K | 07 Mar 2009 - 15:40 | MartyBrandon | Chromosome 1 variant encoding using Golomb |
| chr1_variant_Huffman.txt | 0.1 K | 07 Mar 2009 - 15:40 | MartyBrandon | Chromosome 1 variant encoding using Huffman |
| chr1_variant_Unary.txt | 0.1 K | 07 Mar 2009 - 16:06 | MartyBrandon | Chromosome 1 variant encoding using Unary |
| chr1_variant_counts.txt | 0.2 K | 07 Mar 2009 - 15:23 | MartyBrandon | HapMap chromosome 1 variant counts |
| chr1_variant_freqs.txt | 0.2 K | 07 Mar 2009 - 15:28 | MartyBrandon | HapMap chromosome 1 variant freqs |
| compression_results.txt | 0.8 K | 24 Nov 2008 - 14:07 | MartyBrandon | Final compression results for the full collection of mtDNA sequences. |
| dna_compression_software.tar.gz | 6117.6 K | 13 Apr 2009 - 18:07 | MartyBrandon | DNA compression software |
| hapmap_chr1_Elias-Gamma.txt | 0.3 K | 07 Mar 2009 - 15:50 | MartyBrandon | Analysis of the HapMap chromosome 1 Elias-Gamma encoding |
| hapmap_chr1_Golomb.txt | 0.3 K | 07 Mar 2009 - 15:50 | MartyBrandon | Analysis of the HapMap chromosome 1 Golomb encoding |
| hapmap_chr1_Huffman.txt | 0.3 K | 07 Mar 2009 - 15:51 | MartyBrandon | Analysis of the HapMap chromosome 1 Huffman encoding |
| hapmap_chr1_Unary.txt | 0.3 K | 07 Mar 2009 - 15:51 | MartyBrandon | Analysis of the HapMap chromosome 1 Unary encoding |
| hapmap_chr1_compression_factors.txt | 0.9 K | 17 Mar 2009 - 10:19 | MartyBrandon | Compression factors computed for each of the encodings. |
| hapmap_chr1_data_size.txt | 0.6 K | 09 Mar 2009 - 11:15 | MartyBrandon | File sizes of the HapMap data for chromosome 1 |
| hapmap_chr1_pv.txt | 0.9 K | 10 Mar 2009 - 16:25 | MartyBrandon | Analysis of the HapMap chromosome 1 position-variant notation |
| mtdna_consensus_Elias-Gamma.txt | 0.4 K | 13 Mar 2009 - 16:49 | MartyBrandon | Analysis of mtDNA consensus encoding using Elias-Gamma |
| mtdna_consensus_Golomb.txt | 0.4 K | 13 Mar 2009 - 16:42 | MartyBrandon | Analysis of mtDNA consensus encoding using Golomb |
| mtdna_consensus_Huffman.txt | 0.4 K | 13 Mar 2009 - 16:42 | MartyBrandon | Analysis of mtDNA consensus encoding using Huffman |
| mtdna_consensus_compression_factors.txt | 0.9 K | 15 Mar 2009 - 16:54 | MartyBrandon | Compression factors computed for mtdna_consensus |
| mtdna_consensus_pv.txt | 0.9 K | 13 Mar 2009 - 16:43 | MartyBrandon | Analysis of the mtDNA_consensus position-variant notation |
| mtdna_consensus_runlength_counts.txt | 61.4 K | 12 Mar 2009 - 16:17 | MartyBrandon | mtDNA consensus runlength counts |
| mtdna_consensus_runlength_freqs.txt | 75.0 K | 12 Mar 2009 - 16:13 | MartyBrandon | mtDNA consensus runlength freqs |
| mtdna_consensus_variant_counts.txt | 1.0 K | 12 Mar 2009 - 16:17 | MartyBrandon | mtDNA consensus variant counts |
| mtdna_consensus_variant_freqs.txt | 0.9 K | 12 Mar 2009 - 16:14 | MartyBrandon | mtDNA consensus variant freqs |
| mtdna_file_sizes.txt | 0.4 K | 13 Mar 2009 - 16:43 | MartyBrandon | Predicted file sizes for mtDNA data |
| mtdna_rcrs_Elias-Gamma.txt | 0.4 K | 13 Mar 2009 - 16:44 | MartyBrandon | Analysis of mtDNA rCRS encoding using Elias-Gamma |
| mtdna_rcrs_Golomb.txt | 0.4 K | 13 Mar 2009 - 16:45 | MartyBrandon | Analysis of mtDNA rCRS encoding using Golomb |
| mtdna_rcrs_Huffman.txt | 0.4 K | 13 Mar 2009 - 16:45 | MartyBrandon | Analysis of mtDNA rCRS encoding using Huffman |
| mtdna_rcrs_compression_factors.txt | 0.9 K | 15 Mar 2009 - 16:53 | MartyBrandon | Compression factors computed for mtdna_rcrs |
| mtdna_rcrs_pv.txt | 0.9 K | 13 Mar 2009 - 16:45 | MartyBrandon | Analysis of the mtDNA_rCRS position-variant notation |
| mtdna_rcrs_runlength_counts.txt | 45.7 K | 13 Mar 2009 - 15:39 | MartyBrandon | mtDNA rCRS runlength counts |
| mtdna_rcrs_runlength_freqs.txt | 55.8 K | 12 Mar 2009 - 16:10 | MartyBrandon | mtDNA rCRS runlength freqs |
| mtdna_rcrs_variant_counts.txt | 1.0 K | 13 Mar 2009 - 15:39 | MartyBrandon | mtDNA rCRS variant counts |
| mtdna_rcrs_variant_freqs.txt | 0.9 K | 12 Mar 2009 - 16:08 | MartyBrandon | mtDNA rCRS variant freqs |
| mtdna_sequence_stats.txt | 0.5 K | 13 Mar 2009 - 09:24 | MartyBrandon | Sequence statistics for mtDNA sequences. |
Topic revision: r51 - 13 Apr 2009 - 18:07:35 - MartyBrandon
 