DATA COMPRESSION FOR DNA SEQUENCE

Abstract

The DNA sequences of an organism comprise its basic blueprint, so understanding and analyzing the genes within these sequences has become an extremely important task. Biologists produce huge volumes of DNA sequences every day, which makes genome sequence databases grow exponentially. Databases such as GenBank hold millions of DNA sequences, filling many thousands of gigabytes of computer storage. Hence, an efficient algorithm for compressing DNA sequences is required. In this paper, a compression algorithm based on the Huffman code tree is used to encode and compress DNA sequences. Using this algorithm, we assign a binary code (a string of 0s and 1s) to each base (A, T, C, and G). The code for each base is determined by tracing the path from the root of the tree to the leaf that represents that base. Huffman coding produces variable-length codes: bases with a higher frequency of occurrence receive shorter codes than bases with a lower frequency. As a result, this algorithm compresses DNA sequences better than the older fixed-length method, which assigns 2 bits per base. Analysis of the results shows that an average code length of 1.62 bits/base can be achieved with this algorithm. For a higher compression ratio, we advise combining the proposed method with another compression method, such as learning automata.
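The tree construction and root-to-leaf code assignment described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the function names `huffman_codes` and `average_code_length`, the convention of labeling left branches 0 and right branches 1, and the toy sequence are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(sequence):
    """Return a dict mapping each symbol in `sequence` to a Huffman bit string."""
    freqs = Counter(sequence)
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freqs)): "0"}

    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    # A node is either a symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, next(tie), sym) for sym, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, path):
        # Trace the path from the root to each leaf (assumed: 0 = left, 1 = right).
        if isinstance(node, tuple):
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            codes[node] = path

    walk(heap[0][2], "")
    return codes


def average_code_length(sequence, codes):
    """Average number of bits per base when `sequence` is encoded with `codes`."""
    return sum(len(codes[base]) for base in sequence) / len(sequence)


# Made-up toy sequence (not data from this paper): A x8, C x4, G x2, T x2.
toy = "AAAAAAAACCCCGGTT"
toy_codes = huffman_codes(toy)
print(toy_codes)                            # {'A': '0', 'C': '10', 'G': '110', 'T': '111'}
print(average_code_length(toy, toy_codes))  # 1.75
```

On this toy input the most frequent base receives a 1-bit code and the two rarest bases receive 3-bit codes, giving 1.75 bits/base against 2 bits/base for the fixed-length scheme. The gain depends entirely on how skewed the base frequencies are; for four equally frequent bases, Huffman coding falls back to 2 bits per base.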