際際滷

際際滷Share a Scribd company logo
Text Compression  Chapter 2
To reduce the volume of data to be transmitted (text, fax, images) To reduce the bandwidth required for transmission and to reduce storage requirements (speech, audio, video). Why Compress ? In almost all multimedia applications,  a technique known as  compression  is first applied to the source information prior to its  Transmission.  This is done either to reduce the volume of information to be transmitted   text , fax and images or to reduce the bandwidth that is required for its  transmission  speech , audio and video.
Compression How is compression possible? Redundancy in digital audio, image, and video data Properties of human perception Digital audio is a series of sample values image is a rectangular array of pixel values  video is a sequence of images played out at a certain rate Neighboring sample values are correlated
Adjacent audio samples are similar (predictive encoding) samples corresponding to silence (silence removal) In digital image, neighboring samples on a scanning line are normally similar (spatial redundancy) In digital video, in addition to spatial redundancy, neighboring images in a video sequence may be similar (temporal redundancy) Redundancy
Human Perception Factors Compressed version of digital audio, image, video need not represent the original information exactly Perception sensitivities are different for different signal patterns Human eye is less sensitive to the higher spatial frequency components than the lower frequencies (transform coding)
Compression Principles  Source encoders and destination decoders Lossless and lossy compression  Entropy encoding  Source encoding
Source Encoders and destination decoders  Prior to transmitting the source information relating to a  multimedia application , a compression algorithm is applied to it. This implies that in order for the destination to reproduce the original source information or in some instances, a nearly exact copy of it  a matching decompression algorithm must be applied to it.  The application of the compression algorithm is the main function carried out by the  source encoder  and the decompression algorithm is carried out by the  destination decoder .
Source Encoders and destination decoders  In applications which involve two computers communicating with each other, the time required to perform the compression and decompression algorithm is not always critical.  So both algorithms are normally implemented in software within the two computers. Source  information Source  encoder  program  Destination  Decoder  Program Copy of  Source  Information Network Source encoder / destination decoder : Software only  Source Computer Destination Computer
Source Encoders and destination decoders  In other applications, however the time required to perform the  compression and decompression algorithms in software is not acceptable  and instead the two algorithms must be performed by special processors  in separate units . Source  information Source  encoder  Processor Destination  Decoder  Processor Copy of  Source  Information Network Source Computer Destination Computer Source encoder / destination decoder : Special  processors/hardware
Classification  Lossless compression lossless compression for legal and medical documents, computer programs exploit only data redundancy Lossy compression digital audio, image, video where some errors or loss can be tolerated exploit both data redundancy and human perception properties Constant bit rate versus variable bit rate coding
Lossless compression Lossless compression algorithm  the aim is to reduce the amount of source  Information to be transmitted in such a way that, when the compressed  information Is decompressed , there is no loss of information. Lossless compression is said therefore , to be reversible. An  example  application of lossless compression is for the transfer over a  network of a text file since , in such applications , it it normally imperative that  No part of the source information is lost during either the compression or  decompression operations.
Lossy compression The aim of lossy compression algorithm is normally not to reproduce an  exact copy of the source information after decomposition but rather a  version of it which is perceived by the recipient as a true copy (approx) Example : digitized images and audio and video streams. In such cases, the sensitivity of the human eye or ear is such that any fine  details that may be missing from the original source signal after  decompression are not detectable.
Entropy Encoding  Entropy encoding is lossless and independent of the  type of information that is being compressed .  It is concerned solely with how the information is represented.  Run-length Encoding  Typical applications of this type of encoding are when the source information  Comprises long substrings of the same character or binary digit. Instead of transmitting in the form of independent code words or bits , it is transmitted  in the form of a different set of codewords which  indicate bits as well as number of bits  In the substring. Example :  input  : 00000001111111110000011. output  : 0,7,1,10,0,5,1,2
Statistical Encoding   In this technique, patterns of bits (word) or that are more frequent are recorded  using shorter codes. It uses a set of variable length codewords  with the shortest codewords used to  represent the most frequently occurring symbols. For example :   In a string of text , the character A may occurs more frequently than say  the character P which occurs more frequently than the character Z , and so on  Statistical encoding exploits this property by using a set of variable length  Codewords With the shortest codewords used to represent the most frequently  occurring symbols.
Statistical encoding Two steps  identifying most frequent bit or byte patterns in data  coding these patterns with fewer bits than initially represented  Code-book  a table of correspondence between the initial patterns and their new  code tradeoff: more bits for other patterns Huffman encoding  frequency of occurrences calculated for each octet  optimal code table generated based upon frequency of occurrences  uniquely decipherable because of prefix property
In practice , the use of variable-length codewords is not quite as straight forward as it first appears. Clearly as with run-length encoding , the destination must know the set of codewords being used by the source.  With variable length codewords , however in order for the decoding operation to  be carried out correctly , it is necessary to ensure that a shorter codeword in the set does not form the start of a longer codeword otherwise the decoder will interpret the string on the wrong codeword boundaries. A codeword set that avoids this happening is said to process the prefix and an  Encoding scheme that generates codewords that have this property is the  Huffman encoding algorithm.
Huffman Encoding Different characters need not be encoded with the same number of bits.  With the help of the knowledge of frequency occurrence of characters, the Huffman encoding algorithm provides the more frequent characters with the code having lesser number of bits.   An Encoding scheme that generates code words that have prefix property is called the  HUFFMAN ENCODING ALGORITHM
Huffman Encoding Eg: Consider weights 2, 4, 6, 7, 7, 9. First the algorithms repeatedly combines the smallest two weights to obtain shorter and shorter weight sequences. 2 ,  4 , 6, 7, 7, 9  replaces 2 and 4 by 2 + 4 and calls for 6 ,  6 , 7, 7, 9 which replaces 6 and 6 by 12 and calls for 7 ,  7 , 9, 12 which calls for 9 ,  12 , 14 which calls for  14, 21
The character string to be transmitted is first analyzed and character types and their relative  frequency determined .    Create an unbalanced tree with some branches shorter than others  The degree is a function of the relative frequency of occurrence of the  characters :  - larger spread  unbalanced tree     This is known as Huffman code tree
Huffman code tree  It is binary tree  Branches assigned value 0 or 1  Root node  base of the tree  Branch node  Point at which branch divides Leaf node  termination point
Static Huffman Encoding 7  00 7  01 9  10 6  111 2  1100 4  1101 14 21 9 12 7 7 6 6 2 4 0 1 0 0 0 0 1 1 1 1
Decoding Algorithm  End  Is codeword already stored ? Begin  Set CODEWORD to empty Read next bit from BITSTREAM and append to  existing bits in CODEWORD Load matching ASCII Character into Receive _buffer  All bits in BITSTREAM Processed n n
Dynamic Huffman Coding  Previous method requires both transmitter and the receiver to know the table of codewords relating to the data being transmitted . In this method encoder and decoder build the Huffman tree , hence the  codeword table dynamically. If the character to be transmitted is currently present in the tree its  codeword is determined and sent in the normal way . If the character is not present then it is transmitted in its uncompressed  Form. Encoder updates Huffman tree either by increasing the frequency of the  occurrence or introducing a new character. Transmitted codeword is encoded in such a way that the receiver will be  able to determine the character being received and also carry out some  Modifications to its own copy of the tree  for new updated tree structure.
油
Arithmetic Coding  More complicated than Huffman coding  The first step is to divide the numeric range from 0 to 1 into a number of  different characters present in the message to be sent ( even termination  character) and the size of each segment by the probability of the related characters  The width of each segment being determined by the probability of related character  In static mode , the decoder knows the set  of characters that are present in the  encoded messages it receives as well as the segment to which each character has been assigned and its related range  Hence decoder follows same procedure as that of encoder  Complete message  - fragmented into multiple smaller strings  Each is encoded separately.  Resulting set of codewords sent as a block of floating point numbers each in a known format

More Related Content

Chapter%202%20 %20 Text%20compression(2)

  • 1. Text Compression Chapter 2
  • 2. To reduce the volume of data to be transmitted (text, fax, images) To reduce the bandwidth required for transmission and to reduce storage requirements (speech, audio, video). Why Compress ? In almost all multimedia applications, a technique known as compression is first applied to the source information prior to its Transmission. This is done either to reduce the volume of information to be transmitted text , fax and images or to reduce the bandwidth that is required for its transmission speech , audio and video.
  • 3. Compression How is compression possible? Redundancy in digital audio, image, and video data Properties of human perception Digital audio is a series of sample values image is a rectangular array of pixel values video is a sequence of images played out at a certain rate Neighboring sample values are correlated
  • 4. Adjacent audio samples are similar (predictive encoding) samples corresponding to silence (silence removal) In digital image, neighboring samples on a scanning line are normally similar (spatial redundancy) In digital video, in addition to spatial redundancy, neighboring images in a video sequence may be similar (temporal redundancy) Redundancy
  • 5. Human Perception Factors Compressed version of digital audio, image, video need not represent the original information exactly Perception sensitivities are different for different signal patterns Human eye is less sensitive to the higher spatial frequency components than the lower frequencies (transform coding)
  • 6. Compression Principles Source encoders and destination decoders Lossless and lossy compression Entropy encoding Source encoding
  • 7. Source Encoders and destination decoders Prior to transmitting the source information relating to a multimedia application , a compression algorithm is applied to it. This implies that in order for the destination to reproduce the original source information or in some instances, a nearly exact copy of it a matching decompression algorithm must be applied to it. The application of the compression algorithm is the main function carried out by the source encoder and the decompression algorithm is carried out by the destination decoder .
  • 8. Source Encoders and destination decoders In applications which involve two computers communicating with each other, the time required to perform the compression and decompression algorithm is not always critical. So both algorithms are normally implemented in software within the two computers. Source information Source encoder program Destination Decoder Program Copy of Source Information Network Source encoder / destination decoder : Software only Source Computer Destination Computer
  • 9. Source Encoders and destination decoders In other applications, however the time required to perform the compression and decompression algorithms in software is not acceptable and instead the two algorithms must be performed by special processors in separate units . Source information Source encoder Processor Destination Decoder Processor Copy of Source Information Network Source Computer Destination Computer Source encoder / destination decoder : Special processors/hardware
  • 10. Classification Lossless compression lossless compression for legal and medical documents, computer programs exploit only data redundancy Lossy compression digital audio, image, video where some errors or loss can be tolerated exploit both data redundancy and human perception properties Constant bit rate versus variable bit rate coding
  • 11. Lossless compression Lossless compression algorithm the aim is to reduce the amount of source Information to be transmitted in such a way that, when the compressed information Is decompressed , there is no loss of information. Lossless compression is said therefore , to be reversible. An example application of lossless compression is for the transfer over a network of a text file since , in such applications , it it normally imperative that No part of the source information is lost during either the compression or decompression operations.
  • 12. Lossy compression The aim of lossy compression algorithm is normally not to reproduce an exact copy of the source information after decomposition but rather a version of it which is perceived by the recipient as a true copy (approx) Example : digitized images and audio and video streams. In such cases, the sensitivity of the human eye or ear is such that any fine details that may be missing from the original source signal after decompression are not detectable.
  • 13. Entropy Encoding Entropy encoding is lossless and independent of the type of information that is being compressed . It is concerned solely with how the information is represented. Run-length Encoding Typical applications of this type of encoding are when the source information Comprises long substrings of the same character or binary digit. Instead of transmitting in the form of independent code words or bits , it is transmitted in the form of a different set of codewords which indicate bits as well as number of bits In the substring. Example : input : 00000001111111110000011. output : 0,7,1,10,0,5,1,2
  • 14. Statistical Encoding In this technique, patterns of bits (word) or that are more frequent are recorded using shorter codes. It uses a set of variable length codewords with the shortest codewords used to represent the most frequently occurring symbols. For example : In a string of text , the character A may occurs more frequently than say the character P which occurs more frequently than the character Z , and so on Statistical encoding exploits this property by using a set of variable length Codewords With the shortest codewords used to represent the most frequently occurring symbols.
  • 15. Statistical encoding Two steps identifying most frequent bit or byte patterns in data coding these patterns with fewer bits than initially represented Code-book a table of correspondence between the initial patterns and their new code tradeoff: more bits for other patterns Huffman encoding frequency of occurrences calculated for each octet optimal code table generated based upon frequency of occurrences uniquely decipherable because of prefix property
  • 16. In practice , the use of variable-length codewords is not quite as straight forward as it first appears. Clearly as with run-length encoding , the destination must know the set of codewords being used by the source. With variable length codewords , however in order for the decoding operation to be carried out correctly , it is necessary to ensure that a shorter codeword in the set does not form the start of a longer codeword otherwise the decoder will interpret the string on the wrong codeword boundaries. A codeword set that avoids this happening is said to process the prefix and an Encoding scheme that generates codewords that have this property is the Huffman encoding algorithm.
  • 17. Huffman Encoding Different characters need not be encoded with the same number of bits. With the help of the knowledge of frequency occurrence of characters, the Huffman encoding algorithm provides the more frequent characters with the code having lesser number of bits. An Encoding scheme that generates code words that have prefix property is called the HUFFMAN ENCODING ALGORITHM
  • 18. Huffman Encoding Eg: Consider weights 2, 4, 6, 7, 7, 9. First the algorithms repeatedly combines the smallest two weights to obtain shorter and shorter weight sequences. 2 , 4 , 6, 7, 7, 9 replaces 2 and 4 by 2 + 4 and calls for 6 , 6 , 7, 7, 9 which replaces 6 and 6 by 12 and calls for 7 , 7 , 9, 12 which calls for 9 , 12 , 14 which calls for 14, 21
  • 19. The character string to be transmitted is first analyzed and character types and their relative frequency determined . Create an unbalanced tree with some branches shorter than others The degree is a function of the relative frequency of occurrence of the characters : - larger spread unbalanced tree This is known as Huffman code tree
  • 20. Huffman code tree It is binary tree Branches assigned value 0 or 1 Root node base of the tree Branch node Point at which branch divides Leaf node termination point
  • 21. Static Huffman Encoding 7 00 7 01 9 10 6 111 2 1100 4 1101 14 21 9 12 7 7 6 6 2 4 0 1 0 0 0 0 1 1 1 1
  • 22. Decoding Algorithm End Is codeword already stored ? Begin Set CODEWORD to empty Read next bit from BITSTREAM and append to existing bits in CODEWORD Load matching ASCII Character into Receive _buffer All bits in BITSTREAM Processed n n
  • 23. Dynamic Huffman Coding Previous method requires both transmitter and the receiver to know the table of codewords relating to the data being transmitted . In this method encoder and decoder build the Huffman tree , hence the codeword table dynamically. If the character to be transmitted is currently present in the tree its codeword is determined and sent in the normal way . If the character is not present then it is transmitted in its uncompressed Form. Encoder updates Huffman tree either by increasing the frequency of the occurrence or introducing a new character. Transmitted codeword is encoded in such a way that the receiver will be able to determine the character being received and also carry out some Modifications to its own copy of the tree for new updated tree structure.
  • 24.
  • 25. Arithmetic Coding More complicated than Huffman coding The first step is to divide the numeric range from 0 to 1 into a number of different characters present in the message to be sent ( even termination character) and the size of each segment by the probability of the related characters The width of each segment being determined by the probability of related character In static mode , the decoder knows the set of characters that are present in the encoded messages it receives as well as the segment to which each character has been assigned and its related range Hence decoder follows same procedure as that of encoder Complete message - fragmented into multiple smaller strings Each is encoded separately. Resulting set of codewords sent as a block of floating point numbers each in a known format