際際滷

際際滷Share a Scribd company logo
Com S 229 Spring 2007 Dr. Markus Lumpe
1
Study Project: Data Compression
In 1952 David A. Huffman published the paper A Method for the Construction of
Minimum-Redundancy Codes in which he defined a greedy algorithm that constructs an
optimal prefix code called a Huffman code.
The algorithm begins with a set C of pairs (c,f), called HuffmanLeaf, where c is a
character of an underlying alphabet (e.g., ASCII or data type byte) and f denotes the
number of occurrences of c in a given source file. The set C is a partial order  over
pairs. Thus, if (t,2),(d,3)  C (i.e, (t,2), (d,3) are elements of C),
then we have that pair (t,2)  (d,3). Huffmans algorithm uses the frequency
f to identify the two least-frequent objects in C and merges them. The result of the
merger is a new object (i.e., HuffmanNode) whose frequency is the sum of the
frequency of the two objects being merged. In total, the algorithm performs a sequence
of |C| - 1 merging operations to create a Huffman tree.
How does it work?
Assume a file that contains 100 characters built out of 6 different letters with the
following frequency: a : 45, b : 13, c : 12, d : 16, e : 9, and f : 5. The Huffman
algorithm creates a HuffmanTree as follows:
Com S 229 Spring 2007 Dr. Markus Lumpe
2
How we need to construct the Huffman codes. An edge connecting an internal node
with its children is labeled 0 if it is an edge to the left child, and 1 if it is an edge to
the right child. The Huffman code for a character c is the sequence of labels on the
edges connecting the root to the leaf for that character.
Therefore, we encode:
a: 0
b: 101
c: 100
d: 111
e: 1101
f: 1100
The compression ratio can be calculated as follows. We start with the ASCII encoding.
Every character in the encoding requires 8 bits. Thus, a text containing 100 characters
has a size of 800 bits, or 100 Byte. The number of bits required using the calculated
Huffman codes is 45*1+13*3+12*3+16*3+9*4+5*4 = 45+39+36+48+36+20 = 244,
28 Byte, which yields a compression ratio of 72%.
Savings of 20% to 90% are typical, but not guaranteed. In fact, Huffman compression is
less effective that Lempel-Ziv compression.
Com S 229 Spring 2007 Dr. Markus Lumpe
3
Compression: Design idea
Huffman codes are constructed in two phases: (i) an analysis phase in which the whole
source file is read to construct a map<unsigned char, long> of pairs
(Character, Frequency), and (ii) a coding phase in which a Huffman tree is
constructed to determine the Huffman codes.
In order to implement Huffman compression you need to solve the following sub-
problems:
1. Frequency collection, build the set C:
Start with an empty map<unsigned char, long> frequencies. You need
to process the whole input file, read all characters, and count their occurrences,
that is, frequencies[c]++. Start with an empty map<unsigned char,
long>.
2. Huffman tree construction:
Construct a Huffman tree according to the Huffman algorithm. The best data
type for this purpose is set<T>, where T is the base type of the set. T should be
a super type (or interface) for Huffman leaves and Huffman nodes. Also, to use
the set, you need to define a proper < operator for Huffman leaves and nodes.
3. Huffman code assignment:
Annotate the edges of the Huffman tree with 0 or 1. That is, define a scheme to
assign every leaf-node its corresponding Huffman code.
4. Huffman character/code mapping:
Build a map<unsigned byte, HuffmanCode> that contains mappings from
characters to HuffmanCodes. HuffmanCode is the data type you chose to
represent Huffman codes. You need to traverse the Huffman tree and visit all
leafs. Use object-oriented techniques to solve this problem.
Assume as source files plain binary files. That is, a character is represented as a
unsigned char. Huffman codes have flexible length. To facilitate their representation
it might be usefule, but not required, to use the data type string. For example, the
character e is represented as 0x65 of type unsigned char and its corresponding
Huffman code as 1101 of type string.
Your solution will require several different classes. For example, you will need at least, a
class for the analysis phase, the class HuffmanLeaf to represent leaves, the class
HuffmanNode to represent nodes, and the class HuffmanTree to represent Huffman
trees.
To test your application you can use the file sample.txt, which contains the text from
problem set 2. This file is 1996 Bytes long and contains 53 different characters. The
expected compression ratio is 41.27%, that is, we can represent 15968 Bits (1996
Bytes) of fixed-size code by at most 9384 Bits (or 1173 Bytes) of variable-size prefix
codes.
Com S 229 Spring 2007 Dr. Markus Lumpe
4
Week 1
Study the Huffman algorithm. Search the Internet for additional information. Develop a
basic class structure. Familiarize yourself with the container types map and set. Start
with simple tests. Prepare a catalogue of questions that need to be discussed in class.
Week 2
Implement the frequency analysis and the construction of the Huffman tree.
Week 3
Implement the compression.
Week 4
Implement the decompression.
Week 5
Code cleanup.
Deliverable
An operational compression/decompression C++ console application.
We will discuss ideas, problems, and possible solutions in class.
Lab sessions: 03/20, 03/22, and 03/27
Final submission deadline: Thursday, April 19, 2004, 4:10 p.m.
Ad

Recommended

Huffman coding01
Huffman coding01
Nv Thejaswini
Huffman Codes
Huffman Codes
Md. Shafiuzzaman Hira
Huffman coding || Huffman Tree
Huffman coding || Huffman Tree
Gurunadh Guru
Huffman analysis
Huffman analysis
Abubakar Sultan
Data Structure and Algorithms Huffman Coding Algorithm
Data Structure and Algorithms Huffman Coding Algorithm
ManishPrajapati78
Lecture 7
Lecture 7
Mohammed Khan
Python strings
Python strings
Mohammed Sikander
Suffix Tree and Suffix Array
Suffix Tree and Suffix Array
Harshit Agarwal
Lecft3data
Lecft3data
ali Hussien
Strings
Strings
Saranya saran
Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)
Danial Virk
Python tutorial
Python tutorial
Andrei Tomoroga
Strings in Python
Strings in Python
nitamhaske
27.2.9 lab regular expression tutorial
27.2.9 lab regular expression tutorial
Freddy Buena単o
Python overview
Python overview
Hemant Kumar Tiwary
Strings in Python
Strings in Python
Amisha Narsingani
Strings in c mrs.sowmya jyothi
Strings in c mrs.sowmya jyothi
Sowmya Jyothi
Parse Tree
Parse Tree
A. S. M. Shafi
Introduction To Programming with Python-4
Introduction To Programming with Python-4
Syed Farjad Zia Zaidi
C# String
C# String
Raghuveer Guthikonda
Csharp4 strings and_regular_expressions
Csharp4 strings and_regular_expressions
Abed Bukhari
Computer programming 2 Lesson 12
Computer programming 2 Lesson 12
MLG College of Learning, Inc
Maxbox starter20
Maxbox starter20
Max Kleiner
13string in c#
13string in c#
Sireesh K
Grep
Grep
Dr.M.Karthika parthasarathy
String handling
String handling
IyeTech - Pakistan
j001adcpresentation-2112170415 23.pdf
j001adcpresentation-2112170415 23.pdf
HarshSharma71048
Huffman Algorithm and its Application by Ekansh Agarwal
Huffman Algorithm and its Application by Ekansh Agarwal
Ekansh Agarwal
5c. huffman coding using greedy technique.pptx
5c. huffman coding using greedy technique.pptx
Suma Raj
Implementation of Lossless Compression Algorithms for Text Data
Implementation of Lossless Compression Algorithms for Text Data
BRNSSPublicationHubI

More Related Content

What's hot (18)

Lecft3data
Lecft3data
ali Hussien
Strings
Strings
Saranya saran
Strings In OOP(Object oriented programming)
Strings In OOP(Object oriented programming)
Danial Virk
Python tutorial
Python tutorial
Andrei Tomoroga
Strings in Python
Strings in Python
nitamhaske
27.2.9 lab regular expression tutorial
27.2.9 lab regular expression tutorial
Freddy Buena単o
Python overview
Python overview
Hemant Kumar Tiwary
Strings in Python
Strings in Python
Amisha Narsingani
Strings in c mrs.sowmya jyothi
Strings in c mrs.sowmya jyothi
Sowmya Jyothi
Parse Tree
Parse Tree
A. S. M. Shafi
Introduction To Programming with Python-4
Introduction To Programming with Python-4
Syed Farjad Zia Zaidi
C# String
C# String
Raghuveer Guthikonda
Csharp4 strings and_regular_expressions
Csharp4 strings and_regular_expressions
Abed Bukhari
Computer programming 2 Lesson 12
Computer programming 2 Lesson 12
MLG College of Learning, Inc
Maxbox starter20
Maxbox starter20
Max Kleiner
13string in c#
13string in c#
Sireesh K
Grep
Grep
Dr.M.Karthika parthasarathy
String handling
String handling
IyeTech - Pakistan

Similar to Huffman (20)

j001adcpresentation-2112170415 23.pdf
j001adcpresentation-2112170415 23.pdf
HarshSharma71048
Huffman Algorithm and its Application by Ekansh Agarwal
Huffman Algorithm and its Application by Ekansh Agarwal
Ekansh Agarwal
5c. huffman coding using greedy technique.pptx
5c. huffman coding using greedy technique.pptx
Suma Raj
Implementation of Lossless Compression Algorithms for Text Data
Implementation of Lossless Compression Algorithms for Text Data
BRNSSPublicationHubI
Greedy Algorithms Huffman Coding.ppt
Greedy Algorithms Huffman Coding.ppt
Ruchika Sinha
DAA PPT.pptx
DAA PPT.pptx
SintooChauhan6
Huffman codes
Huffman codes
Nargis Ehsan
Huffman ppt
Huffman ppt
ALexHunter69
Huffman Coding is a technique of compressing data
Huffman Coding is a technique of compressing data
Kumari99
Huffman Algorithm for File Compression.pptx
Huffman Algorithm for File Compression.pptx
kazmijaffar890
Huffman coding || Huffman Tree
Huffman coding || Huffman Tree
SatishKumarInumarthi
Huffman Encoding Algorithm - Concepts and Example
Huffman Encoding Algorithm - Concepts and Example
MaryJacob24
Hufman coding basic
Hufman coding basic
radthees
Lec32
Lec32
anithabalaprabhu
Farhana shaikh webinar_huffman coding
Farhana shaikh webinar_huffman coding
Farhana Shaikh
DSA Presentetion Huffman tree.pdf
DSA Presentetion Huffman tree.pdf
GaneshPawar819187
hufman coding for compression algorithm.ppt
hufman coding for compression algorithm.ppt
alvishi254
hufman code presentation and how to compress data using hufman code
hufman code presentation and how to compress data using hufman code
ssuser6059cf
compression & huffman coder problem .ppt
compression & huffman coder problem .ppt
pradeepkumar465177
Huffman Text Compression Technique
Huffman Text Compression Technique
Universitas Pembangunan Panca Budi
j001adcpresentation-2112170415 23.pdf
j001adcpresentation-2112170415 23.pdf
HarshSharma71048
Huffman Algorithm and its Application by Ekansh Agarwal
Huffman Algorithm and its Application by Ekansh Agarwal
Ekansh Agarwal
5c. huffman coding using greedy technique.pptx
5c. huffman coding using greedy technique.pptx
Suma Raj
Implementation of Lossless Compression Algorithms for Text Data
Implementation of Lossless Compression Algorithms for Text Data
BRNSSPublicationHubI
Greedy Algorithms Huffman Coding.ppt
Greedy Algorithms Huffman Coding.ppt
Ruchika Sinha
Huffman Coding is a technique of compressing data
Huffman Coding is a technique of compressing data
Kumari99
Huffman Algorithm for File Compression.pptx
Huffman Algorithm for File Compression.pptx
kazmijaffar890
Huffman Encoding Algorithm - Concepts and Example
Huffman Encoding Algorithm - Concepts and Example
MaryJacob24
Hufman coding basic
Hufman coding basic
radthees
Farhana shaikh webinar_huffman coding
Farhana shaikh webinar_huffman coding
Farhana Shaikh
DSA Presentetion Huffman tree.pdf
DSA Presentetion Huffman tree.pdf
GaneshPawar819187
hufman coding for compression algorithm.ppt
hufman coding for compression algorithm.ppt
alvishi254
hufman code presentation and how to compress data using hufman code
hufman code presentation and how to compress data using hufman code
ssuser6059cf
compression & huffman coder problem .ppt
compression & huffman coder problem .ppt
pradeepkumar465177
Ad

Recently uploaded (20)

Introduction to Natural Language Processing - Stages in NLP Pipeline, Challen...
Introduction to Natural Language Processing - Stages in NLP Pipeline, Challen...
resming1
Industrial internet of things IOT Week-3.pptx
Industrial internet of things IOT Week-3.pptx
KNaveenKumarECE
20CE404-Soil Mechanics - 際際滷 Share PPT
20CE404-Soil Mechanics - 際際滷 Share PPT
saravananr808639
Engineering Mechanics Introduction and its Application
Engineering Mechanics Introduction and its Application
Sakthivel M
DESIGN OF REINFORCED CONCRETE ELEMENTS S
DESIGN OF REINFORCED CONCRETE ELEMENTS S
prabhusp8
Structured Programming with C++ :: Kjell Backman
Structured Programming with C++ :: Kjell Backman
Shabista Imam
Machine Learning - Classification Algorithms
Machine Learning - Classification Algorithms
resming1
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
ravicivil
Stay Safe Women Security Android App Project Report.pdf
Stay Safe Women Security Android App Project Report.pdf
Kamal Acharya
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
djiceramil
Fundamentals of Digital Design_Class_12th April.pptx
Fundamentals of Digital Design_Class_12th April.pptx
drdebarshi1993
OCS Group SG - HPHT Well Design and Operation - SN.pdf
OCS Group SG - HPHT Well Design and Operation - SN.pdf
Muanisa Waras
Development of Portable Biomass Briquetting Machine (S, A & D)-1.pptx
Development of Portable Biomass Briquetting Machine (S, A & D)-1.pptx
aniket862935
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
ijab2
David Boutry - Mentors Junior Developers
David Boutry - Mentors Junior Developers
David Boutry
60 Years and Beyond eBook 1234567891.pdf
60 Years and Beyond eBook 1234567891.pdf
waseemalazzeh
Industry 4.o the fourth revolutionWeek-2.pptx
Industry 4.o the fourth revolutionWeek-2.pptx
KNaveenKumarECE
Learning Types of Machine Learning Supervised Learning Unsupervised UNI...
Learning Types of Machine Learning Supervised Learning Unsupervised UNI...
23Q95A6706
Center Enamel can Provide Aluminum Dome Roofs for diesel tank.docx
Center Enamel can Provide Aluminum Dome Roofs for diesel tank.docx
CenterEnamel
Pavement and its types, Application of rigid and Flexible Pavements
Pavement and its types, Application of rigid and Flexible Pavements
Sakthivel M
Introduction to Natural Language Processing - Stages in NLP Pipeline, Challen...
Introduction to Natural Language Processing - Stages in NLP Pipeline, Challen...
resming1
Industrial internet of things IOT Week-3.pptx
Industrial internet of things IOT Week-3.pptx
KNaveenKumarECE
20CE404-Soil Mechanics - 際際滷 Share PPT
20CE404-Soil Mechanics - 際際滷 Share PPT
saravananr808639
Engineering Mechanics Introduction and its Application
Engineering Mechanics Introduction and its Application
Sakthivel M
DESIGN OF REINFORCED CONCRETE ELEMENTS S
DESIGN OF REINFORCED CONCRETE ELEMENTS S
prabhusp8
Structured Programming with C++ :: Kjell Backman
Structured Programming with C++ :: Kjell Backman
Shabista Imam
Machine Learning - Classification Algorithms
Machine Learning - Classification Algorithms
resming1
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
ravicivil
Stay Safe Women Security Android App Project Report.pdf
Stay Safe Women Security Android App Project Report.pdf
Kamal Acharya
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
djiceramil
Fundamentals of Digital Design_Class_12th April.pptx
Fundamentals of Digital Design_Class_12th April.pptx
drdebarshi1993
OCS Group SG - HPHT Well Design and Operation - SN.pdf
OCS Group SG - HPHT Well Design and Operation - SN.pdf
Muanisa Waras
Development of Portable Biomass Briquetting Machine (S, A & D)-1.pptx
Development of Portable Biomass Briquetting Machine (S, A & D)-1.pptx
aniket862935
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
ijab2
David Boutry - Mentors Junior Developers
David Boutry - Mentors Junior Developers
David Boutry
60 Years and Beyond eBook 1234567891.pdf
60 Years and Beyond eBook 1234567891.pdf
waseemalazzeh
Industry 4.o the fourth revolutionWeek-2.pptx
Industry 4.o the fourth revolutionWeek-2.pptx
KNaveenKumarECE
Learning Types of Machine Learning Supervised Learning Unsupervised UNI...
Learning Types of Machine Learning Supervised Learning Unsupervised UNI...
23Q95A6706
Center Enamel can Provide Aluminum Dome Roofs for diesel tank.docx
Center Enamel can Provide Aluminum Dome Roofs for diesel tank.docx
CenterEnamel
Pavement and its types, Application of rigid and Flexible Pavements
Pavement and its types, Application of rigid and Flexible Pavements
Sakthivel M
Ad

Huffman

  • 1. Com S 229 Spring 2007 Dr. Markus Lumpe 1 Study Project: Data Compression In 1952 David A. Huffman published the paper A Method for the Construction of Minimum-Redundancy Codes in which he defined a greedy algorithm that constructs an optimal prefix code called a Huffman code. The algorithm begins with a set C of pairs (c,f), called HuffmanLeaf, where c is a character of an underlying alphabet (e.g., ASCII or data type byte) and f denotes the number of occurrences of c in a given source file. The set C is a partial order over pairs. Thus, if (t,2),(d,3) C (i.e, (t,2), (d,3) are elements of C), then we have that pair (t,2) (d,3). Huffmans algorithm uses the frequency f to identify the two least-frequent objects in C and merges them. The result of the merger is a new object (i.e., HuffmanNode) whose frequency is the sum of the frequency of the two objects being merged. In total, the algorithm performs a sequence of |C| - 1 merging operations to create a Huffman tree. How does it work? Assume a file that contains 100 characters built out of 6 different letters with the following frequency: a : 45, b : 13, c : 12, d : 16, e : 9, and f : 5. The Huffman algorithm creates a HuffmanTree as follows:
  • 2. Com S 229 Spring 2007 Dr. Markus Lumpe 2 How we need to construct the Huffman codes. An edge connecting an internal node with its children is labeled 0 if it is an edge to the left child, and 1 if it is an edge to the right child. The Huffman code for a character c is the sequence of labels on the edges connecting the root to the leaf for that character. Therefore, we encode: a: 0 b: 101 c: 100 d: 111 e: 1101 f: 1100 The compression ratio can be calculated as follows. We start with the ASCII encoding. Every character in the encoding requires 8 bits. Thus, a text containing 100 characters has a size of 800 bits, or 100 Byte. The number of bits required using the calculated Huffman codes is 45*1+13*3+12*3+16*3+9*4+5*4 = 45+39+36+48+36+20 = 244, 28 Byte, which yields a compression ratio of 72%. Savings of 20% to 90% are typical, but not guaranteed. In fact, Huffman compression is less effective that Lempel-Ziv compression.
  • 3. Com S 229 Spring 2007 Dr. Markus Lumpe 3 Compression: Design idea Huffman codes are constructed in two phases: (i) an analysis phase in which the whole source file is read to construct a map<unsigned char, long> of pairs (Character, Frequency), and (ii) a coding phase in which a Huffman tree is constructed to determine the Huffman codes. In order to implement Huffman compression you need to solve the following sub- problems: 1. Frequency collection, build the set C: Start with an empty map<unsigned char, long> frequencies. You need to process the whole input file, read all characters, and count their occurrences, that is, frequencies[c]++. Start with an empty map<unsigned char, long>. 2. Huffman tree construction: Construct a Huffman tree according to the Huffman algorithm. The best data type for this purpose is set<T>, where T is the base type of the set. T should be a super type (or interface) for Huffman leaves and Huffman nodes. Also, to use the set, you need to define a proper < operator for Huffman leaves and nodes. 3. Huffman code assignment: Annotate the edges of the Huffman tree with 0 or 1. That is, define a scheme to assign every leaf-node its corresponding Huffman code. 4. Huffman character/code mapping: Build a map<unsigned byte, HuffmanCode> that contains mappings from characters to HuffmanCodes. HuffmanCode is the data type you chose to represent Huffman codes. You need to traverse the Huffman tree and visit all leafs. Use object-oriented techniques to solve this problem. Assume as source files plain binary files. That is, a character is represented as a unsigned char. Huffman codes have flexible length. To facilitate their representation it might be usefule, but not required, to use the data type string. For example, the character e is represented as 0x65 of type unsigned char and its corresponding Huffman code as 1101 of type string. Your solution will require several different classes. For example, you will need at least, a class for the analysis phase, the class HuffmanLeaf to represent leaves, the class HuffmanNode to represent nodes, and the class HuffmanTree to represent Huffman trees. To test your application you can use the file sample.txt, which contains the text from problem set 2. This file is 1996 Bytes long and contains 53 different characters. The expected compression ratio is 41.27%, that is, we can represent 15968 Bits (1996 Bytes) of fixed-size code by at most 9384 Bits (or 1173 Bytes) of variable-size prefix codes.
  • 4. Com S 229 Spring 2007 Dr. Markus Lumpe 4 Week 1 Study the Huffman algorithm. Search the Internet for additional information. Develop a basic class structure. Familiarize yourself with the container types map and set. Start with simple tests. Prepare a catalogue of questions that need to be discussed in class. Week 2 Implement the frequency analysis and the construction of the Huffman tree. Week 3 Implement the compression. Week 4 Implement the decompression. Week 5 Code cleanup. Deliverable An operational compression/decompression C++ console application. We will discuss ideas, problems, and possible solutions in class. Lab sessions: 03/20, 03/22, and 03/27 Final submission deadline: Thursday, April 19, 2004, 4:10 p.m.