Lossy Compression (Pt 2)
The first part of this article dealt with part of creating the dictionary and is at Lossy Compression (Part 1 of 2).
Other Sequences
We can now check for other phrases and the following sequences for consideration show up.
Code | Phrase | Count | Size | Notes |
---|---|---|---|---|
8 | _growing_ | 2 | 9 | Line 4 |
9 | _in_the_ | 2 | 8 | Lines 4 and 8 |
10 | ever_ | 2 | 5 | In whatever and never |
11 | _and_ | 2 | 5 | Lines 3 and 8 |
12 | _and | 1 | 4 | Line 4 |
13 | the_ | 2 | 4 | Lines 1 and 5 |
14 | and | 2 | 3 | In Island and Landing |
15 | end | 3 | 3 | Line 1 and in defend and surrender |
16 | nce | 2 | 3 | In France and confidence |
Checking the rest of the sequences shows that "_growing_" and "ever_" remain. The sequence "end" remains as it has 3 entries whereas "nce" is removed as it only has 2 entries.
The sequences "_in_the_" and "on_the_" from the "we shall fight" reduction can be combined as 7 entries of "on_the_" resulting in a saving of 7 characters.
The sequence "the_" forms part of the sequence "on_the_". The "on_the_" sequence would save 4 characters in the dictionary but require 7 new code tags. As a result the "the_" sequence is removed.
This leaves the 2 entries of "_and_", 1 entry of "_and" and 2 entries of "and". These are combined into the most common form.
The Lossy Dictionary
The dictionary looks like this:
Code | Phrase | Count | Size | Notes |
---|---|---|---|---|
0 | ,_we_shall_ | 10 | 11 | Lines 1 to 10 |
1 | _growing_ | 2 | 9 | Line 4 |
2 | on_the_ | 7 | 7 | Lines 1, 2, 4, 6, 7, 8 and 9 |
3 | fight_ | 7 | 6 | Lines 2, 3, 4, 6, 7, 8 and 9 |
4 | ever_ | 2 | 5 | In whatever and never |
5 | _and_ | 5 | 5 | Lines 3, 4, 8 and in Island and Landing |
6 | end | 3 | 3 | Line 1 and in defend and surrender |
This is a smaller dictionary then the Complex Lossless with very little redundancy.
The Compressed File
Each of phrases is replaced with its tag code resulting in the compressed file looking like this:
0go_on_to_the_6
03in_France
032seas5oceans
03with1confidence51strength_2_air
0def6_our_Isl5,_what4the_cost_may_be
032beaches
032l5ing_grounds
032fields52streets
032hills
0n4surr6er,
The Restored File
This is very different from the original file however if the phrases are substituted for the tags then an approximate, but readable version of the original file is restored.
,_we_shall_go_on_to_the_end,_
we_shall_fight_in_France,_
we_shall_fight_on_the_seas_and_oceans,_
we_shall_fight_with_growing_confidence_and__growing_strength_ on_the__air,_
we_shall_defend_our_Isl_and_,_whatever_the_cost_may_be,_
we_shall_fight_on_the_beaches,_
we_shall_fight_on_the_l_and_ing_grounds,_
we_shall_fight_on_the_fields_and_on_the_streets,_
we_shall_fight_on_the_hills,_
we_shall_never_surrender,
The differences are highlighted in the text and are:
- In line 1 the start of the sentence has a space and comma before it and is not capitalised.
- In Line 4 there is an extra space between "and" and "growing".
- In line 4, 8 and 9 the four instances of "in" have been changed to "on".
- In line 5 there is a space in and after "Island".
- In line 7 there are two spaces in "landing".
It is still recognisable as the original text. In many cases people would not spot the difference as many believe the key line is one of:
- "we shall fight them on the beaches"
- "we will fight on the beaches"
- "we will fight them on the beaches".
All of which are wrong but have the same feel to them.
Compression Achieved
The file that is output after compression consists of the reduced file with the tags inserted plus the dictionary of phrases and the codes that now represent them. The figures shown in brackets are the equivalents for the Complex Lossless compression.
The original file consisted of 391 characters of which 255 (233) have been taken out leaving 136 (158) characters. This is 26 characters more than the Simple Lossless compression. To mark the phrase positions 36 (35) tags have been inserted which means the file is 172 (193) characters long. Thus this approach has resulted in a smaller file of 21 characters.
The dictionary is also smaller. It has 7 (8) entries and 46 (50) characters for the phrases giving a total for the dictionary of 53 (58) characters. This is due to the consolidation of similar entries in the dictionary.
Therefore the whole file is 225 (251) characters long. Compared with the original file of 391 characters this is compression of 58% (64%) of the original size.
Comparison with Lossless Techniques
Even with this text based example more compression has been obtained with very little loss in quality. However in practice lossy techniques are not applied to text due to the subtle changes in meaning that can occur. However the JPEG standard for pictures allows much greater compression levels and even at 10% of the original file very little is lost as people cannot perceive the differences.
Understanding How The Techniques Work
To understand how the techniques work there are a series of articles on both techniques:
- Data Compression is an overview of lossy and lossless data compression techniques.
- Simple Lossless Compression describes the basics for understanding data compression including finding patterns and creating dictionaries using a text based example.
- Complex Lossless Compression part 1 and part 2 covers how to create smaller lossless files, again using the same text example.