[Jump to start of article]

Home > System Issues > Simple Lossless Compression

Simple Lossless Compression

Data compression conceptually follows three principles:

  1. Find repeating patterns in a file.
  2. Replace these patterns with a reference to a dictionary entry.
  3. Create a dictionary of the repeating patterns.

The reduction in file size comes about due to replacing repeating sequences by the dictionary reference. However the references add back an overhead by increasing the file size.

Lossless and lossy data compression both use this basic structure where they differ is in the methods they use to implement it. To illustrate this and the next three articles will deal with two lossless methods and two lossy ones.



Example Text for Compression.

Though text is not generally compressed, and even when it is lossy methods are not used, I will uses a sample of text to outline the basic methods employed in data compression. The text I will use is part of Churchill's "We shall fight on the beaches" wartime speech:

We_shall_go_on_to_the_end,_
we_shall_fight_in_France,_
we_shall_fight_on_the_seas_and_oceans,_
we_shall_fight_with_growing_confidence_and_growing_strength_in_the_air,_
we_shall_defend_our_Island,_whatever_the_cost_may_be,_
we_shall_fight_on_the_beaches,_
we_shall_fight_on_the_landing_grounds,_
we_shall_fight_in_the_fields_and_in_the_streets,_
we_shall_fight_in_the_hills;_
we_shall_never_surrender,

Data compression treats letters, numbers, punctuation and spaces all as characters. To make this clear the spaces in the quotation have been replaced replaced with underscores. The quotation has a total of 391 characters including spaces and punctuation.

Finding Repeating Patterns

Each variation for data compression follows a set of rules for finding repeating patterns. For "Simple Lossless" The rules I am applying are:

  1. Start by finding the largest sequence of characters that repeats at least once.
  2. Look at smaller sequences until you reach three character sequences.

The second rule stopping at three characters is to reduce the complexity of the example. If this was a real method then you would look at two character sequences.



Repetition of "We Shall Fight"

The first thing that strikes you in the example is the repetition of the phrase "we shall fight". If we analyse further we find five variations of this pattern:

Code Phrase Count Size Notes
1 ,_we_shall_fight_on_the_ 3 24 Lines 3, 6 and7
2 ,_we_shall_fight_in_the_ 2 24 Lines 8 and 9
3 ,_we_shall_fight_ 2 17 Lines 2 and 4
4 _we_shall_ 2 10 Lines 5 and 10
5 We_shall_ 1 9 Line 1

This will give us the core of our dictionary. The last example is by itself due to it starting with a capital letter.

Reference Dictionary

The search analysis is not bothered with meaning but only with finding repeating sequences of characters. So systematically applying the two rules to the text will produce this reference dictionary:

Code Phrase Count Size Notes
0 ,_we_shall_fight_on_the_ 3 24 Lines 3, 6 and 7
1 ,_we_shall_fight_in_the_ 3 24 Lines 8 and 9
2 ,_we_shall_fight_ 2 17 Lines 2 and 4
3 _we_shall_ 2 10 Lines 5 and 10, not line 1
4 _growing_ 2 9 Line 4
5 _in_the_ 2 8 Lines 4 and 8
6 ever_ 2 5 In whatever and never
7 _and 3 4 Lines 3, 4 and 8
8 the_ 2 4 Lines 1 and 5
9 and 2 3 In Island and Landing
* end 3 3 Line 1 and in defend and surrender
$ nce 2 3 In France and confidence

The heading "Code" is the code tag used to mark the position of the "Phrase" in the file."Count" is the number of repetitions of a sequence, and "Size" is the number of characters in it.

The "We shall" in the first line is not in the dictionary as it starts with a capital letter.

The table shows that any repeating pattern is searched for, such as "ncs" in "France" and "confidence", and not just words.

Patterns like "the" appear in four different places in the dictionary due to the rule of looking for long repeating patterns so the rule optimises long patterns at the expense of shorter ones.



Compressed File

Each of phrases is replaced with a tag resulting in the compressed file looking like this:

We_shall_go_on_to_8*
2in_Fra$
0seas7_oceans
2with4confide$74strength5air,
3def*_our_Isl9,_what68cost_may_be
0beaches
0l9ing_grounds
1fields75streets
1hills;

This looks very different from the original file and without the dictionary it would be impossible to read. However if you systematically replace the codes with the equivalent phrases in the dictionary the original file is restored. This is what lossless data compression means.

Compression Achieved

The file that is output after compression consists of the reduced file with the code tags inserted plus the dictionary of phrases and the codes that now represent them.

The original file consisted of 391 characters of which 259 have been taken out leaving 132 characters. However to mark the phases taken out 27 code tags have been inserted which means the file is 159 characters long.

There is also the dictionary which has 12 entries and hence code tags and 114 characters for the phrases giving a total for the dictionary of 126 characters.

Therefore the whole file is 285 characters long. Compared with the original file of 391 characters this is compressed to 73% of its original size.

This was straightforward but with more sophisticated rules there can be greater compression.

Understanding How The Techniques Work

To understand how the techniques work there are a series of articles on both techniques: