[Jump to start of article]

Home > System Issues > Lossy Compression (Part 1 of 2)

Lossy Compression (Pt 1)

An example of lossy data compression is the JPEG standard for storing pictures. The reason this standard is called "lossy" is because a picture can be saved into smaller and smaller files with on each occasion the image degrading with the structure still visible but the details getting lost. This means that when the file is recreated it is not identical to the original.

How does this happen? It is done by making similar types of pattern identical. The more aggressive the rules for making patterns similar then the smaller the file, but the larger the differences.

The JPEG standards and other standards are mathematically complex but use underlying basic principles. These principles can be understood by looking at how a text file would appear if logically similar techniques to lossy file compression are applied to it.

This article will emulate this by producing a lossy version of the Churchill war time speech "We shall fight on the beaches".

Lossy Repeating Patterns Rules

The key difference between Lossless and Lossy file compression is that in Lossy the exact sequence is not retained by near enough is used instead. For a computer this is done by applying rules in a similar manner to the two Lossless techniques. In fact Lossy techniques will retain exact patterns where possible.

The rules I am using for the Lossy technique are similar but different from the Complex Lossless.

  1. Start by finding the largest sequence of characters that nearly repeats at least once using these guidelines:
    • Ignore capitalisation when comparing strings.
    • Ignore changes in punctuation.
    • Ignore additional spaces.
    • Allow up to one character difference between sequences.
  2. Compare smaller sequences with larger ones to see if the larger one can be split following the same guidelines.
  3. Choose the most common sequence as the substitute for all the similar sequences.
  4. Look at smaller sequences until you reach three character sequences. (as before)
  5. Only accept a sequence if there it will result in an at least two character reduction in file size.

The fifth rule means that to be viable there must be 3 repetitions of a 3 or 4 character sequence. For all greater sequence sizes 2 repetitions are enough.

The "We Shall Fight" Sequences

Applying the new rules to the example text (see Simple Lossless) for the phrase "we shall fight" results in some considerable savings. These are the repetitions of the phrase:

Code Phrase Count Size Notes
1 ,_we_shall_fight_on_the_ 3 24 Lines 3, 6 and7
2 ,_we_shall_fight_in_the_ 2 24 Lines 8 and 9
3 ,_we_shall_fight_in_ 1 20 Line 2
4 ,_we_shall_fight_ 1 17 Line 4
5 ,_we_shall_ 1 11 Line 5
6 ;_we_shall_ 1 11 Line 10
7 We_shall_ 1 9 Line 1

Following the rules shows that first we find in three examples of the long sequence ",_we_shall_fight_on_the_". The next long sequence is two examples of ",_we_shall_fight_in_the_". Comparing the phrases shows they are similar with only one character difference. Therefore the first example will be the one used for all five instances.

The third phrase is ",_we_shall_fight_in_". The common factor of ",_we_shall_fight_" can be included in the previous entry the three characters of "in_" put back into the main text and the main phrase split.

The savings would be the 17 characters in the ",_we_shall_fight_" sequence. The cost would be a a new tag for the phrase and a new dictionary entry for "on_the_" of 7 characters plus 1 for the code and 5 tags in the text making a total of 14 characters. This shows it is worthwhile as overall 3 characters are saved which is above the target of at least 2.

The fourth phrase is ",_we_shall_fight_" just requires a tag in the text.

The fifth phrase is ",_we_shall_" consisting of 11 characters. The phrase ",_we_shall_fight_" can be split into two. If so it will require 1 character for the code for "fight_" and 8 characters for new tags in the text. This is a saving of 2 characters and is accepted.

The sixth phrase is ";_we_shall_" consisting of 11 characters with different punctuation. Following the rule the punctuation is set to the common version of a comma and is substituted.

Lastly the seventh variation "We_shall_" is included in the common sequence by ignoring the capital letter and including a space and a comma.

As a result of this process we have three phrases to include in the dictionary: ",_we_shall_", "fight_", and "on_the_".

The second part of this article deals with creating the compressed file and is at Lossy Compression (Part 2 of 2).

Understanding How The Techniques Work

To understand how the techniques work there are a series of articles on both techniques: