Compression

The compression classes in .NET work very well, provided you know what kind of data they work well on.

  1. You can’t compress something that’s already compressed.
  2. If the input is too small, compression adds a fixed overhead that the savings in the data stream can’t make up for.
  3. If you save the result to disk, the file system has a cluster size, which is the minimum space a file can occupy. So if what you’re compressing is small and you are saving it to disk, it may be all for naught anyway.
  4. You get better compression with data that uses the same characters/bytes over and over again.
  5. With a lot of data made up of many distinct records, compressing the whole file once works better than compressing each record separately.

Now that you know that, you can understand the test results.

Test data is created by appending GUID strings until the desired length is reached…

Length (ratio to uncompressed length)
# UncompressedLen After Deflate->b64 After Ub64->Gzip
1 1800 2588 (1.4378) 3788 (2.1044)
2 3600 3964 (1.1011) 5164 (1.4344)
3 5400 5212 (0.9652) 6412 (1.1874)
4 7200 6528 (0.9067) 7728 (1.0733)
5 9000 7784 (0.8649) 8984 (0.9982)
6 10800 9072 (0.8400) 10272 (0.9511)
7 12600 10360 (0.8222) 11560 (0.9175)
8 14400 11612 (0.8064) 12812 (0.8897)
9 16200 12920 (0.7975) 14120 (0.8716)
10 18000 14204 (0.7891) 15404 (0.8558)
11 19800 15468 (0.7812) 16668 (0.8418)
12 21600 16788 (0.7772) 17988 (0.8328)
13 23400 18044 (0.7711) 19244 (0.8224)
14 25200 19360 (0.7683) 20560 (0.8159)
15 27000 20644 (0.7646) 21844 (0.8090)
16 28800 21968 (0.7628) 23168 (0.8044)
17 30600 23280 (0.7608) 24480 (0.8000)
18 32400 24620 (0.7599) 25820 (0.7969)
19 34200 25948 (0.7587) 27148 (0.7938)
20 36000 27244 (0.7568) 28444 (0.7901)
21 37800 28564 (0.7557) 29764 (0.7874)
22 39600 29852 (0.7538) 31052 (0.7841)
23 41400 31212 (0.7539) 32412 (0.7829)
24 43200 32552 (0.7535) 33752 (0.7813)
25 45000 33880 (0.7529) 35080 (0.7796)
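
For anyone who wants to try something similar, here is a rough sketch of how such test data could be built and measured. It is an approximation in Python, not the .NET code behind the table. It assumes the first compressed column means "raw deflate, then base64", and it does not attempt the second column. The helper names `make_guid_text` and `deflate_then_base64` are hypothetical.

```python
import base64
import uuid
import zlib


def make_guid_text(length: int) -> str:
    """Append GUID strings until the text reaches the requested length."""
    parts = []
    size = 0
    while size < length:
        guid = str(uuid.uuid4())  # 36 characters: 32 hex digits and 4 hyphens
        parts.append(guid)
        size += len(guid)
    return "".join(parts)[:length]


def deflate_then_base64(text: str) -> str:
    """Raw deflate (no zlib or gzip wrapper), then base64 so the result is text."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits selects raw deflate
    raw = compressor.compress(text.encode("ascii")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")


# Same step size as the table: 1800 characters is exactly 50 GUID strings.
for step in range(1, 6):
    text = make_guid_text(1800 * step)
    encoded = deflate_then_base64(text)
    print(step, len(text), len(encoded), round(len(encoded) / len(text), 4))
```

The GUIDs are random, so the sizes change a little from run to run, and a different deflate implementation will have a different fixed cost than the one measured above.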

The compression probably comes from the repeating ‘-’ in the GUID, and from the fact that the number of distinct bytes is limited to 17 (-0123456789abcdef). That’s why the compression ratio levels off after a certain size. To represent 17 different symbols you need just 5 bits, instead of the 8 each character normally occupies. The compressor might even use fewer bits for the ‘-’ because it is repeated the most. But below the size where the savings break even with the fixed compression overhead (an uncompressed length of roughly 5,400 for the first compressed column in the table and 9,000 for the second), it’s not even worth compressing.
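
The bit-counting argument can be made a little more precise. The sketch below is a back-of-the-envelope estimate, not a model of what the .NET compressor actually does. It assumes each 36-character GUID string has 4 hyphens and 32 hex digits spread evenly over the 16 hex values, which is only approximately true. It also assumes the compressed sizes in the table are base64 text, which grows the data by 4/3.

```python
import math

HEX_DIGITS = "0123456789abcdef"

# Assumed symbol frequencies in one 36-character GUID string:
# 4 hyphens, and 32 hex digits spread evenly over 16 values.
probabilities = {"-": 4 / 36}
probabilities.update({digit: (32 / 16) / 36 for digit in HEX_DIGITS})

# Fixed-width code: enough whole bits to number 17 symbols.
fixed_bits = math.ceil(math.log2(len(probabilities)))

# Order-0 entropy: the average bits per symbol a per-symbol code can approach.
entropy_bits = -sum(p * math.log2(p) for p in probabilities.values())

# Base64 emits 4 characters per 3 bytes, so it multiplies the size by 4/3.
ratio_before_base64 = entropy_bits / 8
ratio_after_base64 = ratio_before_base64 * 4 / 3

print(fixed_bits, round(entropy_bits, 3))
print(round(ratio_before_base64, 3), round(ratio_after_base64, 3))
```

Under those assumptions a symbol needs about 4.06 bits on average, which puts the best possible ratio after base64 at about 0.68. The 0.75 that the table settles on is not far from that, so the plateau looks like the alphabet size showing through and not a weakness of the compressor.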

How do individual English words compress?

The next test data was made from the addresses in Microsoft’s Northwind database, so you have a more real-world result to compare with. What if you decided to store your address column in the database compressed?

No UncompressedLen Deflate>b64_Len Ub64>Gzip_Len
1 5044 6920 (1.3719) 9080 (1.8002)
Avg Len / Address 56 76

If an address is comparable to a GUID in its makeup (the ‘-’ characters are like spaces), then compare the average length of an address to the compressed sizes in the GUID test section. There aren’t enough savings per character to overcome the fixed cost of the compression header. Please, no stupid trolls about why you would do this in a database column because then you can’t run queries (you can, you just need to compress it first). This is a purely artificial test, to see the limits of using compression technology in a transactional context.
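
To see where that fixed cost comes from on a single short value, here is a small sketch in Python. The address is a stand-in borrowed from the geocoding sample later in this post, not a Northwind row, and the `raw_deflate` helper is hypothetical. Both compressions use level 9 so the deflate stream inside the gzip output should be the same as the raw one.

```python
import base64
import gzip
import zlib

# Stand-in value; the Northwind rows themselves are not reproduced here.
address = "1 Brewers Way, Milwaukee, WI 53214, USA".encode("ascii")


def raw_deflate(data: bytes) -> bytes:
    """Deflate with no wrapper at all (negative wbits means no header or trailer)."""
    compressor = zlib.compressobj(level=9, wbits=-15)
    return compressor.compress(data) + compressor.flush()


deflated = raw_deflate(address)
gzipped = gzip.compress(address, compresslevel=9)

# gzip wraps a deflate stream in a header (at least 10 bytes) and an 8-byte trailer.
wrapper_cost = len(gzipped) - len(deflated)

for label, payload in (("deflate", deflated), ("gzip", gzipped)):
    as_text = base64.b64encode(payload)
    print(label, "input:", len(address), "compressed:", len(payload), "base64:", len(as_text))
print("gzip wrapper bytes:", wrapper_cost)
```

A value this short has almost nothing in it that repeats, so the compressor can only add its own framing, and base64 then adds a third on top of that.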

How does XML compress?

Google Maps Geocoding format:
<LocationResults xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Components>
    <AddressComponent>
      <Category>
        <string>street_number</string>
      </Category>
      <Long>1</Long>
      <Short>1</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>route</string>
      </Category>
      <Long>Brewers Way</Long>
      <Short>Brewers Way</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>locality</string>
        <string>political</string>
      </Category>
      <Long>Milwaukee</Long>
      <Short>Milwaukee</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>administrative_area_level_2</string>
        <string>political</string>
      </Category>
      <Long>Milwaukee County</Long>
      <Short>Milwaukee County</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>administrative_area_level_1</string>
        <string>political</string>
      </Category>
      <Long>Wisconsin</Long>
      <Short>WI</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>country</string>
        <string>political</string>
      </Category>
      <Long>United States</Long>
      <Short>US</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>postal_code</string>
      </Category>
      <Long>53214</Long>
      <Short>53214</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>postal_code_suffix</string>
      </Category>
      <Long>3655</Long>
      <Short>3655</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>CompositeLocality</string>
      </Category>
      <Long>Milwaukee, Milwaukee County, WI, US</Long>
      <Short>Milwaukee, Milwaukee County, WI, US</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>CompositeAdministrative_area_level_2</string>
      </Category>
      <Long>Milwaukee County, WI, US</Long>
      <Short>Milwaukee County, WI, US</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>CompositeAdministrative_area_level_1</string>
      </Category>
      <Long>WI, US</Long>
      <Short>WI, US</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>CompositeCountry</string>
      </Category>
      <Long>US</Long>
      <Short>US</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>App3</string>
      </Category>
      <Long>Miller Park, Milwaukee, MI</Long>
      <Short>Miller Park, Milwaukee, MI</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>App2</string>
      </Category>
      <Long>1 Brewers Way, Milwaukee, WI 53214</Long>
      <Short>1 Brewers Way, Milwaukee, WI 53214</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>short.userfailgeocode,long.suggestsuccess</string>
      </Category>
      <Long>1 Brewers Way, Milwaukee, WI 53214</Long>
      <Short>Miller Park, Milwaukee, MI</Short>
    </AddressComponent>
    <AddressComponent>
      <Category>
        <string>App1</string>
      </Category>
      <Long>1 Brewers Way, Milwaukee, WI 53214, USA</Long>
      <Short>1 Brewers Way, Milwaukee, WI 53214, USA</Short>
    </AddressComponent>
  </Components>
  <IsCached>true</IsCached>
  <Lat>43.028078</Lat>
  <Lon>-87.9712117</Lon>
  <Precision>ROOFTOP</Precision>
  <PlaceID>ChIJdQQ49FgaBYgRMG8K0UXaOfw</PlaceID>
  <Formatted>1 Brewers Way, Milwaukee, WI 53214, USA</Formatted>
  <FormattedAs>premise</FormattedAs>
</LocationResults>

Instead of compressing an address, what if you wanted to compress the geocoding data from Google, which is in XML format? Below is a simulation of Google geocode results in XML being stored compressed in your database.

No UncompressedLen Deflate>b64_Len Ub64>Gzip_Len
1 1,042,433 274,604 (0.2634) 283,436 (0.2719)
Avg Len / Geocode 2832 746
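
A likely reason the XML does so much better is that most of each document is the same handful of tags repeated over and over. The sketch below illustrates that in Python, under assumptions: it rebuilds a cut-down document using the element names from the sample above with only six components, and it uses raw deflate followed by base64 as a stand-in for the .NET pipeline, so its ratio will not match the table.

```python
import base64
import zlib

# Element names are taken from the sample; the values are a small illustrative subset.
components = [
    ("street_number", "1", "1"),
    ("route", "Brewers Way", "Brewers Way"),
    ("locality", "Milwaukee", "Milwaukee"),
    ("administrative_area_level_1", "Wisconsin", "WI"),
    ("country", "United States", "US"),
    ("postal_code", "53214", "53214"),
]


def component_xml(category: str, long_name: str, short_name: str) -> str:
    """Render one AddressComponent element in the same layout as the sample."""
    return (
        "    <AddressComponent>\n"
        "      <Category>\n"
        f"        <string>{category}</string>\n"
        "      </Category>\n"
        f"      <Long>{long_name}</Long>\n"
        f"      <Short>{short_name}</Short>\n"
        "    </AddressComponent>\n"
    )


body = "".join(component_xml(*c) for c in components)
document = f"<LocationResults>\n  <Components>\n{body}  </Components>\n</LocationResults>\n"

raw = document.encode("utf-8")
compressor = zlib.compressobj(wbits=-15)  # raw deflate, no wrapper
deflated = compressor.compress(raw) + compressor.flush()
encoded = base64.b64encode(deflated)

# Everything that is not one of the values above is tags or indentation.
value_chars = sum(len(value) for c in components for value in c)
markup_share = (len(document) - value_chars) / len(document)

print(len(raw), len(encoded), round(len(encoded) / len(raw), 4))
print("share of characters that are markup or whitespace:", round(markup_share, 2))
```

The more components a document has, the more of it is a repeat of something the compressor has already seen, which fits the much lower ratio measured on the full geocoding results.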
