Compress Tools Library

scalion
God

Being bald is an art that only a god can master!

Posts: 571

Compress Tools Library Apr 27, 2020 22:02:23 GMT 1 infoliner, ventilo25, and 1 more like this

Quote

Post by scalion on Apr 27, 2020 22:02:23 GMT 1

I create it to use in my actual ultra-secret project. That provide compress/decompress using huffman codes.

The library source : CompressToolsLib.G32 (17.05 KB)

The compiled library : CompressToolsLib.lg32 (7.41 KB)

An example of using : Test CompressToolsLib.G32 (1.5 KB)

Have good day all.

- Update to add 2 new procedures : 01/05/2020

DictionnaryCompress(AdrTest, SizeTest, AdrC, CSize)
DictionnaryUncompress(AdrC, CSize, AdrU, USize)

For the moment this dictionnary dont use the Huffman codes but that work.

- Update to correcting bug (0xc00005) : 28/04/2020
Number of codes was written with value 1 to 256 that cause the problem with big data who need 256 codes
Now it writen-1 and read+1 : test ok

What is Huffman Code ?

Example : you have a string "abracadabra" that take 11 bytes, then 11*8 = 88 bits.

only letter "abrcd" are used. You remark that can be coded on 3 bits instead 8 bits, then 11*3 = 33 bits.

Go further : a appear 5 times, b and r 2 times, others 1 times.

You remark we can use only 1 bits for a and 3 for all others then whe have 5*1 + 6*3 = 23 bits !

It also mean that each code must not be a prefix of other longer code if we want to be able to read it.

a=0 b=110 r=111 d=100 c=101 then abracadabra give 0 110 111 0 101 0 100 0 110 111 0

That's what Huffman Code do !

In addition, the codes are optimized to the maximum so that the number of bits required per value is calculated so that the sum of the coded data is minimal

What is Dictionnay Compression ?

Example : again the string "abracadabra" who take 11 bytes, then 88 bits. "abra" appear 2 times, that mean we can use a code to say "repeat what found 9 bytes before for 4 bytes). We considere the data are a dictionnary and if a byte block appears more than 1 time, we code it with the offset (9 bytes before) and the block lenght (4 bytes).

In this case we indicate how many bits are necessary to store the offset and the lenght. We write this values with 4 bits (0 to 15) then we can indicate we need 1 to 16 bits to store each value of offset and lenght.

In case of "abra" with offset 9 and lenght 4 with use only 15 bits instead 32 bits of "abra":

Necessary Bits to store Offset	Value of offset	Necessary Bits to store Lenght	Value of Lenght
4 (we write 3)	9	3 (we write 2)	4
0011	1001	0010	100

The 4 bits used to store offset and lenght implicate the max value is 65536. Therefore the dictionary must be found in the 65536 bytes juste before the data pointer to be compressed or uncompressed. In practice we call this a "sliding windows". The most popular implementation of dictionnary compression is the Lempel-Ziv-Oberhumer algorithm (LZO) and its derivatives LZ77 LZ78 etc...

Why this library ?

Huffman codes is a standard compression used conjointly with many compression zlib,zip,jpeg etc...

Huffman dont really compress data, just use necessary memory to store information, that need the list of codes in header data.

To perform a real compression you will need to arrange data with some algorythm before use Huffman codes.

For example you can us simple repeat detection, or dictionnary compression.

Limitation :

This library is simply written for max 256 values (bytes data) that's what i need. But in fact a real Huffman Codes can work on more than 256. I will maybe write an extension to manage it in future.

How to use this library ?

- To encode use proc HuffmanCodesCompress(Adr,Size,AdrCompressed,SizeCompressed)

Indicate the adress ans size of data to encode then AdrCompressed and SizeCompressed are updated with the memory location of encoded data

- To decode use Proc HuffmanCodesUncompress(Adr,Size,AdrUncompressed,SizeUncompressed)

Indicate the adress ans size of data to decode then AdrUncompressed and SizeUncompressed are updated with the memory location of decoded data

- To retrieve informations on encoded data use function Long Var=HuffmanCodesGet(Adr%, Info%, Optional Index%)
You pass the adress of encoded data and what you want to retriece with Info%.

When optional Index is expected note it's between 0 and the number of index -1.

- Info% can take one of this value :

Constant Name	Value	Description
CHC_VALUES	1	Return the number of entries in Huffman codes table
CHC_VALUE	2	Return the value of entries Index% (0 to Values-1)
CHC_CODE_LENGHT	3	Return the Huffman code lenght of entries Index% (0 to Values-1)
CHC_CODE	4	Return the Huffman code (in long value) of entries Index% (0 to Values-1)
CHC_HEADER_BIT_SIZE	5	Return the size in bits of Huffman codes informations
CHC_UNCOMPRESSED_SIZE	6	Return the size of original data
CHC_DATA_BIT_SIZE	7	Return the size in bits of encoded data

Last Edit: May 1, 2020 21:39:58 GMT 1 by scalion

Nicolas Rey (Sorry for my poor english i speak french in fact)

infoliner New Member Posts: 26	Compress Tools Library May 10, 2020 17:30:01 GMT 1 scalion likes this Quote Select Post Deselect Post Link to Post Member Give Gift Back to Top Post by infoliner on May 10, 2020 17:30:01 GMT 1 Thank you scalion. Why dont you use the GB32 compress command? And, are you working on a more efficient compression code as well? And last question, what is your super-secret project about?

scalion
God

Being bald is an art that only a god can master!

Posts: 571

Compress Tools Library Jan 20, 2021 13:32:05 GMT 1

Quote

Post by scalion on Jan 20, 2021 13:32:05 GMT 1

May 10, 2020 17:30:01 GMT 1 infoliner said:

Thank you scalion. Why dont you use the GB32 compress command? And, are you working on a more efficient compression code as well? And last question, what is your super-secret project about?

Hi Infoliner,

I try perform a compression working ultra strong, but it's not universal, it often depends on the type of input data, sometime i have best result , sometime it's equal or less...

I think a good way to have an good algorythm is to identify the types of data (text, image, sounds, etc.), refer to a static database representative of the type identified and use a header of compressed data with high describing, and use a short header for small files to lose a minimum of compression. Well... my ultra-secret project is an help editor, compilator for android and new image format...

Nicolas Rey (Sorry for my poor english i speak french in fact)

Post by scalion on Apr 27, 2020 22:02:23 GMT 1

Post by infoliner on May 10, 2020 17:30:01 GMT 1

Post by scalion on Jan 20, 2021 13:32:05 GMT 1