Move-to-front transform: Difference between revisions

Content deleted Content added
No edit summary
m →‎The transform: monospace
 
(10 intermediate revisions by 4 users not shown)
Line 3:
The '''move-to-front (MTF) transform''' is an [[code|encoding]] of [[data]] (typically a stream of [[byte]]s) designed to improve the performance of [[entropy encoding]] techniques of [[Data compression|compression]]. When efficiently implemented, it is fast enough that its benefits usually justify including it as an extra step in data compression [[algorithm]].
 
This algorithm was first published by [[:ru:Рябко, Борис Яковлевич|Boris Ryabko]] under the name of "book stack" in 1980.<ref>{{cite journal |authorlast=Ryabko, |first=Boris Yakovlevich |author-link=:ru:Рябко, Борис Яковлевич |title=Data compression by means of a "book stack” |journal=Problems of Information Transmission |year=1980 |volume=16 |issue=4 |pages=265-269 |url=https://www.mathnet.ru/links/185b89a03327cb0466fb2846e55f6404/ppi1458.pdf |zbl=0466.94007}}</ref> Subsequently, it was rediscovered by J.K. Bentley et al. in 1986,<ref>{{cite journal|url=https://www.cs.cmu.edu/~sleator/papers/Adaptive-data-compression.htm|title=A Locally Adaptive Data Compression Scheme|author1first1=J.Jon L.Louis |last1=Bentley |author2author-link1=D.Jon D.Bentley (computer scientist) |first2=Daniel Dominic Kaplan |last2=Sleator |author3author-link2=R.Daniel E.Sleator |first3=Robert Endre |last3=Tarjan |author-link3= Robert Tarjan |author4first4=V. K. |last4=Wei |journal=Communications of the ACM|volume=29|issue=4|pages=320–330|year=1986|doi=10.1145/5684.5688|citeseerx=10.1.1.69.807|s2cid=5854590 }}</ref> as attested in the explanatory note.<ref>{{cite journal |author1last1=Ryabko, B.|first1=Boris Ya.Yakovlevich |author2author-link1=Horspool:ru:Рябко, Борис Яковлевич |first2=R. Nigel |author3last2=Horspool |author-link2=Nigel Horspool |first3=Gordon Villy |last3=Cormack, |author-link3=Gordon V.Cormack |title=Comments to: "A locally adaptive data compression scheme" by J. L. Bentley, D. D. Sleator, R. E. Tarjan and V. K. Wei.|journal=Comm. ACM|volume=30|year=1987|issue=9|pages=792–794|doi=10.1145/30401.315747|s2cid=16138142 |doi-access=free}}</ref>
 
==The transform==
Line 21:
1,1,13,1,1,1,0,0
 
{| class="wikitable" borderstyle="1font-family:monospace;"
|-
! Iteration
Line 80:
 
<syntaxhighlight lang="python">
# """mtfwiki.py"""
from typing import Tuple, Union
# Instead of always transmitting an "original" dictionary, it is simpler to just agree on an initial set.
# Here we use the 256 possible values of a byte:
common_dictionary = list(range(256))
 
from collections.abc import Generator, Iterable
def encode(plain_text: str) -> list[int]:
from itertools import chain
# Change to bytes for 256.
plain_text = plain_text.encode("utf-8")
 
# Changing the common dictionary is a bad idea. Make a copy.
dictionary = common_dictionary.copy()
 
class MoveToFront:
# Transformation
"""
compressed_text = list() # Regular array
rank>>> mtfwiki = 0MoveToFront()
>>> list(mtfwiki.encode("Wikipedia"))
[87, 105, 107, 1, 112, 104, 104, 3, 102]
>>> mtfwiki.decode([87, 105, 107, 1, 112, 104, 104, 3, 102])
'Wikipedia'
>>> list(mtfwiki.encode("wikipedia"))
[119, 106, 108, 1, 113, 105, 105, 3, 103]
>>> mtfwiki.decode([119, 106, 108, 1, 113, 105, 105, 3, 103])
'wikipedia'
"""
 
def __init__(self, common_dictionary: Iterable[int] = range(256)):
# Read in each character
for c in plain_text: """
rank = dictionary.index(c) # Find the rankInstead of thealways charactertransmitting inan the"original" dictionary [O(k)],
it is simpler to just agree on an initial set.
compressed_text.append(rank) # Update the encoded text
Here we use the 256 possible values of a byte.
"""
self.common_dictionary = common_dictionary
 
def encode(self, plain_text: str) -> Generator[int]:
# Update the dictionary [O(k)]
# Changing the common dictionary is a bad idea. Make a copy.pop(rank)
dictionary.insert(0, c= list(self.common_dictionary)
 
# Read in each character
return compressed_text # Return the encoded text
for c in plain_text.encode("latin-1"): # Change to bytes for 256.
# Find the rank of the character in the dictionary [O(k)]
rank = dictionary.index(c) # the encoded character
yield rank
 
# Update the dictionary [O(1)]
dictionary.pop(rank)
dictionary.insert(0, c)
 
def decode(self, compressed_data: Iterable[int]) -> str:
"""
Inverse function that recover the original text
"""
dictionary = list(self.common_dictionary)
plain_text = []
 
# Read in each rank in the encoded text
for rank in compressed_data:
# Read the character of that rank from the dictionary
plain_text.append(dictionary[rank])
 
# Update the dictionary
e = dictionary.pop(rank)
dictionary.insert(0, e)
 
return bytes(plain_text).decode("latin-1") # Return original string
</syntaxhighlight>
 
In this example we can see the MTF code taking advantage of the three repetitive {{code|i}}'s in the input word. The common dictionary here, however, is less than ideal since it is initialized with more commonly used [[ASCII]] printable characters put after little-used control codes, against the MTF code's design intent of keeping what's commonly used in the front. If one rotates the dictionary to put the more-used characters in earlier places, a better encoding can be obtained:
The inverse will recover the original text:
 
<syntaxhighlight lang="python">
def block32(x):
def decode(compressed_data: list[int]) -> str:
return (x + i for i in range(32))
compressed_text = compressed_data
dictionary = common_dictionary.copy()
plain_text = []
 
# Read in each rank in the encoded text
for rank in compressed_text:
# Read the character of that rank from the dictionary
plain_text.append(dictionary[rank])
 
class MoveToFrontMoreCommon(MoveToFront):
# Update the dictionary
"""
e = dictionary.pop(rank)
>>> mtfwiki = MoveToFrontMoreCommon()
dictionary.insert(0, e)
>>> list(mtfwiki.encode("Wikipedia"))
[55, 10, 12, 1, 17, 9, 9, 3, 7]
"""
 
def __init__(self):
return bytes(plain_text).decode("utf-8") # Return original string
super().__init__(
</syntaxhighlight>
chain( # Sort the ASCII blocks:
block32(ord("a") - 1), # first lowercase,
block32(ord("A") - 1), # then uppercase,
block32(ord("!") - 1), # punctuation/number,
block32(0), # the control code,
range(128, 256), # and finally the non-ASCII stuff
)
)
 
Example output:
<syntaxhighlight lang="pycon">
>>> import mtfwiki
>>> mtfwiki.encode('Wikipedia')
[87, 105, 107, 1, 112, 104, 104, 3, 102]
>>> mtfwiki.decode([119, 106, 108, 1, 113, 105, 105, 3, 103])
'wikipedia'
</syntaxhighlight>
 
if __name__ == "__main__":
In this example we can see the MTF code taking advantage of the three repetitive {{code|i}}'s in the input word. The common dictionary here, however, is less than ideal since it is initialized with more commonly used [[ASCII]] printable characters put after little-used control codes, against the MTF code's design intent of keeping what's commonly used in the front. If one rotates the dictionary to put the more-used characters in earlier places, a better encoding can be obtained:
import doctest
 
doctest.testmod()
<syntaxhighlight lang="pycon">
>>> import mtfwiki
>>> block32 = lambda x : [x + i for i in range(32)]
>>> # Sort the ASCII blocks: first lowercase, then uppercase, punctuation/number, and finally the control code and the non-ASCII stuff
>>> mtfwiki.common_dictionary = block32(0x60) + block32(0x40) + block32(0x20) + block32(0x00) + list(range(128, 256))
>>> mtfwiki.encode('Wikipedia')
[55, 10, 12, 1, 17, 9, 9, 3, 7]
</syntaxhighlight>
 
Line 166 ⟶ 189:
{{main | Self-organizing list#Move to Front Method (MTF) }}
 
* The term Move To Front (MTF) is also used in a slightly different context, as a type of a dynamic [[linked list]]. In an MTF list, each element is moved to the front when it is accessed.<ref>{{Cite journal | doi = 10.1145/359997.360000| title = On self-organizing sequential search heuristics| journal = Communications of the ACM| volume = 19| issue = 2| pages = 63–67| year = 1976| last1 = Rivest | first1 = R.Ronald Linn | author-link1=Ron Rivest | s2cid = 498886| doi-access = free}}</ref> This ensures that, over time, the more frequently accessed elements are easier to access.
 
==References==