The sentinel prefix-code counting lemma for Krieger's finite generator theorem (M2→C3) #
This file proves the self-contained combinatorial heart of the symbolic-coding step (C3) of
Krieger's finite generator theorem (issue #15): the sentinel code. The classical Krieger
construction (Downarowicz, Entropy in Dynamical Systems, §4.2 Lemma 4.2.5 / Exercise 3.8; Shields,
The Ergodic Theory of Discrete Sample Paths, on strongly-separated codes; Lind–Marcus, Symbolic
Dynamics and Coding, on prefix codes) needs to inject the (≤ kᴺ) N-names of a generator into
short blocks over a small Fin l alphabet, so that a decoder reading the raw code stream can
re-find the cutting points between successive blocks. The standard device is a sentinel: a fixed
symbol s : Fin l reserved as a terminator and used nowhere else inside a block. Then in any
concatenation of such blocks the sentinel marks exactly the block boundaries, so the stream is
uniquely decodable.
We formalize the construction in its sharpest, most reusable form, decoupled from any dynamics:
The code map
sentinelEncode emb ssends a data wordd : List (Fin (l - 1))(the "name" digits) to the blockd.map emb ++ [s], whereemb : Fin (l - 1) ↪ {a : Fin l // a ≠ s}is any embedding of the data alphabet into the non-sentinel letters. The block has length|d| + 1and ends in the sentinel.Counting / injection.
sentinelEncodeis injective (sentinelEncode_injective), and its fixed-length variantsentinelEncodeFn(sentinelEncodeFn_injective) embeds the length-mdata wordsFin m → Fin (l-1)into the length-(m+1)sentinel blocks; there are(l-1)ᵐof them (card_dataWord,card_nonSentinel). Hence any name set of size≤ (l-1)ᵐinjects into the blocks (exists_sentinelEncoding), with a log-count boundkᴺ ≤ (l-1)ᵐ ⇔ N·log k ≤ m·log(l-1)(pow_le_pow_iff_log) — i.e. blocks of lengthm + 1 = O(N)suffice wheneverlog k < log(l-1).Decodability (prefix-free / comma-free). The defining structural property: a
sentinelEncodeblock contains the sentinel only at its last position (sentinel_count_eq_one,notMem_sentinelData). Hence in a concatenation of blocks (sentinelEncodeList) the sentinels are exactly the block ends, and the decoder recovers the block decomposition by splitting at the sentinels (sentinelEncodeList_injective, the unique-decodability statement C3 consumes).
The interface C3 consumes #
The deliverable for the next wave is exists_sentinelEncoding: for an alphabet Fin l with a
reserved sentinel s and any finite name set Name, if Fintype.card Name fits
in (l-1)ᵐ (in particular whenever Name = Fin k-names of length N, card Name ≤ kᴺ, and
N · log k ≤ m · log(l-1), the pow_le_pow_iff_log regime that needs two free symbols plus a
sentinel, 2 ≤ l - 1), there is an injection enc : Name ↪ List (Fin l) whose images are
length-(m+1) sentinel blocks (each with count s = 1) and whose concatenations are uniquely
decodable (sentinelEncodeList_injective): distinct name-streams give distinct code-streams. This
is exactly the symbolic code the column-coding partition is read off from.
References #
- Tomasz Downarowicz, Entropy in Dynamical Systems, Cambridge (2011), §4.2 (Lemma 4.2.5) and Exercise 3.8 (the sentinel/marker coding of names).
- Paul C. Shields, The Ergodic Theory of Discrete Sample Paths, GSM 13, AMS (1996), §I.9 (strongly-separated / marker codes).
- Douglas Lind and Brian Marcus, An Introduction to Symbolic Dynamics and Coding, Cambridge (1995), §8 (prefix codes, unique decodability).
The non-sentinel sub-alphabet: the letters of Fin l other than the reserved sentinel s.
A sentinelEncode block uses only these for its data symbols, so the sentinel s can mark block
boundaries unambiguously.
Instances For
A data symbol (a non-sentinel letter) is, as a letter of Fin l, different from the
sentinel.
The sentinel encoding of a single data word #
The sentinel encoding of one data word. Given an embedding emb of the data alphabet
Fin (l - 1) into the non-sentinel letters and the sentinel s, a data word
d : List (Fin (l - 1)) is coded as d.map (fun i => emb i) ++ [s]: its letters mapped into the
non-sentinel alphabet, then terminated by the sentinel. The block has length |d| + 1 and ends in
s, which occurs nowhere else in the block (sentinel_count_eq_one) — the prefix-free /
comma-free property that makes a concatenation of blocks uniquely decodable.
Equations
Instances For
The data part d.map emb of a sentinelEncode block contains no sentinel: every mapped
letter is a non-sentinel letter, hence ≠ s. This is the structural core of decodability.
The sentinel occurs exactly once in a block: at the very end. The data part contributes no
sentinel (notMem_sentinelData) and the trailing [s] contributes exactly one. Counting the
sentinel therefore yields 1 — the marker that a decoder uses to find the block boundary.
Injectivity of the single-word sentinel encoding. The block d.map emb ++ [s] determines
the data word d: stripping the trailing sentinel recovers d.map emb, and emb is injective.
Hence distinct names map to distinct blocks — the injection C3 needs to faithfully receive names.
Unique decodability of a concatenation of sentinel blocks #
The sentinel encoding of a stream of data words. A list of data words
ds : List (List (Fin (l - 1))) (a sequence of N-names) is coded by concatenating the individual
sentinel blocks: (ds.map (sentinelEncode s emb)).flatten. The decoder re-finds the cutting points
because the sentinel s occurs only at the end of each block (sentinel_count_eq_one), so the
code stream is uniquely decodable (sentinelEncodeList_injective).
Equations
- Oseledets.Krieger.sentinelEncodeList s emb ds = (List.map (Oseledets.Krieger.sentinelEncode s emb) ds).flatten
Instances For
The decoder splits a block-prefixed stream at the first sentinel: the data part. Reading a
stream sentinelEncode b ++ tail and taking the maximal prefix of non-sentinel letters recovers
exactly the data part b.map emb of the first block — because the data part has no sentinel and is
immediately followed by the terminating sentinel.
The decoder splits a block-prefixed stream at the first sentinel: the remaining stream.
After the maximal non-sentinel prefix, dropping the terminating sentinel returns exactly the rest of
the stream tail — so the decoder can recurse on the remaining blocks.
Unique decodability of the sentinel code (the C3 deliverable). The stream encoding
sentinelEncodeList s emb is injective: distinct sequences of names produce distinct code
streams. This is the prefix-free / comma-free property that lets a decoder reconstruct the sequence
of blocks — hence the sequence of names — from the raw Fin l-stream alone. The proof recurses: the
maximal non-sentinel prefix recovers the first block's data (takeWhile_sentinelEncode_append,
giving the first name by sentinelEncode_injective), and dropping it leaves the rest of the stream
(drop_sentinel_dropWhile_sentinelEncode_append) to recurse on.
Fixed-length names: the counting / injection layer #
For the symbolic code C3 the names are fixed-length words Fin m → Fin (l - 1) (e.g. the N-name
of a generator as a function of the N coordinates). Each such name is encoded as one sentinel
block via sentinelEncodeFn. The crux is a counting statement: there are exactly (l - 1)ᵐ
fixed-length data words, so any name set of cardinality ≤ (l - 1)ᵐ injects into the sentinel
blocks.
The non-sentinel sub-alphabet has l - 1 letters. Removing the single reserved sentinel s
from Fin l leaves l - 1 data letters. This is the per-symbol capacity of the code; the available
length-m data words number (l - 1)ᵐ (card_dataWord).
There are (l - 1)ᵐ fixed-length data words. The length-m words over the non-sentinel
alphabet are the functions Fin m → Fin (l - 1), of which there are (l - 1)ᵐ. This is the number
of distinct sentinel blocks of length m + 1 available to the encoder — the supply of code words
that must injectively receive all names.
The fixed-length sentinel encoder. A length-m name f : Fin m → Fin (l - 1) (a word over
the non-sentinel data alphabet) is coded as the single sentinel block sentinelEncode s emb (List.ofFn f): its m data letters mapped into the non-sentinel alphabet, then the terminating
sentinel. The resulting block has length m + 1.
Equations
- Oseledets.Krieger.sentinelEncodeFn s emb f = Oseledets.Krieger.sentinelEncode s emb (List.ofFn f)
Instances For
Injectivity of the fixed-length encoder. Distinct length-m names map to distinct blocks:
sentinelEncode is injective (sentinelEncode_injective) and List.ofFn is injective, so the
composite is.
A canonical embedding of the data alphabet into the non-sentinel letters. Since
NonSentinel s has exactly l - 1 letters (card_nonSentinel), it is in bijection with
Fin (l - 1); the inverse bijection is the embedding the encoder uses to place data symbols. (Any
embedding works for the structural results; this fixes a concrete one so the code is computable from
the data.)
Equations
Instances For
The headline counting injection (the C3 deliverable). For an alphabet Fin l with a
reserved sentinel s and any finite name set Name whose cardinality fits in the supply of
length-m data words, Fintype.card Name ≤ (l - 1) ^ m, there is an injection
enc : Name ↪ List (Fin l) of names into sentinel blocks of length m + 1 — each block ends
in the sentinel and contains it nowhere else, so the family is prefix-free and concatenated blocks
are uniquely decodable (sentinelEncodeList_injective).
Concretely enc is the composite Name ↪ (Fin m → Fin (l - 1)) (which exists because
card Name ≤ (l - 1) ^ m = card (Fin m → Fin (l - 1)), card_dataWord) followed by the injective
fixed-length encoder sentinelEncodeFn (sentinelEncodeFn_injective). This is exactly the symbolic
code the column-coding partition is read off from: when Real.log k < Real.log (l - 1) and
m · Real.log (l - 1) ≥ N · Real.log k, the ≤ kᴺ-many N-names inject into blocks of length
m + 1 = O(N).
The logarithmic length bound: blocks of length O(N) suffice #
The log-count bridge. Over a non-sentinel alphabet of size b := l - 1 ≥ 2, a name set of
size ≤ kᴺ fits in the length-m data words (l - 1)ᵐ iff N · log k ≤ m · log (l - 1) — the
information-theoretic length bound. In particular, picking m ≥ N · log k / log (l - 1) gives
blocks of length m + 1 = O(N) whenever log k < log (l - 1); this is the regime
Real.log k < Real.log (l - 1) the Krieger threshold log (alphabet) > h supplies. The statement
is the clean real-logarithm reformulation of the integer inequality kᴺ ≤ (l - 1)ᵐ.