Documentation

Oseledets.Krieger.PrefixCode

The sentinel prefix-code counting lemma for Krieger's finite generator theorem (M2→C3) #

This file proves the self-contained combinatorial heart of the symbolic-coding step (C3) of Krieger's finite generator theorem (issue #15): the sentinel code. The classical Krieger construction (Downarowicz, Entropy in Dynamical Systems, §4.2 Lemma 4.2.5 / Exercise 3.8; Shields, The Ergodic Theory of Discrete Sample Paths, on strongly-separated codes; Lind–Marcus, Symbolic Dynamics and Coding, on prefix codes) needs to inject the (≤ kᴺ) N-names of a generator into short blocks over a small Fin l alphabet, so that a decoder reading the raw code stream can re-find the cutting points between successive blocks. The standard device is a sentinel: a fixed symbol s : Fin l reserved as a terminator and used nowhere else inside a block. Then in any concatenation of such blocks the sentinel marks exactly the block boundaries, so the stream is uniquely decodable.

We formalize the construction in its sharpest, most reusable form, decoupled from any dynamics:

The interface C3 consumes #

The deliverable for the next wave is exists_sentinelEncoding: for an alphabet Fin l with a reserved sentinel s and any finite name set Name, if Fintype.card Name fits in (l-1)ᵐ (in particular whenever Name = Fin k-names of length N, card Name ≤ kᴺ, and N · log k ≤ m · log(l-1), the pow_le_pow_iff_log regime that needs two free symbols plus a sentinel, 2 ≤ l - 1), there is an injection enc : Name ↪ List (Fin l) whose images are length-(m+1) sentinel blocks (each with count s = 1) and whose concatenations are uniquely decodable (sentinelEncodeList_injective): distinct name-streams give distinct code-streams. This is exactly the symbolic code the column-coding partition is read off from.

References #

The data alphabet: non-sentinel letters of Fin l #

@[reducible, inline]

The non-sentinel sub-alphabet: the letters of Fin l other than the reserved sentinel s. A sentinelEncode block uses only these for its data symbols, so the sentinel s can mark block boundaries unambiguously.

Equations
Instances For
    @[simp]
    theorem Oseledets.Krieger.NonSentinel.ne_sentinel {l : } {s : Fin l} (a : NonSentinel s) :
    a s

    A data symbol (a non-sentinel letter) is, as a letter of Fin l, different from the sentinel.

    The sentinel encoding of a single data word #

    def Oseledets.Krieger.sentinelEncode {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (d : List (Fin (l - 1))) :
    List (Fin l)

    The sentinel encoding of one data word. Given an embedding emb of the data alphabet Fin (l - 1) into the non-sentinel letters and the sentinel s, a data word d : List (Fin (l - 1)) is coded as d.map (fun i => emb i) ++ [s]: its letters mapped into the non-sentinel alphabet, then terminated by the sentinel. The block has length |d| + 1 and ends in s, which occurs nowhere else in the block (sentinel_count_eq_one) — the prefix-free / comma-free property that makes a concatenation of blocks uniquely decodable.

    Equations
    Instances For
      @[simp]
      theorem Oseledets.Krieger.sentinelEncode_length {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (d : List (Fin (l - 1))) :
      theorem Oseledets.Krieger.notMem_sentinelData {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (d : List (Fin (l - 1))) :
      sList.map (fun (i : Fin (l - 1)) => (emb i)) d

      The data part d.map emb of a sentinelEncode block contains no sentinel: every mapped letter is a non-sentinel letter, hence ≠ s. This is the structural core of decodability.

      @[simp]
      theorem Oseledets.Krieger.sentinel_count_eq_one {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (d : List (Fin (l - 1))) :

      The sentinel occurs exactly once in a block: at the very end. The data part contributes no sentinel (notMem_sentinelData) and the trailing [s] contributes exactly one. Counting the sentinel therefore yields 1 — the marker that a decoder uses to find the block boundary.

      Injectivity of the single-word sentinel encoding. The block d.map emb ++ [s] determines the data word d: stripping the trailing sentinel recovers d.map emb, and emb is injective. Hence distinct names map to distinct blocks — the injection C3 needs to faithfully receive names.

      Unique decodability of a concatenation of sentinel blocks #

      def Oseledets.Krieger.sentinelEncodeList {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (ds : List (List (Fin (l - 1)))) :
      List (Fin l)

      The sentinel encoding of a stream of data words. A list of data words ds : List (List (Fin (l - 1))) (a sequence of N-names) is coded by concatenating the individual sentinel blocks: (ds.map (sentinelEncode s emb)).flatten. The decoder re-finds the cutting points because the sentinel s occurs only at the end of each block (sentinel_count_eq_one), so the code stream is uniquely decodable (sentinelEncodeList_injective).

      Equations
      Instances For
        @[simp]
        @[simp]
        theorem Oseledets.Krieger.sentinelEncodeList_cons {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (b : List (Fin (l - 1))) (ds : List (List (Fin (l - 1)))) :
        theorem Oseledets.Krieger.takeWhile_sentinelEncode_append {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (b : List (Fin (l - 1))) (tail : List (Fin l)) :
        List.takeWhile (fun (a : Fin l) => decide (a s)) (sentinelEncode s emb b ++ tail) = List.map (fun (i : Fin (l - 1)) => (emb i)) b

        The decoder splits a block-prefixed stream at the first sentinel: the data part. Reading a stream sentinelEncode b ++ tail and taking the maximal prefix of non-sentinel letters recovers exactly the data part b.map emb of the first block — because the data part has no sentinel and is immediately followed by the terminating sentinel.

        theorem Oseledets.Krieger.drop_sentinel_dropWhile_sentinelEncode_append {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) (b : List (Fin (l - 1))) (tail : List (Fin l)) :
        (List.dropWhile (fun (a : Fin l) => decide (a s)) (sentinelEncode s emb b ++ tail)).tail = tail

        The decoder splits a block-prefixed stream at the first sentinel: the remaining stream. After the maximal non-sentinel prefix, dropping the terminating sentinel returns exactly the rest of the stream tail — so the decoder can recurse on the remaining blocks.

        Unique decodability of the sentinel code (the C3 deliverable). The stream encoding sentinelEncodeList s emb is injective: distinct sequences of names produce distinct code streams. This is the prefix-free / comma-free property that lets a decoder reconstruct the sequence of blocks — hence the sequence of names — from the raw Fin l-stream alone. The proof recurses: the maximal non-sentinel prefix recovers the first block's data (takeWhile_sentinelEncode_append, giving the first name by sentinelEncode_injective), and dropping it leaves the rest of the stream (drop_sentinel_dropWhile_sentinelEncode_append) to recurse on.

        Fixed-length names: the counting / injection layer #

        For the symbolic code C3 the names are fixed-length words Fin m → Fin (l - 1) (e.g. the N-name of a generator as a function of the N coordinates). Each such name is encoded as one sentinel block via sentinelEncodeFn. The crux is a counting statement: there are exactly (l - 1)ᵐ fixed-length data words, so any name set of cardinality ≤ (l - 1)ᵐ injects into the sentinel blocks.

        @[simp]

        The non-sentinel sub-alphabet has l - 1 letters. Removing the single reserved sentinel s from Fin l leaves l - 1 data letters. This is the per-symbol capacity of the code; the available length-m data words number (l - 1)ᵐ (card_dataWord).

        @[simp]
        theorem Oseledets.Krieger.card_dataWord {l : } (m : ) :
        Fintype.card (Fin mFin (l - 1)) = (l - 1) ^ m

        There are (l - 1)ᵐ fixed-length data words. The length-m words over the non-sentinel alphabet are the functions Fin m → Fin (l - 1), of which there are (l - 1)ᵐ. This is the number of distinct sentinel blocks of length m + 1 available to the encoder — the supply of code words that must injectively receive all names.

        def Oseledets.Krieger.sentinelEncodeFn {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) {m : } (f : Fin mFin (l - 1)) :
        List (Fin l)

        The fixed-length sentinel encoder. A length-m name f : Fin m → Fin (l - 1) (a word over the non-sentinel data alphabet) is coded as the single sentinel block sentinelEncode s emb (List.ofFn f): its m data letters mapped into the non-sentinel alphabet, then the terminating sentinel. The resulting block has length m + 1.

        Equations
        Instances For
          @[simp]
          theorem Oseledets.Krieger.sentinelEncodeFn_length {l : } (s : Fin l) (emb : Fin (l - 1) NonSentinel s) {m : } (f : Fin mFin (l - 1)) :
          (sentinelEncodeFn s emb f).length = m + 1

          Injectivity of the fixed-length encoder. Distinct length-m names map to distinct blocks: sentinelEncode is injective (sentinelEncode_injective) and List.ofFn is injective, so the composite is.

          noncomputable def Oseledets.Krieger.dataEmbedding {l : } (s : Fin l) :

          A canonical embedding of the data alphabet into the non-sentinel letters. Since NonSentinel s has exactly l - 1 letters (card_nonSentinel), it is in bijection with Fin (l - 1); the inverse bijection is the embedding the encoder uses to place data symbols. (Any embedding works for the structural results; this fixes a concrete one so the code is computable from the data.)

          Equations
          Instances For
            theorem Oseledets.Krieger.exists_sentinelEncoding {l : } {Name : Type u_1} [Fintype Name] {m : } (s : Fin l) (hcard : Fintype.card Name (l - 1) ^ m) :
            ∃ (enc : Name List (Fin l)), (∀ (x : Name), (enc x).length = m + 1) (∀ (x : Name), List.count s (enc x) = 1) Function.Injective enc

            The headline counting injection (the C3 deliverable). For an alphabet Fin l with a reserved sentinel s and any finite name set Name whose cardinality fits in the supply of length-m data words, Fintype.card Name ≤ (l - 1) ^ m, there is an injection enc : Name ↪ List (Fin l) of names into sentinel blocks of length m + 1 — each block ends in the sentinel and contains it nowhere else, so the family is prefix-free and concatenated blocks are uniquely decodable (sentinelEncodeList_injective).

            Concretely enc is the composite Name ↪ (Fin m → Fin (l - 1)) (which exists because card Name ≤ (l - 1) ^ m = card (Fin m → Fin (l - 1)), card_dataWord) followed by the injective fixed-length encoder sentinelEncodeFn (sentinelEncodeFn_injective). This is exactly the symbolic code the column-coding partition is read off from: when Real.log k < Real.log (l - 1) and m · Real.log (l - 1) ≥ N · Real.log k, the ≤ kᴺ-many N-names inject into blocks of length m + 1 = O(N).

            The logarithmic length bound: blocks of length O(N) suffice #

            theorem Oseledets.Krieger.pow_le_pow_iff_log {k N l m : } (hk : 0 < k) (hl : 1 < l - 1) :
            k ^ N (l - 1) ^ m N * Real.log k m * Real.log ↑(l - 1)

            The log-count bridge. Over a non-sentinel alphabet of size b := l - 1 ≥ 2, a name set of size ≤ kᴺ fits in the length-m data words (l - 1)ᵐ iff N · log k ≤ m · log (l - 1) — the information-theoretic length bound. In particular, picking m ≥ N · log k / log (l - 1) gives blocks of length m + 1 = O(N) whenever log k < log (l - 1); this is the regime Real.log k < Real.log (l - 1) the Krieger threshold log (alphabet) > h supplies. The statement is the clean real-logarithm reformulation of the integer inequality kᴺ ≤ (l - 1)ᵐ.