<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>Daniel Duan's Articles About Performance</title>
        <link>https://duan.ca/tag/performance/</link>
        <atom:link href="https://duan.ca/tag/performance/feed.xml" rel="self" type="application/rss+xml" />
            <item>
                <title>TOMLDecoder Is Now Faster Than C (Thanks to AI)</title>
                <description>&#60;p&#62;Recently,
I gave my TOML library written in Swift &#60;a href=&#34;/2025/12/10/TOMLDecoder-0.4.1/&#34;&#62;an 800% speed boost&#60;/a&#62;.
The natural question after that is:
how much faster can I push it?&#60;/p&#62;
&#60;details&#62;
&#60;summary&#62;
I&#39;m happy to report that TOMLDecoder now parses the &#60;a href=&#34;https://github.com/dduan/TOMLDecoder/blob/cea8f0bee33f37e0fcc33b566a742485c71196e7/Sources/Resources/fixtures/twitter.toml&#34;&#62;Twitter payload example&#60;/a&#62; 1.8x faster than the C library &#60;a href=&#34;https://github.com/cktan/tomlc99&#34;&#62;tomlc99&#60;/a&#62;, and 5x faster than &#60;a href=&#34;https://github.com/marzer/tomlplusplus&#34;&#62;TOML++&#60;/a&#62;.
&#60;/summary&#62;
&#60;p&#62;I tried to be as charitable as possible for the non-Swift libraries while keeping the call sites in Swift.
For example,
it takes time to create or copy the UTF-8 bytes of a &#60;code&#62;Swift.String&#60;/code&#62; into a contiguous region.
And that&#39;s not counted towards the other libraries&#39; parsing time.
TOML++ runs faster with exceptions enabled.
So that&#39;s the path I chose to benchmark.
When bridging the C++ code,
I made sure there&#39;s no allocation,
no input/output checks, etc.,
so that the bridging overhead is trivial.&#60;/p&#62;
&#60;p&#62;Here&#39;s the benchmark code run repeatedly to collect an average,
with warmups ahead of time:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;func benchmarkTOMLDecoder(source: String) throws -&#38;gt; Double {
    let start = CFAbsoluteTimeGetCurrent()
    let table = try TOMLTable(source: source)
    let end = CFAbsoluteTimeGetCurrent()
    blackhole(table)
    return end - start
}

func benchmarkCTOML99(source: String) -&#38;gt; Double {
    var source = source
    var duration: Double = 0
    source.withUTF8 {
        $0.withMemoryRebound(to: CChar.self) { buffer in
            let baseAddress = UnsafeMutableRawPointer(mutating: buffer.baseAddress!)
            let start = CFAbsoluteTimeGetCurrent()
            let table = toml_parse(baseAddress, nil, 0)
            duration = CFAbsoluteTimeGetCurrent() - start
            blackhole(table)
        }
    }
    return duration
}

func benchmarkCTOMLPlusPlus(source: String) -&#38;gt; Double {
    var source = source
    var duration: Double = 0
    source.withUTF8 {
        $0.withMemoryRebound(to: CChar.self) { buffer in
            let start = CFAbsoluteTimeGetCurrent()
            let table = tomlpp_parse(buffer.baseAddress, buffer.count)
            duration = CFAbsoluteTimeGetCurrent() - start
            blackhole(table)
        }
    }
    return duration
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;where &#60;code&#62;tomlpp_parse&#60;/code&#62; is a minimal wrapper for the TOML++ library:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-cpp&#34;&#62;void *tomlpp_parse(const char *conf, size_t conf_len) {
    try {
        static toml::table table{};
        table = toml::parse(std::string_view{conf, conf_len});
        return static_cast&#38;lt;void *&#38;gt;(&#38;amp;table);
    } catch (...) {
        return nullptr;
    }
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;If any of these measures are unfair to the C/C++ libraries,
I&#39;d love your feedback!&#60;/p&#62;
&#60;/details&#62;
&#60;p&#62;Here&#39;s the output of the benchmark program I wrote:&#60;/p&#62;
&#60;pre&#62;&#60;code&#62;Benchmarking TOML parsers...
File size: 443461 bytes

Warming up...
Running 100 iterations...

Results:
═══════════════════════════════════════════════════════════
TOMLDecoder:
  Average: 1.232 ms
  Min:     1.203 ms
  Max:     1.332 ms

cTOML99:
  Average: 2.226 ms
  Min:     2.190 ms
  Max:     2.341 ms

cTOMLPlusPlus:
  Average: 6.107 ms
  Min:     6.038 ms
  Max:     6.377 ms

TOMLDecoder is 1.81x faster than cTOML99
TOMLDecoder is 4.96x faster than cTOMLPlusPlus
═══════════════════════════════════════════════════════════
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;I charted the wall clock time and instruction counts over the commit history.
You can see that the latest release is a lot faster than 0.4.1:&#60;/p&#62;
&#60;iframe id=&#34;benchmark-iframe&#34; src=&#34;/assets/2026/01/tomldecoder-0.4.3-improvements.html&#34; width=&#34;100%&#34; height=&#34;1200&#34; frameborder=&#34;0&#34; style=&#34;border: none; display: block; margin: 20px 0;&#34;&#62;&#60;/iframe&#62;
&#60;script&#62;
window.addEventListener(&#39;message&#39;, function(event) {
    if (event.data.type === &#39;resize&#39;) {
        const iframe = document.getElementById(&#39;benchmark-iframe&#39;);
        if (iframe) {
            iframe.style.height = event.data.height + &#39;px&#39;;
            iframe.style.transition = &#39;none&#39;;
        }
    }
});
&#60;/script&#62;
&#60;p&#62;... and, the majority of these commits are authored by AI! How did that happen?&#60;/p&#62;
&#60;h2&#62;It&#39;s old-fashioned engineering, baby!&#60;/h2&#62;
&#60;p&#62;I ended the &#60;a href=&#34;/2025/12/10/TOMLDecoder-0.4.1/&#34;&#62;last post&#60;/a&#62; with the following (emphasis in &#60;strong&#62;bold&#60;/strong&#62;):&#60;/p&#62;
&#60;blockquote&#62;
&#60;p&#62;...the project also gained a bunch of infra improvements.&#60;/p&#62;
&#60;ul&#62;
&#60;li&#62;It has a DocC-based documentation site.&#60;/li&#62;
&#60;li&#62;&#60;strong&#62;The entirety of the official test suite is now programmatically imported as unit tests.&#60;/strong&#62;&#60;/li&#62;
&#60;li&#62;&#60;strong&#62;The source code style is now enforced by swiftformat&#60;/strong&#62;&#60;/li&#62;
&#60;li&#62;Platform checks are more comprehensive and modern on CI.&#60;/li&#62;
&#60;li&#62;&#60;strong&#62;Benchmarks are now modernized with ordo-one/package-benchmark.&#60;/strong&#62;&#60;/li&#62;
&#60;/ul&#62;
&#60;/blockquote&#62;
&#60;p&#62;If you set out to optimize the runtime performance of a software project,
infra improvements like these ensure that
an engineer can explore optimization options with confidence that
they won&#39;t break the expected behavior,
and that their efforts can be measured objectively.&#60;/p&#62;
&#60;p&#62;Most importantly,
as detailed in the last post,
the architecture of the TOML parser has received some significant upgrades.
This type of change is rare in a small project,
and I don&#39;t expect it to happen again in the next phase of optimization.&#60;/p&#62;
&#60;p&#62;I set up a separate project that calls into TOMLDecoder
so that I can profile it with Instruments.&#60;/p&#62;
&#60;p&#62;It was during the holidays, and
although the idea of trying my hand at micro-optimizing the code
and gradually squeezing out performance juice sounded really fun,
I also had a bunch of travel planned.
So what else was there to do?&#60;/p&#62;
&#60;p&#62;I booted up codex.&#60;/p&#62;
&#60;h2&#62;gpt-5.2-codex, my performance engineer&#60;/h2&#62;
&#60;p&#62;For the most part,
I simply fed this prompt to codex over and over again:&#60;/p&#62;
&#60;blockquote&#62;
&#60;p&#62;Objective: Try to make the p50 of &#38;quot;parse twitter.toml&#38;quot; benchmark improve by &#38;gt; 1.1% on instructions or retains compared to the &#60;code&#62;main&#60;/code&#62; branch. Improvement on either is acceptable as a success, but regression in either should be considered a failure. Other metrics in the benchmark do not matter.&#60;/p&#62;
&#60;p&#62;Verify iteratively:&#60;/p&#62;
&#60;ol&#62;
&#60;li&#62;Make code changes&#60;/li&#62;
&#60;li&#62;Format with &#60;code&#62;make format&#60;/code&#62;.&#60;/li&#62;
&#60;li&#62;Make sure all tests pass by running &#60;code&#62;swift test&#60;/code&#62;.&#60;/li&#62;
&#60;li&#62;Create a branch whose name is prefixed with &#60;code&#62;cc/&#60;/code&#62;.&#60;/li&#62;
&#60;li&#62;Commit all changes. Include description of the optimization as body of the commit message.&#60;/li&#62;
&#60;li&#62;Use Scripts/benchmark.sh to run the benchmark, recording its output in a text file&#60;/li&#62;
&#60;li&#62;If the benchmark result meets the improvement threshold, cherry-pick the change onto main. Otherwise, commit the benchmark results file to the branch you created, switch back to main, and start over.&#60;/li&#62;
&#60;/ol&#62;
&#60;p&#62;When you run the benchmark script, NEVER use &#60;code&#62;HEAD&#60;/code&#62; as its argument. Use explicit SHAs. Only use Scripts/benchmark.sh SHA_OF_BASE SHA_OF_TARGET to run the benchmarks. Do not try to run the underlying commands directly.&#60;/p&#62;
&#60;p&#62;You must NOT look at the content of Benchmarks/, or the content of Sources/Resources.&#60;/p&#62;
&#60;p&#62;To give you some direction, I&#39;ve profiled parsing the twitter example, and included the inverted time profile call tree in /tmp/trace-tree.txt.&#60;/p&#62;
&#60;/blockquote&#62;
&#60;p&#62;The prompt changed gradually in these ways:&#60;/p&#62;
&#60;ol&#62;
&#60;li&#62;Wording became more streamlined as I figured out how gpt-5.2-codex interprets specific things.&#60;/li&#62;
&#60;li&#62;The optimization threshold decreased as the lower-hanging fruit got picked.&#60;/li&#62;
&#60;li&#62;The benchmark to optimize changed a bunch of times, because the benchmarks have different data profiles.&#60;/li&#62;
&#60;/ol&#62;
&#60;p&#62;Each time the optimization threshold was met,
I collected another time profile from Instruments with the latest change,
and restarted the session with the same prompt.&#60;/p&#62;
&#60;p&#62;I actually started the journey with gpt-5.1-codex-max.
It would find 5-10% improvements consecutively at the beginning.
Then it would start to struggle,
then I&#39;d switch to gpt-5.2-codex with the default &#38;quot;Medium&#38;quot; setting,
then &#38;quot;High&#38;quot;, and eventually &#38;quot;Extra high&#38;quot;.
Towards the end,
the LLM could barely find any speed improvements
without regressing other benchmarks in some way.
That&#39;s when I decided it&#39;s time to cut a release.&#60;/p&#62;
&#60;h2&#62;My observations of the model&#60;/h2&#62;
&#60;p&#62;Despite occasional struggles with conventional Swift coding style,
I find that gpt-5.2-codex is good at analyzing the flow of the parser
and finding ways to short-circuit certain logic.
These types of discoveries made the parser quite a bit faster.&#60;/p&#62;
&#60;p&#62;It replaced key comparisons in a hot loop with hash value comparisons,
which brought a significant speedup.
In retrospect, the idea seems fairly obvious,
but I don&#39;t think I would have been bold enough to try it myself.&#60;/p&#62;
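&#60;p&#62;The idea can be sketched like this (illustrative code only; the names are invented and TOMLDecoder&#39;s actual internals differ): precompute a hash for each key token once, and let a cheap integer comparison reject most mismatches before the expensive byte-by-byte comparison runs.&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;// Illustrative sketch, not TOMLDecoder&#39;s real code.
struct KeyToken: Equatable {
    let bytes: [UInt8]
    let hash: Int // computed once, when the token is created

    init(_ bytes: [UInt8]) {
        self.bytes = bytes
        var hasher = Hasher()
        bytes.withUnsafeBytes { hasher.combine(bytes: $0) }
        self.hash = hasher.finalize()
    }

    static func == (lhs: KeyToken, rhs: KeyToken) -&#38;gt; Bool {
        // The integer comparison rejects most mismatches before the
        // byte-by-byte comparison ever runs.
        lhs.hash == rhs.hash &#38;amp;&#38;amp; lhs.bytes == rhs.bytes
    }
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;Within a single process, equal byte sequences always hash equally, so the fast path can never reject a true match.&#60;/p&#62;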
&#60;p&#62;The LLM has a few favorite things to try at the start of each session.&#60;/p&#62;
&#60;ul&#62;
&#60;li&#62;It would see a linear search and try to replace it with a dictionary lookup.&#60;/li&#62;
&#60;li&#62;It would try unrolling loops (in a few cases, this actually helped).&#60;/li&#62;
&#60;li&#62;It would reserve array capacities ahead of time.&#60;/li&#62;
&#60;li&#62;It would eliminate copies by converting things into classes.&#60;/li&#62;
&#60;/ul&#62;
&#60;p&#62;But then the benchmark would regress,
forcing it to explore other paths.&#60;/p&#62;
&#60;p&#62;Although my prompt tells the model not to look at the benchmark itself,
it sometimes goes and does it anyway.
I suppose benchmark-maxing is too great a temptation for it to resist?&#60;/p&#62;
&#60;p&#62;As reported by &#60;a href=&#34;https://steipete.me/posts/2025/shipping-at-inference-speed&#34;&#62;others&#60;/a&#62;,
I also observed that gpt-5.2-codex would spend a lot of time just analyzing,
before attempting any changes.
The code change it produces is almost always one-shot.
It rarely goes back and revises the idea it&#39;s attempting to implement.&#60;/p&#62;
&#60;h2&#62;Conclusions&#60;/h2&#62;
&#60;p&#62;Good engineering practices continue to pay dividends with LLMs.&#60;/p&#62;
&#60;p&#62;TOMLDecoder reached a point where its runtime performance is itself a feature worth talking about.&#60;/p&#62;
&#60;p&#62;The setup of this project can serve as a benchmark for LLMs, I think?
Here&#39;s a prompt,
a concrete, measurable outcome represented by numbers,
and a huge test suite.
How far can you push those numbers?&#60;/p&#62;
</description>
                <pubDate>Thu, 01 Jan 2026 11:07:05 -0800</pubDate>
                <link>https://duan.ca/2026/01/01/TOMLDecoder-Is-Faster-Than-C/</link>
                <guid isPermaLink="true">https://duan.ca/2026/01/01/TOMLDecoder-Is-Faster-Than-C/</guid>
            </item>
            <item>
                <title>TOMLDecoder 0.4 is 800% Faster</title>
                <description>&#60;p&#62;I just released version 0.4.1 of &#60;a href=&#34;https://github.com/dduan/TOMLDecoder&#34;&#62;TOMLDecoder&#60;/a&#62;,
a TOML 1.0 parser
and &#60;a href=&#34;https://developer.apple.com/documentation/swift/codable&#34;&#62;decoder&#60;/a&#62; implemented in pure Swift.
When decoding a TOML document such as &#60;a href=&#34;https://github.com/dduan/TOMLDecoder/blob/cea8f0bee33f37e0fcc33b566a742485c71196e7/Sources/Resources/fixtures/twitter.toml&#34;&#62;this twitter payload&#60;/a&#62;,
TOMLDecoder 0.4.1 is roughly 800% faster by wall clock time than 0.3.x.
In this post, I’ll discuss how this was achieved.&#60;/p&#62;
&#60;p&#62;&#60;em&#62;tl;dr: among other things,
the gains come from making the parsing algorithm lazier
and eliminating the overhead of bound checking when accessing substrings.&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;&#60;em&#62;Update:
An earlier version of this post claimed that adopting Span eliminates the cost of all bound checking
when accessing the underlying bytes of the TOML content;
that turned out to be wrong.
The reality is more interesting.
The post has been revised to discuss what really brought the performance gains
after adopting Span.&#60;/em&#62;&#60;/p&#62;
&#60;h2&#62;The Benchmark&#60;/h2&#62;
&#60;p&#62;TOMLDecoder now includes benchmarks implemented with &#60;a href=&#34;https://github.com/ordo-one/package-benchmark&#34;&#62;ordo-one/package-benchmark&#60;/a&#62;.
I plotted the median from the aforementioned benchmark results below.
Each chart includes data points for deserializing the TOML document,
and decoding it on top.
(Unsurprisingly, decoding takes a bit longer.)&#60;/p&#62;
&#60;p&#62;The results show
wall clock time,
CPU instructions,
as well as retain count all trending down significantly.&#60;/p&#62;
&#60;p&#62;In addition to the before and after,
there&#39;s an extra data point measured specifically prior to adopting Swift&#39;s &#60;code&#62;Span&#60;/code&#62;.
More on that later.&#60;/p&#62;
&#60;iframe id=&#34;benchmark-iframe&#34; src=&#34;/assets/2025/12/tomldecoder-0.4.0-benchmark-charts.html&#34; width=&#34;100%&#34; height=&#34;1200&#34; frameborder=&#34;0&#34; style=&#34;border: none; display: block; margin: 20px 0; min-height: 1200px;&#34;&#62;&#60;/iframe&#62;
&#60;script&#62;
window.addEventListener(&#39;message&#39;, function(event) {
    if (event.data.type === &#39;resize&#39;) {
        const iframe = document.getElementById(&#39;benchmark-iframe&#39;);
        if (iframe) {
            iframe.style.height = event.data.height + &#39;px&#39;;
            iframe.style.transition = &#39;none&#39;;
        }
    }
});
&#60;/script&#62;
&#60;h2&#62;How to make a parser go fast&#60;/h2&#62;
&#60;h3&#62;Improving data structure and algorithms&#60;/h3&#62;
&#60;p&#62;... also known as cheating.
Yes, really.&#60;/p&#62;
&#60;p&#62;In 0.3.x, &#60;code&#62;TOMLDecoder&#60;/code&#62; behaves like &#60;a href=&#34;https://developer.apple.com/documentation/foundation/jsonserialization&#34;&#62;JSONSerialization&#60;/a&#62;.
When you ask it to decode TOML data
with &#60;code&#62;TOMLDecoder.tomlTable(from:)&#60;/code&#62;,
it goes through the entire document
and creates matching container structures within it.
For each TOML table, it creates a &#60;code&#62;[String: Any]&#60;/code&#62;,
for each TOML array, it creates a &#60;code&#62;[Any]&#60;/code&#62;.
When a table contains an array,
for example,
a corresponding &#60;code&#62;[&#38;quot;key&#38;quot;: [...]]&#60;/code&#62; entry is created to match.
Along the way, the parser also validates the leaf types,
so things like an ill-formed date cause an error to be thrown.
The end result is a &#60;code&#62;[String: Any]&#60;/code&#62; in which
everything is known to be valid.&#60;/p&#62;
&#60;p&#62;A number of things are slow in this process:&#60;/p&#62;
&#60;ul&#62;
&#60;li&#62;The frequent creation and subsequent usage of intermediary Swift arrays and dictionaries require heap allocations.&#60;/li&#62;
&#60;li&#62;Validating every leaf value takes time.&#60;/li&#62;
&#60;li&#62;Retrieved values are &#60;code&#62;Any&#60;/code&#62;s, so you have to cast them to the expected type to consume them.&#60;/li&#62;
&#60;/ul&#62;
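&#60;p&#62;The last point looks like this at a call site (a generic sketch of the &#60;code&#62;[String: Any]&#60;/code&#62; pattern, not TOMLDecoder&#39;s actual API):&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;// Every access goes through `Any` and needs a runtime cast.
let table: [String: Any] = [&#38;quot;ip&#38;quot;: &#38;quot;10.0.0.1&#38;quot;, &#38;quot;port&#38;quot;: 8080]
guard let ip = table[&#38;quot;ip&#38;quot;] as? String else {
    fatalError(&#38;quot;expected a string&#38;quot;)
}
&#60;/code&#62;&#60;/pre&#62;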
&#60;p&#62;TOMLDecoder 0.4 does away with all of that.&#60;/p&#62;
&#60;p&#62;To represent the containers
and leaf values,
0.4 introduces some lightweight structs.
These structs don&#39;t manage the actual memory used to store their contents.
As the parser works through the bytes of a TOML document,
it creates these lightweight data types to record the shape of the document,
as well as the byte offsets of the leaf values.
These intermediary data are stored in a centralized location
to avoid unnecessary heap allocations.&#60;/p&#62;
&#60;p&#62;Here&#39;s what I mean by &#38;quot;cheating&#38;quot;:
during this phase,
the parser doesn&#39;t do much validation of the leaf values.
What it does is more akin to &#38;quot;lexing&#38;quot;:
it finds the tokens that could represent a leaf value
and remembers where they are.
No work is done to actually validate and create the leaf values.&#60;/p&#62;
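&#60;p&#62;A minimal sketch of the idea (the names are invented for illustration; the real types differ): each recorded value is a small struct holding a kind tag and byte offsets into the source, and all of them live in one flat array:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;// Illustrative only: a token records where a value&#39;s bytes are,
// not the value itself.
enum TokenKind { case table, array, string, integer, float, boolean, date }

struct ValueToken {
    let kind: TokenKind
    let bytes: Range&#38;lt;Int&#38;gt; // byte offsets into the source document
}

// All tokens live in one flat array, so parsing costs a few array
// growths instead of one heap allocation per container.
struct ParsedDocument {
    var tokens: [ValueToken] = []
}
&#60;/code&#62;&#60;/pre&#62;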
&#60;p&#62;To retrieve any values from the result,
you must state what type is expected:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;// a valid TOML document is always a table at the root level
let serverIP = try TOMLTable(source: tomlString)
	.string(forKey: &#38;quot;ip&#38;quot;) // validate this token as a `String`
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;This is an API change.
It delays the validation work,
and helps avoid conversions from &#60;code&#62;Any&#60;/code&#62;.
If you only need one field,
no validation is necessary on the rest of the leaf values in the entire document.&#60;/p&#62;
&#60;p&#62;Swift&#39;s decoding APIs ask for typed access:
if your &#60;code&#62;Codable&#60;/code&#62; type has a &#60;code&#62;Date&#60;/code&#62; field,
you ask the container for a &#60;code&#62;Date&#60;/code&#62;;
if the matching value at that spot is of a different type,
an error is thrown.
So the more efficient access pattern benefits the decoding process as well.&#60;/p&#62;
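&#60;p&#62;For example, with a &#60;code&#62;Codable&#60;/code&#62; type like the one below, decoding asks the table for exactly one &#60;code&#62;String&#60;/code&#62; and one &#60;code&#62;Int&#60;/code&#62;, validating only those two tokens (the &#60;code&#62;decode(_:from:)&#60;/code&#62; call is a sketch; check the library&#39;s documentation for the exact entry point):&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;import TOMLDecoder

struct Server: Codable {
    let ip: String
    let port: Int
}

let toml = &#38;quot;&#38;quot;&#38;quot;
ip = &#38;quot;10.0.0.1&#38;quot;
port = 8080
&#38;quot;&#38;quot;&#38;quot;

// Hypothetical call site; only the two requested fields get validated.
let server = try TOMLDecoder().decode(Server.self, from: toml)
&#60;/code&#62;&#60;/pre&#62;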
&#60;h3&#62;Eliminating bound checks&#60;/h3&#62;
&#60;p&#62;A major source of slowness in TOMLDecoder 0.3.x
comes from inefficient patterns when accessing the underlying bytes of a TOML document.&#60;/p&#62;
&#60;p&#62;The parser holds a reference to the original string,
and hands &#60;code&#62;String.UTF8View.SubSequence&#60;/code&#62;s to small functions to descend on.
A typical piece of the parser might look like this:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;func skipWhitespaces(_ text: inout String.UTF8View.SubSequence) {
    var i = text.startIndex
    while i &#38;lt; text.endIndex {
        if !isWhitespace(text[i]) { // very slow!
            break
        }
        text.formIndex(after: &#38;amp;i)
    }
    text = text[i...]
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;Using UTF8View makes sure that we aren&#39;t dealing with &#60;code&#62;Character&#60;/code&#62;s,
which could have variable lengths.
However,
accessing the bytes in this way introduces multiple rounds of bound checks
that end up being super expensive in the hot path of the parser:&#60;/p&#62;
&#60;ol&#62;
&#60;li&#62;The standard library needs to check that an index is valid for the &#60;code&#62;SubSequence&#60;/code&#62;, aka &#60;code&#62;Substring&#60;/code&#62;,
by comparing it against the start and end indices.&#60;/li&#62;
&#60;li&#62;Then, the index is used to access the underlying &#60;code&#62;UTF8View&#60;/code&#62;; at this point,
the standard library checks whether the index is out of bounds again.&#60;/li&#62;
&#60;li&#62;At the end, the library goes into the buffer pointer of the string to retrieve the actual byte.&#60;/li&#62;
&#60;/ol&#62;
&#60;p&#62;(All of that assumes that the string&#39;s buffer is contiguously stored in memory.
There&#39;s an even slower path that I could eliminate by ensuring the string is native.)&#60;/p&#62;
&#60;p&#62;A parser does a whole lot of such accesses.
The cost of these bound checks seriously adds up.&#60;/p&#62;
&#60;p&#62;Since the release of TOMLDecoder 0.3.0,
Swift has gained a whole set of features that led to the introduction of &#60;a href=&#34;https://github.com/swiftlang/swift-evolution/blob/main/proposals/0447-span-access-shared-contiguous-storage.md&#34;&#62;Span&#60;/a&#62;.
&#60;code&#62;Span&#60;/code&#62; is built on compile-time lifetime checks.
These checks guarantee safety when accessing its contents.
The same function updated for &#60;code&#62;Span&#60;/code&#62; looks extremely similar to the original:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;func skipWhitespace(
    bytes: Span&#38;lt;UTF8.CodeUnit&#38;gt;, // aka Span&#38;lt;UInt8&#38;gt;
    remainingBytes: inout Range&#38;lt;Int&#38;gt;,
) {
    var i = remainingBytes.lowerBound
    while i &#38;lt; bytes.count {
        if !isWhitespace(bytes[i]) { break }
        i += 1
    }
    remainingBytes = i ..&#38;lt; remainingBytes.upperBound
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;Here,
the subscript access of &#60;code&#62;bytes&#60;/code&#62; does not incur multiple rounds of bound checks!
Rather, it skips step 1,
which eliminates 2 integer comparisons per access,
a 2/3 reduction in bound check overhead.
Further, because the compiler can see the access pattern more clearly,
it can heuristically eliminate even the final remaining bound checks in some cases.&#60;/p&#62;
&#60;p&#62;Not having to perform all the bound checks
in the tight loop of the parser results in significant performance gains,
as shown in the benchmark results.&#60;/p&#62;
&#60;p&#62;&#60;em&#62;Here&#39;s the kicker&#60;/em&#62;.
With &#60;code&#62;Span&#60;/code&#62;,
the bound checks are eliminated
because the compiler is confident that the access is safe by construction.
If you make a mistake that would lead to unsafe access,
Swift will refuse to compile your code.
But &#60;code&#62;Span&#60;/code&#62; is a language feature that requires a new language runtime.
You cannot use it on older operating systems.
There are other, older ways to avoid bound checks,
using &#60;code&#62;UnsafeBufferPointer&#60;/code&#62;s.
The problem with doing so is that you are responsible for ensuring that the access is safe.
In particular, the point of access must occur within a valid scope for the pointer.
A piece of the parser using such an API may look like this:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;func skipWhitespace(
    bytes: UnsafeBufferPointer&#38;lt;UTF8.CodeUnit&#38;gt;,
    remainingBytes: inout Range&#38;lt;Int&#38;gt;,
) {
    var i = remainingBytes.lowerBound
    while i &#38;lt; bytes.count {
        if !isWhitespace(bytes[i]) { break }
        i += 1
    }
    remainingBytes = i ..&#38;lt; remainingBytes.upperBound
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;But WAIT! This code using the buffer pointer looks extremely similar to the &#60;code&#62;Span&#60;/code&#62; version!
And if you think carefully,
the requirement of maintaining a valid scope for the &#60;code&#62;UnsafeBufferPointer&#60;/code&#62; is already &#60;em&#62;enforced&#60;/em&#62; for any &#60;code&#62;Span&#60;/code&#62;, syntactically!&#60;/p&#62;
&#60;p&#62;Enter &#60;a href=&#34;https://nshipster.com/swift-gyb/&#34;&#62;gyb&#60;/a&#62;, a script that Swift uses to generate repetitive code in the compiler.
In TOMLDecoder 0.4,
the parser implementation uses it to generate two versions of the same set of parsing logic:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;%{
configs = [
    (&#38;quot;Span&#38;lt;UInt8&#38;gt;&#38;quot;, &#38;quot;@available(iOS 26, macOS 26, watchOS 26, tvOS 26, visionOS 26, *)&#38;quot;),
    (&#38;quot;UnsafeBufferPointer&#38;lt;UInt8&#38;gt;&#38;quot;, &#38;quot;@available(iOS 13, macOS 10.15, watchOS 6, tvOS 13, visionOS 1, *)&#38;quot;),
]
}%
% for byte_type, availability in configs:
${availability}
func parse(bytes: ${byte_type}) throws -&#38;gt; TOMLTable {
	// same code
}
% end
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;... and there&#39;s a single place that checks for the OS at runtime:&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;let source: String = ... // TOML string
if #available(iOS 26, macOS 26, watchOS 26, tvOS 26, visionOS 26, *) {
    let bytes = source.utf8Span.span
    try parse(bytes: bytes)
} else {
    try source.withUTF8 { try parse(bytes: $0) }
}
&#60;/code&#62;&#60;/pre&#62;
&#60;p&#62;The beauty here is that
the compiler does all the work to ensure that access to the &#60;code&#62;Span&#60;/code&#62;
as well as the buffer pointer is safe,
because the logic that does the accessing is identical, thanks to &#60;code&#62;gyb&#60;/code&#62;.&#60;/p&#62;
&#60;h2&#62;Conclusion&#60;/h2&#62;
&#60;p&#62;In reality, there are a ton of other optimizations applied in TOMLDecoder 0.4.
For example,
instead of doing dictionary lookups,
looking things up in a TOMLDocument actually involves a linear search.
I know, I know, this goes against what we were taught in CS.
But on modern computers,
and for typical sizes of TOML documents,
a linear search is often faster than computing a hash value
and doing the subsequent lookup.&#60;/p&#62;
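&#60;p&#62;Here&#39;s a sketch of why (illustrative, not the library&#39;s internals): for a table with a handful of entries, scanning a contiguous array is cache-friendly and skips hashing entirely.&#60;/p&#62;
&#60;pre&#62;&#60;code class=&#34;language-swift&#34;&#62;// For small N, a linear scan over contiguous storage often beats a
// dictionary, which must hash the key and locate a bucket before it
// can compare anything.
struct SmallTable {
    var entries: [(key: [UInt8], value: Int)] = []

    func value(forKey key: [UInt8]) -&#38;gt; Int? {
        for entry in entries where entry.key == key {
            return entry.value
        }
        return nil
    }
}
&#60;/code&#62;&#60;/pre&#62;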
&#60;p&#62;As part of the release,
the project also gained a bunch of infra improvements.&#60;/p&#62;
&#60;ul&#62;
&#60;li&#62;It has a &#60;a href=&#34;https://www.swift.org/documentation/docc/&#34;&#62;DocC&#60;/a&#62;-based &#60;a href=&#34;https://dduan.github.io/TOMLDecoder/main/documentation/tomldecoder/&#34;&#62;documentation site&#60;/a&#62;.&#60;/li&#62;
&#60;li&#62;The entirety of the &#60;a href=&#34;https://github.com/toml-lang/toml-test&#34;&#62;official test suite&#60;/a&#62; is now programmatically imported as unit tests.&#60;/li&#62;
&#60;li&#62;The source code style is now enforced by &#60;a href=&#34;https://github.com/nicklockwood/SwiftFormat&#34;&#62;swiftformat&#60;/a&#62;.&#60;/li&#62;
&#60;li&#62;Platform checks are more comprehensive and modern on CI.&#60;/li&#62;
&#60;li&#62;Benchmarks are now modernized with &#60;a href=&#34;https://github.com/ordo-one/package-benchmark&#34;&#62;ordo-one/package-benchmark&#60;/a&#62;.&#60;/li&#62;
&#60;/ul&#62;
&#60;p&#62;I think of this release as preparation for an eventual 1.0 release,
which will support the &#60;a href=&#34;https://forums.swift.org/t/the-future-of-serialization-deserialization-apis/78585/171&#34;&#62;new deserialization APIs from Swift&#60;/a&#62;.&#60;/p&#62;
&#60;p&#62;Even though I went through some optimizations for speed in this post,
I still have a bunch of ideas I want to try to squeeze out more performance gains.
That&#39;s exciting.&#60;/p&#62;
</description>
                <pubDate>Wed, 10 Dec 2025 17:44:34 -0800</pubDate>
                <link>https://duan.ca/2025/12/10/TOMLDecoder-0.4.1/</link>
                <guid isPermaLink="true">https://duan.ca/2025/12/10/TOMLDecoder-0.4.1/</guid>
            </item>
    </channel>
</rss>